The waiting time paradox, or why is my bus always late?

Source: Wikipedia, license CC-BY-SA

If you often travel by public transport, you have probably run into this situation:

You arrive at the bus stop. The sign says that the bus runs every 10 minutes. You note the time. Finally, after 11 minutes, the bus arrives and you think: why do I always have such bad luck?

The idea is that if buses arrive every 10 minutes and you arrive at a random time, then your average wait should be about 5 minutes. In reality, though, buses do not arrive exactly on schedule, so you may wait longer. It turns out that under some reasonable assumptions one reaches a striking conclusion:

**While waiting for a bus that comes on average every 10 minutes, your average waiting time will be 10 minutes.**

This is what is sometimes called the waiting time paradox. (See also this post on StackExchange or this thread on Twitter.)

The exponential distribution of intervals implies that the arrival times follow a Poisson process. To test this reasoning, we check for another property of a Poisson process: that the number of arrivals during a fixed span of time is Poisson-distributed. To do this, we divide the simulated arrivals into hourly bins:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import poisson

    # bus_arrival_times and tau (mean interval, in minutes) are defined
    # by the simulation earlier in the article

    # count the number of arrivals in 1-hour bins
    binsize = 60
    binned_arrivals = np.bincount((bus_arrival_times // binsize).astype(int))
    x = np.arange(20)

    # plot the results
    plt.hist(binned_arrivals, bins=x - 0.5, density=True, alpha=0.5, label='simulation')
    plt.plot(x, poisson(binsize / tau).pmf(x), 'ok', label='Poisson prediction')
    plt.xlabel('Number of arrivals per hour')
    plt.ylabel('frequency')
    plt.legend();

Substituting this result into the previous formula gives the average waiting time for passengers at the bus stop:
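Written out, the substitution looks roughly as follows. This is a sketch that assumes the "previous formula" is the standard renewal-theory expression for the expected wait of a passenger arriving at a random time, with $T$ the interval between buses and $\tau$ its mean (matching `tau` in the code). The symbols $W$ and $T$ are labels chosen here for illustration.

```latex
E[W] = \frac{E[T^2]}{2\,E[T]}, \qquad
E[T] = \tau, \quad E[T^2] = 2\tau^2 \ \text{(exponential intervals)}
\;\Longrightarrow\;
E[W] = \frac{2\tau^2}{2\tau} = \tau .
```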

For buses arriving according to a Poisson process, the expected waiting time is identical to the average interval between arrivals.

One way to reason about this is that the Poisson process is a memoryless process, that is, the history of events has no bearing on the expected time to the next event. So when you arrive at the bus stop, the average waiting time until the next bus is always the same: in our case 10 minutes, no matter how much time has passed since the last bus. Nor does it matter how long you have already waited: the expected time until the next bus is always exactly 10 minutes. In a Poisson process, you get no “credit” for time spent waiting.

Waiting times in reality

The above is all well and good if real bus arrivals are actually described by a Poisson process, but are they?

Source: Seattle transit map

Let us try to determine how the waiting time paradox squares with reality. To do this, we examine some data, available for download here: arrival_times.csv (3 MB CSV file). The dataset contains the scheduled and actual arrival times for RapidRide buses C, D, and E at the 3rd & Pike bus stop in downtown Seattle. The data were recorded in the second quarter of 2016 (many thanks to Mark Hallenbeck of the Washington State Transportation Center for this file!).

    import pandas as pd

    df = pd.read_csv('arrival_times.csv')
    df = df.dropna(axis=0, how='any')  # drop rows with any missing value
    df.head()

         OPD_DATE  VEHICLE_ID  RTE DIR   TRIP_ID  STOP_ID                STOP_NAME SCH_STOP_TM ACT_STOP_TM
    0  2016-03-26        6201  673   S  30908177      431  3RD AVE & PIKE ST (431)    01:11:57    01:13:19
    1  2016-03-26        6201  673   S  30908033      431  3RD AVE & PIKE ST (431)    23:19:57    23:16:13
    2  2016-03-26        6201  673   S  30908028      431  3RD AVE & PIKE ST (431)    21:19:57    21:18:46
    3  2016-03-26        6201  673   S  30908019      431  3RD AVE & PIKE ST (431)    19:04:57    19:01:49
    4  2016-03-26        6201  673   S  30908252      431  3RD AVE & PIKE ST (431)    16:42:57    16:42:39

I chose the RapidRide data because, for most of the day, the buses run at regular intervals of 10 to 15 minutes, not to mention the fact that I am a frequent passenger on the C line.

Data cleaning

To begin with, we will do a little data cleaning to get the data into a convenient form:

    # combine date and time into a single timestamp
    df['scheduled'] = pd.to_datetime(df['OPD_DATE'] + ' ' + df['SCH_STOP_TM'])
    df['actual'] = pd.to_datetime(df['OPD_DATE'] + ' ' + df['ACT_STOP_TM'])

    # if scheduled and actual times span midnight, the actual date needs to be adjusted
    minute = np.timedelta64(1, 'm')
    hour = 60 * minute
    diff_hrs = (df['actual'] - df['scheduled']) / hour
    df.loc[diff_hrs > 20, 'actual'] -= 24 * hour
    df.loc[diff_hrs < -20, 'actual'] += 24 * hour
    df['minutes_late'] = (df['actual'] - df['scheduled']) / minute

    # map internal route codes to external route letters
    df['route'] = df['RTE'].replace({673: 'C', 674: 'D', 675: 'E'}).astype('category')
    df['direction'] = df['DIR'].replace({'N': 'northbound', 'S': 'southbound'}).astype('category')

    # extract useful columns
    df = df[['route', 'direction', 'scheduled', 'actual', 'minutes_late']].copy()

    df.head()

      route   direction           scheduled              actual  minutes_late
    0     C  southbound 2016-03-26 01:11:57 2016-03-26 01:13:19      1.366667
    1     C  southbound 2016-03-26 23:19:57 2016-03-26 23:16:13     -3.733333
    2     C  southbound 2016-03-26 21:19:57 2016-03-26 21:18:46     -1.183333
    3     C  southbound 2016-03-26 19:04:57 2016-03-26 19:01:49     -3.133333
    4     C  southbound 2016-03-26 16:42:57 2016-03-26 16:42:39     -0.300000

How late are the buses?

There are six datasets in this table: the northbound and southbound directions for each of routes C, D, and E. To get a feel for their characteristics, let's plot a histogram of the actual minus the scheduled arrival time for each of the six:

    import seaborn as sns

    g = sns.FacetGrid(df, row="direction", col="route")
    g.map(plt.hist, "minutes_late", bins=np.arange(-10, 20))
    g.set_titles('{col_name} {row_name}')
    g.set_axis_labels('minutes late', 'number of buses');

You might expect buses to keep closer to schedule near the beginning of a route and to deviate more toward the end. The data bear this out: our stop is near the beginning of the southbound C line and of the northbound D and E lines, and near the end of each line in the opposite direction.

Scheduled and observed intervals

Next, look at the observed and scheduled intervals between buses for these six routes. We start by using the groupby functionality in Pandas to compute these intervals:

    def compute_headway(scheduled):
        """Return the interval in minutes between consecutive arrivals."""
        minute = np.timedelta64(1, 'm')
        return scheduled.sort_values().diff() / minute

    grouped = df.groupby(['route', 'direction'])
    df['actual_interval'] = grouped['actual'].transform(compute_headway)
    df['scheduled_interval'] = grouped['scheduled'].transform(compute_headway)
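To see what a grouped transform of this kind produces, here is a small self-contained sketch on made-up timestamps. The `toy` frame, its values, and the helper name `headway_minutes` are invented for illustration and are not part of arrival_times.csv.

```python
import numpy as np
import pandas as pd

# Made-up arrival times for two hypothetical routes (not from arrival_times.csv)
toy = pd.DataFrame({
    'route': ['C', 'C', 'C', 'D', 'D'],
    'actual': pd.to_datetime([
        '2016-03-26 08:00:00', '2016-03-26 08:22:00', '2016-03-26 08:10:00',
        '2016-03-26 09:00:00', '2016-03-26 09:15:00',
    ]),
})

def headway_minutes(times):
    # sort within the group, then difference consecutive arrivals
    return times.sort_values().diff() / np.timedelta64(1, 'm')

# transform aligns the result back to the original row order
toy['interval'] = toy.groupby('route')['actual'].transform(headway_minutes)
print(toy)
```

The first arrival in each group has no predecessor, so its interval is NaN. That is why the plotting code in the article calls `dropna()` before drawing the histograms.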

    g = sns.FacetGrid(df.dropna(), row="direction", col="route")
    g.map(plt.hist, "actual_interval", bins=np.arange(50) + 0.5)
    g.set_titles('{col_name} {row_name}')
    g.set_axis_labels('actual interval (minutes)', 'number of buses');

It is already clear that these do not look much like the exponential distribution of our model, but that does not tell us much yet: the distributions may be affected by non-constant scheduled arrival intervals.

Let's repeat the plots, this time using the scheduled rather than the observed arrival intervals:

    g = sns.FacetGrid(df.dropna(), row="direction", col="route")
    g.map(plt.hist, "scheduled_interval", bins=np.arange(20) - 0.5)
    g.set_titles('{col_name} {row_name}')
    g.set_axis_labels('scheduled interval (minutes)', 'frequency');

This shows that buses run at different scheduled intervals over the course of the week, so we cannot assess the waiting time paradox directly from the raw data for the stop.

Building uniform schedules

Although the official schedule does not give uniform intervals, there are a few specific intervals that have a large number of buses: for example, nearly 2000 northbound E-line buses with a scheduled interval of 10 minutes. To find out whether the waiting time paradox applies, let's group the data by route, direction, and scheduled interval, and then re-stack the similar arrivals as if they had happened in sequence. This should preserve all the relevant characteristics of the source data while making it easier to compare directly with the predictions of the waiting time paradox.

    def stack_sequence(data):
        # first, sort by scheduled time
        data = data.sort_values('scheduled')

        # re-stack data and recompute relevant quantities
        data['scheduled'] = data['scheduled_interval'].cumsum()
        data['actual'] = data['scheduled'] + data['minutes_late']
        data['actual_interval'] = data['actual'].sort_values().diff()
        return data

    subset = df[df.scheduled_interval.isin([10, 12, 15])]
    grouped = subset.groupby(['route', 'direction', 'scheduled_interval'])
    sequenced = grouped.apply(stack_sequence).reset_index(drop=True)
    sequenced.head()

      route   direction  scheduled  actual  minutes_late  actual_interval  scheduled_interval
    0     C  northbound          …       …             …              NaN                   …
    1     C  northbound          …       …             …                …                   …
    2     C  northbound          …       …            -…                …                   …
    3     C  northbound          …       …            -…                …                   …
    4     C  northbound          …       …             …                …                   …

With the cleaned data, we can plot the distribution of actual arrival intervals for each route and direction at each scheduled frequency:

    for route in ['C', 'D', 'E']:
        g = sns.FacetGrid(sequenced.query(f"route == '{route}'"),
                          row="direction", col="scheduled_interval")
        g.map(plt.hist, "actual_interval", bins=np.arange(40) + 0.5)
        g.set_titles('{row_name} ({col_name:.0f} min)')
        g.set_axis_labels('actual interval (min)', 'count')
        g.fig.set_size_inches(9, 4)  # figure width is an assumed value
        g.fig.suptitle(f'{route} line', y=1.05, fontsize=14)  # y offset is an assumed value

We see that for each route the distribution of observed intervals is nearly Gaussian. It peaks near the scheduled interval and has a standard deviation that is smaller near the beginning of the route (southbound for C, northbound for D and E) and larger near the end. Even by eye, it is clear that the actual arrival intervals definitely do not follow an exponential distribution, which is the basic assumption on which the waiting time paradox rests.

We can use the wait time simulation function from above to find the average wait time for each route, direction, and scheduled interval:

    grouped = sequenced.groupby(['route', 'direction', 'scheduled_interval'])
    sims = grouped['actual'].apply(simulate_wait_times)
    sims.apply(lambda times: "{0:.1f} +/- {1:.1f}".format(times.mean(), times.std()))

    route  direction   scheduled_interval
    C      northbound  10.0                  ….8 +/- …
                       12.0                  ….4 +/- …
                       15.0                  ….8 +/- …
           southbound  10.0                  ….2 +/- …
                       12.0                  ….8 +/- …
                       15.0                  ….4 +/- …
    D      northbound  10.0                  ….1 +/- …
                       12.0                  ….5 +/- …
                       15.0                  ….9 +/- …
           southbound  10.0                  ….7 +/- …
                       12.0                  ….5 +/- …
                       15.0                  ….8 +/- …
    E      northbound  10.0                  ….5 +/- …
                       12.0                  ….5 +/- …
                       15.0                  ….9 +/- …
           southbound  10.0                  ….8 +/- …
                       12.0                  ….3 +/- …
                       15.0                  ….7 +/- …
    Name: actual, dtype: object
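The `simulate_wait_times` function is defined earlier in the article and is not reproduced in this section. As a rough stand-in, here is a minimal sketch of how such a simulation might work, assuming passengers arrive uniformly at random and wait for the next bus. The name `simulate_wait_times_sketch`, its parameters, and the synthetic arrival data are all assumptions made for illustration, not the author's code.

```python
import numpy as np

def simulate_wait_times_sketch(arrival_times, n_passengers=100_000, seed=0):
    """Hypothetical stand-in: waits of passengers arriving uniformly at random."""
    rng = np.random.default_rng(seed)
    arrivals = np.sort(np.asarray(arrival_times, dtype=float))
    # passengers turn up uniformly between the first and the last bus
    passengers = rng.uniform(arrivals[0], arrivals[-1], n_passengers)
    # index of the first bus arriving at or after each passenger
    next_bus = np.searchsorted(arrivals, passengers, side='left')
    return arrivals[next_bus] - passengers

rng = np.random.default_rng(42)
tau = 10  # mean interval between buses, in minutes

# synthetic Poisson-process arrivals: exponential intervals with mean tau
poisson_arrivals = rng.exponential(tau, size=100_000).cumsum()
poisson_wait = simulate_wait_times_sketch(poisson_arrivals).mean()

# synthetic near-regular arrivals: Gaussian intervals, mean tau, std 2 minutes
regular_arrivals = rng.normal(tau, 2, size=100_000).cumsum()
regular_wait = simulate_wait_times_sketch(regular_arrivals).mean()

print(f"Poisson intervals: mean wait {poisson_wait:.1f} min")
print(f"Regular intervals: mean wait {regular_wait:.1f} min")
```

Under these assumptions the Poisson case should come out close to `tau`, while the near-regular case should sit a little above `tau / 2`. This is in line with the renewal-theory value $(\mu^2 + \sigma^2)/(2\mu)$ for intervals with mean $\mu$ and standard deviation $\sigma$.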

The average wait time is perhaps a minute or two more than half the scheduled interval, but it is not equal to the scheduled interval, as the waiting time paradox would imply. In other words, the inspection paradox is confirmed, but the waiting time paradox does not hold up in reality.

Conclusion

The waiting time paradox has been an interesting starting point for a discussion that touched on simulation, probability theory, and comparing statistical assumptions with reality. Although we confirmed that real-world bus routes are subject to some version of the inspection paradox, the analysis above shows quite convincingly that the core assumption behind the waiting time paradox, that bus arrivals follow the statistics of a Poisson process, is not well-founded.

In hindsight, this is not surprising: a Poisson process is a memoryless process, which assumes that the probability of an arrival is entirely independent of the time since the previous arrival. In reality, a well-run public transport system has deliberately structured timetables to avoid exactly this behavior: buses do not start their routes at random times during the day, but on a schedule chosen to serve passengers most efficiently.

The more important lesson is that you should be careful about the assumptions you bring to any data analysis task. Sometimes a Poisson process is a good description of arrival time data. But just because one type of data sounds like another type of data does not mean that assumptions valid for one are necessarily valid for the other. Often, assumptions that seem correct can lead to conclusions that are not true.

Author: weber
Publication date: 3-11-2018, 03:22
Category: Data Mining / Algorithms / Mathematics