How science inspires our ETA models

Mohamad Elmasri

Have you ever driven alongside another vehicle for an extended period? You’ve likely experienced this peculiar phenomenon: despite sharing the same route and traffic signals, you inevitably encounter a red light while the other vehicle passes through seconds earlier. For a moment, you might think they’ll reach their destination first. However, as you continue, you’re surprised to find them waiting at another red light just a few blocks ahead. This dance continues for a while. The ‘notes’ of this ‘song’ are the micro-random events inherent in traffic: a flock of pigeons crossing the road, a cyclist approaching, a sudden lane change by the vehicle in front. Some factors are more deterministic, such as weather conditions, road closures, or construction delays; others less so.

As a seasoned data scientist, your mission is to uncover hidden patterns within chaotic systems and translate them into mathematical insights. These insights inform the decisions, both big and small, of engineering and science organizations, and support their ongoing operational strategy. This blog translates a seemingly random traffic pattern into comprehensible behavior, which we then use to build a statistical model for travel time.

Let’s get back to the core question: how do seemingly chaotic patterns help us build models? One observation we made repeatedly is that the distance of a ride significantly influences our understanding of travel time uncertainty. The longer you are on the road, the more traffic you can expect to encounter, which suggests that travel uncertainty should be substantially higher for longer rides. However, we found the opposite: travel time predictions are often more accurate for longer journeys. For instance, ETAs between your house and the next town tend to be more precise than those for a trip to the coffee shop next door. This relative travel uncertainty is particularly pronounced for short rides during rush hour, where unexpected traffic can significantly impact your ETA to the coffee shop. Conversely, when driving to the next city, the accumulated delays from rush hour congestion tend to smooth out over the longer distance. As a result, two rides following the same route can ultimately arrive at similar times: if one ride encounters traffic at a specific spot, the other is likely to encounter traffic at some other spot. Where congestion strikes differs between rides, yet its overall likelihood, and hence its cumulative effect, is similar. This evening-out becomes more pronounced for longer journeys.

What we’ve just described is a macro-level example of the ‘traffic light dance’ we discussed at the beginning of this blog. Rush hour is simply a larger-scale traffic event compared to a flock of pigeons crossing the road. We can visualize this phenomenon by plotting the trajectories of two rides taking a long route. If we represent congestion (like traffic lights) with red circles, these circles will appear at different points along each ride’s trajectory, leading to short-term variations in travel time.

Two vehicles traveling a 100-edge route, with τ_i(e), i = 1, 2, denoting the travel time accumulated up to road e. Red circles represent congestion events.

As illustrated in the figure above, the travel time up to road ‘e’ varies between the two rides. However, these short-term discrepancies tend to even out over longer distances, as demonstrated by the occasional intersections of the two curves, visually representing the ‘dance’ we described earlier.
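To make the picture concrete, here is a minimal simulation sketch of that figure, not our production model: each road segment gets a baseline travel time plus occasional random congestion delays, and we track the cumulative travel time for two rides over the same 100-edge route. All parameters (baseline seconds per segment, congestion probability and delay) are illustrative assumptions.

```python
# Minimal, illustrative simulation of the two-ride figure above.
# All parameters (baseline time, congestion probability/delay) are assumed.
import numpy as np

rng = np.random.default_rng(42)

N_EDGES = 100          # length of the shared route, in road segments
BASE_TIME = 30.0       # assumed baseline seconds per segment
CONGESTION_PROB = 0.1  # assumed chance a segment is congested for a given ride
CONGESTION_DELAY = 60  # assumed mean extra seconds when congestion hits

def simulate_ride():
    """Return cumulative travel times tau(e), e = 1..N_EDGES, and the congested edges."""
    base = rng.normal(BASE_TIME, 5.0, size=N_EDGES).clip(min=1.0)
    congested = rng.random(N_EDGES) < CONGESTION_PROB
    delays = congested * rng.exponential(CONGESTION_DELAY, size=N_EDGES)
    return np.cumsum(base + delays), congested

tau_1, hits_1 = simulate_ride()
tau_2, hits_2 = simulate_ride()

# The two cumulative curves repeatedly cross: each ride hits congestion at
# different edges, yet the totals land in the same ballpark (the 'dance').
print(f"ride 1 total: {tau_1[-1]:.0f}s,  ride 2 total: {tau_2[-1]:.0f}s")
print("ride 1 congested edges:", np.flatnonzero(hits_1))
print("ride 2 congested edges:", np.flatnonzero(hits_2))
```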

This insight is valuable, but how can we translate it into a more practical framework? First, let’s express it mathematically. These findings suggest that the difference in travel time between rides following the same route may possess strong statistical properties. While individual segments of a trip may exhibit randomness, the cumulative effect of this randomness can be highly predictable and statistically well-behaved.

Let’s expand our analysis by examining a larger number of rides and observing the impact on average travel time along the route. We’ll track the cumulative average travel time over the first 1, 10, 50, and 100 kilometers. The average is calculated by dividing the total travel time by the number of road segments (edges) within the respective distance. This allows us to visualize how travel time fluctuations change as the distance increases, demonstrating the shift from short-term variability to more consistent long-term trends.
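A rough sketch of this computation on simulated data (the real analysis uses observed rides; here segment lengths of roughly 100 m are assumed, so 1 km corresponds to about 10 edges and 100 km to about 1,000):

```python
# Illustrative sketch: distribution of the per-segment average travel time
# over the first 1, 10, 50, and 100 km, computed from simulated rides.
# Assumption: ~100 m segments, so 1 km corresponds to roughly 10 edges.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

SEGMENTS_PER_KM = 10
N_RIDES = 5_000

def per_segment_times(n_edges):
    """Simulated per-segment times: baseline plus rare congestion delays."""
    base = rng.normal(10.0, 2.0, size=n_edges).clip(min=1.0)
    delays = (rng.random(n_edges) < 0.1) * rng.exponential(30.0, size=n_edges)
    return base + delays

for km in (1, 10, 50, 100):
    n_edges = km * SEGMENTS_PER_KM
    avgs = np.array([per_segment_times(n_edges).mean() for _ in range(N_RIDES)])
    # Both the spread and the skewness of the per-ride average shrink
    # as the distance grows.
    print(f"{km:>3} km: mean={avgs.mean():5.2f}s/seg  "
          f"std={avgs.std():4.2f}  skew={stats.skew(avgs):+.2f}")
```

In the real analysis, these per-ride averages are histogrammed for each distance; the same shrinking spread and skew are what make the long-distance distribution look symmetric.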

Given our intuition above, it’s unsurprising that the average travel time converges as the distance increases. The first kilometer exhibits a heavily skewed distribution with significant variability: a single congestion event within that first kilometer can have a substantial impact on its overall travel time. As the distance grows, however, the distribution of average travel time per road segment gradually becomes more symmetric.

This pattern reveals two key insights:

  • Travel time along a route converges towards an asymptotic (long-term) distribution as distance increases.
  • Aggregating travel data from multiple trips to estimate road segment speeds can approximate this distribution.

What is the nature of this asymptotic distribution, and how can we formalize it mathematically? Essentially, we’ve been examining rides along a consistent route, calculating the average travel time per road segment. This involves summing the travel times for each segment along the route and dividing by the number of road segments (n), which serves as a proxy for distance.
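In symbols (a notational sketch, where t(e_k) stands for the travel time on the k-th road segment of the route, as opposed to the cumulative τ_i(e) in the figure), the quantity we have been averaging is:

```latex
% Per-segment average travel time over a route of n road segments,
% where t(e_k) is the (random) travel time on the k-th segment.
\bar{\tau}_n = \frac{1}{n}\sum_{k=1}^{n} t(e_k)
```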

Doesn’t this pattern resemble the Central Limit Theorem (CLT), which states that the suitably normalized sum of independent, identically distributed random variables converges to a normal distribution? Can we apply a similar principle here? While our variables (travel times across road segments) are not strictly independent, we observe an empirical convergence towards normality. This suggests that, even though individual road segment travel times exhibit variability due to stochastic congestion, their aggregated effect across numerous segments approximates a normal distribution. That is,

$$\sqrt{n}\,\bigl(\bar{\tau}_n - \mu\bigr) \;\xrightarrow{d}\; \mathcal{N}\bigl(0,\, \sigma^2\bigr) \quad \text{as } n \to \infty,$$

i.e., over routes with many segments the per-segment average is approximately normal with mean μ and variance σ²/n, where μ is the mean per-segment travel time and σ² its effective (long-run) variance.

This is a significant finding about the long-term statistical stability of traffic patterns.
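As a quick sanity check of that claim (again on simulated data, not our fleet’s), the sketch below makes per-segment times both skewed and weakly dependent through an assumed AR(1)-style latent congestion state, and shows that the route-level average still drifts towards a Gaussian shape as routes get longer:

```python
# Illustrative check of approximate normality under skewed, weakly dependent
# per-segment times (an assumed AR(1) latent congestion state drives them).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def route_average(n_edges, rho=0.6):
    """Average per-segment time for one ride with correlated congestion."""
    z = np.empty(n_edges)
    z[0] = rng.normal()
    for k in range(1, n_edges):                 # AR(1) latent state
        z[k] = rho * z[k - 1] + np.sqrt(1 - rho**2) * rng.normal()
    times = 10.0 + np.exp(z)                    # skewed per-segment times
    return times.mean()

for n_edges in (10, 100, 1000):
    sample = np.array([route_average(n_edges) for _ in range(2_000)])
    # Skewness and excess kurtosis move towards 0 (their Gaussian values)
    # as the number of segments grows, despite the dependence.
    print(f"n={n_edges:>4}: skew={stats.skew(sample):+.2f}  "
          f"excess kurtosis={stats.kurtosis(sample):+.2f}")
```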

Formally satisfying the CLT conditions is mathematically involved, as it requires a more in-depth analysis of travel time, an understanding of the types of dependencies among road segments, and a way to handle them mathematically. A great topic for the next blog in the series!

For now, let’s further validate our intuition with data by looking at one of our most frequent routes between 8 and 9 AM in the Bay Area. This ride goes from Howard St and 5th St to 17th St and De Haro St in San Francisco.
