We deployed 100 reinforcement studying (RL)-controlled vehicles into rush-hour freeway site visitors to clean congestion and scale back gas consumption for everybody. Our purpose is to sort out “stop-and-go” waves, these irritating slowdowns and speedups that normally haven’t any clear trigger however result in congestion and important power waste. To coach environment friendly flow-smoothing controllers, we constructed quick, data-driven simulations that RL brokers work together with, studying to maximise power effectivity whereas sustaining throughput and working safely round human drivers.
Total, a small proportion of well-controlled autonomous automobiles (AVs) is sufficient to considerably enhance site visitors circulation and gas effectivity for all drivers on the street. Furthermore, the skilled controllers are designed to be deployable on most fashionable automobiles, working in a decentralized method and counting on customary radar sensors. In our newest paper, we discover the challenges of deploying RL controllers on a large-scale, from simulation to the sector, throughout this 100-car experiment.
The challenges of phantom jams
A stop-and-go wave shifting backwards by way of freeway site visitors.
For those who drive, you’ve absolutely skilled the frustration of stop-and-go waves, these seemingly inexplicable site visitors slowdowns that seem out of nowhere after which all of a sudden clear up. These waves are sometimes brought on by small fluctuations in our driving habits that get amplified by way of the circulation of site visitors. We naturally regulate our pace primarily based on the automobile in entrance of us. If the hole opens, we pace as much as sustain. In the event that they brake, we additionally decelerate. However because of our nonzero response time, we would brake only a bit tougher than the automobile in entrance. The following driver behind us does the identical, and this retains amplifying. Over time, what began as an insignificant slowdown turns right into a full cease additional again in site visitors. These waves transfer backward by way of the site visitors stream, resulting in important drops in power effectivity because of frequent accelerations, accompanied by elevated CO2 emissions and accident threat.
And this isn’t an remoted phenomenon! These waves are ubiquitous on busy roads when the site visitors density exceeds a essential threshold. So how can we handle this drawback? Conventional approaches like ramp metering and variable pace limits try and handle site visitors circulation, however they usually require expensive infrastructure and centralized coordination. A extra scalable method is to make use of AVs, which may dynamically regulate their driving habits in real-time. Nonetheless, merely inserting AVs amongst human drivers isn’t sufficient: they need to additionally drive in a wiser means that makes site visitors higher for everybody, which is the place RL is available in.
Elementary diagram of site visitors circulation. The variety of vehicles on the street (density) impacts how a lot site visitors is shifting ahead (circulation). At low density, including extra vehicles will increase circulation as a result of extra automobiles can move by way of. However past a essential threshold, vehicles begin blocking one another, resulting in congestion, the place including extra vehicles truly slows down general motion.
Reinforcement studying for wave-smoothing AVs
RL is a strong management method the place an agent learns to maximise a reward sign by way of interactions with an setting. The agent collects expertise by way of trial and error, learns from its errors, and improves over time. In our case, the setting is a mixed-autonomy site visitors situation, the place AVs be taught driving methods to dampen stop-and-go waves and scale back gas consumption for each themselves and close by human-driven automobiles.
Coaching these RL brokers requires quick simulations with lifelike site visitors dynamics that may replicate freeway stop-and-go habits. To realize this, we leveraged experimental information collected on Interstate 24 (I-24) close to Nashville, Tennessee, and used it to construct simulations the place automobiles replay freeway trajectories, creating unstable site visitors that AVs driving behind them be taught to clean out.
Simulation replaying a freeway trajectory that displays a number of stop-and-go waves.
We designed the AVs with deployment in thoughts, guaranteeing that they’ll function utilizing solely fundamental sensor details about themselves and the automobile in entrance. The observations encompass the AV’s pace, the pace of the main automobile, and the house hole between them. Given these inputs, the RL agent then prescribes both an instantaneous acceleration or a desired pace for the AV. The important thing benefit of utilizing solely these native measurements is that the RL controllers will be deployed on most fashionable automobiles in a decentralized means, with out requiring further infrastructure.
Reward design
Essentially the most difficult half is designing a reward perform that, when maximized, aligns with the completely different aims that we want the AVs to realize:
- Wave smoothing: Scale back stop-and-go oscillations.
- Vitality effectivity: Decrease gas consumption for all automobiles, not simply AVs.
- Security: Guarantee affordable following distances and keep away from abrupt braking.
- Driving consolation: Keep away from aggressive accelerations and decelerations.
- Adherence to human driving norms: Guarantee a “regular” driving habits that doesn’t make surrounding drivers uncomfortable.
Balancing these aims collectively is troublesome, as appropriate coefficients for every time period have to be discovered. As an illustration, if minimizing gas consumption dominates the reward, RL AVs be taught to come back to a cease in the course of the freeway as a result of that’s power optimum. To stop this, we launched dynamic minimal and most hole thresholds to make sure protected and affordable habits whereas optimizing gas effectivity. We additionally penalized the gas consumption of human-driven automobiles behind the AV to discourage it from studying a egocentric habits that optimizes power financial savings for the AV on the expense of surrounding site visitors. Total, we intention to strike a steadiness between power financial savings and having an affordable and protected driving habits.
Simulation outcomes
Illustration of the dynamic minimal and most hole thresholds, inside which the AV can function freely to clean site visitors as effectively as attainable.
The everyday habits realized by the AVs is to keep up barely bigger gaps than human drivers, permitting them to soak up upcoming, presumably abrupt, site visitors slowdowns extra successfully. In simulation, this method resulted in important gas financial savings of as much as 20% throughout all street customers in essentially the most congested eventualities, with fewer than 5% of AVs on the street. And these AVs don’t need to be particular automobiles! They will merely be customary shopper vehicles outfitted with a sensible adaptive cruise management (ACC), which is what we examined at scale.
Smoothing habits of RL AVs. Crimson: a human trajectory from the dataset. Blue: successive AVs within the platoon, the place AV 1 is the closest behind the human trajectory. There may be sometimes between 20 and 25 human automobiles between AVs. Every AV doesn’t decelerate as a lot or speed up as quick as its chief, resulting in reducing wave amplitude over time and thus power financial savings.
100 AV discipline take a look at: deploying RL at scale


Our 100 vehicles parked at our operational heart throughout the experiment week.
Given the promising simulation outcomes, the pure subsequent step was to bridge the hole from simulation to the freeway. We took the skilled RL controllers and deployed them on 100 automobiles on the I-24 throughout peak site visitors hours over a number of days. This huge-scale experiment, which we known as the MegaVanderTest, is the biggest mixed-autonomy traffic-smoothing experiment ever carried out.
Earlier than deploying RL controllers within the discipline, we skilled and evaluated them extensively in simulation and validated them on the {hardware}. Total, the steps in the direction of deployment concerned:
- Coaching in data-driven simulations: We used freeway site visitors information from I-24 to create a coaching setting with lifelike wave dynamics, then validate the skilled agent’s efficiency and robustness in a wide range of new site visitors eventualities.
- Deployment on {hardware}: After being validated in robotics software program, the skilled controller is uploaded onto the automotive and is ready to management the set pace of the automobile. We function by way of the automobile’s on-board cruise management, which acts as a lower-level security controller.
- Modular management framework: One key problem throughout the take a look at was not getting access to the main automobile info sensors. To beat this, the RL controller was built-in right into a hierarchical system, the MegaController, which mixes a pace planner information that accounts for downstream site visitors situations, with the RL controller as the ultimate choice maker.
- Validation on {hardware}: The RL brokers had been designed to function in an setting the place most automobiles had been human-driven, requiring sturdy insurance policies that adapt to unpredictable habits. We confirm this by driving the RL-controlled automobiles on the street beneath cautious human supervision, making adjustments to the management primarily based on suggestions.

Every of the 100 vehicles is related to a Raspberry Pi, on which the RL controller (a small neural community) is deployed.

The RL controller immediately controls the onboard adaptive cruise management (ACC) system, setting its pace and desired following distance.
As soon as validated, the RL controllers had been deployed on 100 vehicles and pushed on I-24 throughout morning rush hour. Surrounding site visitors was unaware of the experiment, guaranteeing unbiased driver habits. Knowledge was collected throughout the experiment from dozens of overhead cameras positioned alongside the freeway, which led to the extraction of thousands and thousands of particular person automobile trajectories by way of a pc imaginative and prescient pipeline. Metrics computed on these trajectories point out a pattern of lowered gas consumption round AVs, as anticipated from simulation outcomes and former smaller validation deployments. As an illustration, we are able to observe that the nearer persons are driving behind our AVs, the much less gas they seem to eat on common (which is calculated utilizing a calibrated power mannequin):
Common gas consumption as a perform of distance behind the closest engaged RL-controlled AV within the downstream site visitors. As human drivers get additional away behind AVs, their common gas consumption will increase.
One other solution to measure the influence is to measure the variance of the speeds and accelerations: the decrease the variance, the much less amplitude the waves ought to have, which is what we observe from the sector take a look at information. Total, though getting exact measurements from a considerable amount of digicam video information is difficult, we observe a pattern of 15 to twenty% of power financial savings round our managed vehicles.
Knowledge factors from all automobiles on the freeway over a single day of the experiment, plotted in speed-acceleration house. The cluster to the left of the purple line represents congestion, whereas the one on the appropriate corresponds to free circulation. We observe that the congestion cluster is smaller when AVs are current, as measured by computing the realm of a tender convex envelope or by becoming a Gaussian kernel.
Last ideas
The 100-car discipline operational take a look at was decentralized, with no specific cooperation or communication between AVs, reflective of present autonomy deployment, and bringing us one step nearer to smoother, extra energy-efficient highways. But, there’s nonetheless huge potential for enchancment. Scaling up simulations to be sooner and extra correct with higher human-driving fashions is essential for bridging the simulation-to-reality hole. Equipping AVs with further site visitors information, whether or not by way of superior sensors or centralized planning, might additional enhance the efficiency of the controllers. As an illustration, whereas multi-agent RL is promising for bettering cooperative management methods, it stays an open query how enabling specific communication between AVs over 5G networks might additional enhance stability and additional mitigate stop-and-go waves. Crucially, our controllers combine seamlessly with current adaptive cruise management (ACC) programs, making discipline deployment possible at scale. The extra automobiles outfitted with sensible traffic-smoothing management, the less waves we’ll see on our roads, which means much less air pollution and gas financial savings for everybody!
Many contributors took half in making the MegaVanderTest occur! The total record is offered on the CIRCLES challenge web page, together with extra particulars concerning the challenge.
Learn extra: [paper]