Abstract
Keywords
Introduction
Findings from a 2022 study indicate that the transportation sector accounted for 27% of the energy consumption in the United States. 1 Specifically, petroleum (gasoline) consumption comprised about 52% of the total energy consumption, resulting in significant air pollutant emissions. This underscores the need for well-designed traffic control systems to mitigate fuel energy consumption (FEC) and air pollution for sustainability.2–4 The pursuit of sustainability has driven research into eco-driving strategies designed to reduce FEC rates (FEC per unit time). FEC rates can be calculated from factors such as acceleration, mass, drag coefficient, rolling coefficient, driveline efficiency, idling speed, and idling fuel mean pressure.5,6 Reducing FEC involves two interconnected goals: shorter travel time and lower FEC rates. Vehicles incur the highest FEC rates while idling and during frequent stops and starts, especially at traffic lights or in congestion. Therefore, maintaining a continuous traffic flow with minimal fluctuations in vehicle speed is essential for achieving lower FEC rates and shorter traffic delays, and is instrumental to effective eco-driving strategies. 7
Traditional traffic control relies on fixed modes for traffic light changes and manual rerouting, resulting in limited efficiency and a lack of feedback mechanisms. The current setup of traffic control systems poses challenges for developing eco-driving strategies in hybrid traffic networks encompassing autonomous vehicles (AV) and human-driven vehicles (HDV). These challenges stem partly from insufficient infrastructure for collecting, transmitting, and sharing real-time traffic data among vehicles, facilities, and traffic control centers, as well as for the subsequent decision-making by the agents involved. Furthermore, the intricate nature of existing traffic networks, with their diverse array of vehicles and facilities, complicates the development of a mathematical model that accurately characterizes them.
Current eco-driving strategies have addressed these challenges from various perspectives, including real-time artificial intelligence for traffic monitoring and 5th generation (5G) communication networks for rapid information sharing.8–10 Owing to the multifaceted nature of the eco-driving problem, a model-based deterministic strategy is challenging to develop. Meanwhile, data-driven approaches show promise, given the large amount of traffic data accumulated over the past decades.
Related work on reinforcement learning in traffic control
Model-free reinforcement learning (RL) has demonstrated its advantage in decision-making for traffic flow control by examining interactions among multiple agents and the environment.10–12 RL has been applied to optimize vehicle routes for reduced delay and vehicle accelerations for less FEC.13,14 RL algorithms have also been developed to reduce air pollutant emissions by reducing vehicles’ waiting time at road intersections.15,16 In a study on infrastructure-to-vehicle communications networks, 17 a single vehicle was considered as an agent, and the Q-learning (QL) algorithm was developed to minimize carbon dioxide emissions. Additionally, a recent eco-driving framework based on the deep Q-network (DQN) approach was presented to enhance the fuel efficiency of multiple vehicles in a traffic network with one horizontal road and one vertical road. 18
In addition to applications of RL in controlling vehicle routes or acceleration, traffic lights are also considered as agents to control traffic flow with RL algorithms. An RL-based control has been developed for smart traffic signals, to reduce traffic jams and improve traffic smoothness in a traffic grid consisting of three horizontal and three vertical roads. 19
With more AV running on the roads, they are also considered agents in RL algorithms for traffic control. In a recent study, a circular network with fixed traffic signal patterns at one spot was used to develop a deep deterministic policy gradient (DDPG) algorithm that minimizes the FEC of connected AV by controlling their acceleration. 20 Additionally, RL algorithms with a hybrid deep Q-learning and policy gradient (HDQPG) were developed to minimize the FEC of connected AV by controlling their acceleration in a traffic grid with one horizontal and five vertical roads. 21 Previous studies also explored a traffic flow containing both HDV and connected AV, using trust region policy optimization (TRPO) to reduce the FEC and emissions of both vehicle types. 22
While the above-mentioned RL-based controls have improved traffic smoothness by focusing on the actions of vehicles or traffic lights, the combined effect of AV and traffic signals on FEC has not been fully investigated.20–22
In this study, a novel eco-driving strategy was proposed by introducing a specific percentage of AV into the traffic flow of HDV in collaboration with smart traffic light signals to reduce the idling time of vehicles and improve the traffic smoothness in a scalable traffic network with user-defined horizontal and vertical roads for intersections. A model-free RL algorithm was developed to control the overall speed of all vehicles, resulting in a continuous traffic flow and reduction of the FEC of the vehicles in the network.
Method
The proposed RL algorithm determines the optimal actions of multiple agents, including AV and traffic lights in a dynamic traffic network with HDV to minimize the FEC rates of all vehicles. The traffic network and motion of all vehicles are simulated using the Simulation of Urban Mobility (SUMO) package. 23 The RL algorithm is implemented using Python and integrated into SUMO for simulation.
The selected traffic grid environment is inspired by the grid-like layout of Manhattan City. 24 Figures 1 and 2 display an open street map of the Manhattan traffic grid structure and its visualization in the SUMO environment, respectively.

Open street map of traffic grid structure in Manhattan City.

The grid structure of Manhattan City, simulated in the SUMO environment, is represented by the highlighted red region. The selected traffic network serves as the basis for our research, examining the role of AV combined with HDV in minimizing the FEC rates of all vehicles in the network.
Environment setup in SUMO
The traffic network is configured within an environment featuring
According to a recent study, AV account for 10% of all vehicles on the roads. 26 Accordingly, this study considers different penetration rates of AV (0%, 5%, 10%, and 20%) to assess their impact on traffic control. An RL controller commands the RL agents, namely AV and traffic lights, according to the policy at each time step. The speed and acceleration of AV are determined by the RL controller, while the motion of HDV is governed by the embedded “sim car-following” controller in the SUMO simulation. All vehicles are homogeneous with respect to their mass, size, and fuel economy model.
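The penetration-rate setup can be illustrated with a short sketch (the `assign_av` helper and the vehicle IDs are illustrative, not part of the SUMO or Flow API): a fixed fraction of the vehicle population is randomly designated as RL-controlled AV, and the remainder as HDV.

```python
import random

def assign_av(vehicle_ids, penetration_rate, seed=None):
    """Randomly mark a fraction of vehicles as AV (RL-controlled);
    the rest remain HDV under the car-following controller."""
    rng = random.Random(seed)
    n_av = round(len(vehicle_ids) * penetration_rate)
    av_ids = set(rng.sample(vehicle_ids, n_av))
    return {vid: ("AV" if vid in av_ids else "HDV") for vid in vehicle_ids}

# Example: a 10% penetration rate over 100 vehicles yields 10 AV.
roles = assign_av([f"veh_{i}" for i in range(100)], penetration_rate=0.10, seed=7)
n_av = sum(1 for r in roles.values() if r == "AV")
```

Fixing the seed makes the AV/HDV split reproducible across training runs, which simplifies comparisons between penetration levels.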
At each intersection of two roads, four-way traffic lights are defined as actuated agents with a controllable period for red, green, and yellow lights. With the setup of
In this study, we focus on a 3×3 traffic network, assuming uniform road lengths in all directions to facilitate simulation. Figure 3 illustrates a network with

(a) 3×3 traffic light grid environment. (b) Four-way single signalized intersection.
Reinforcement learning
A decentralized, partially observable Markov decision process is adopted to coordinate the actions of the agents, including traffic lights and AV, in the traffic network. When vehicles move in the same direction, an HDV is observable to an AV if the two vehicles are in the same lane and the distance between them is less than or equal to 25 m. Each traffic light agent also observes the two nearest vehicles, including their speed, distance to the intersection, and edge number. The position, speed, and acceleration of AV, as well as the cycles and statuses of traffic lights, are shared among all AV and traffic lights.
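The 25 m observability rule can be sketched as a simple filter. This is a minimal illustration with hypothetical vehicle IDs and one-dimensional longitudinal positions along a lane, not the actual implementation:

```python
OBSERVATION_RANGE_M = 25.0  # an HDV is observable to an AV within this distance

def observable_hdvs(av_pos, av_lane, hdvs):
    """Return IDs of HDVs in the same lane within the observation range.
    `hdvs` maps vehicle ID -> (longitudinal position in m, lane ID)."""
    return [vid for vid, (pos, lane) in hdvs.items()
            if lane == av_lane and abs(pos - av_pos) <= OBSERVATION_RANGE_M]

hdvs = {"hdv_1": (110.0, "edge0_0"),   # 10 m ahead, same lane -> observable
        "hdv_2": (160.0, "edge0_0"),   # 60 m ahead -> out of range
        "hdv_3": (120.0, "edge0_1")}   # different lane -> not observable
visible = observable_hdvs(100.0, "edge0_0", hdvs)
```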
The state, action space, policy, and reward function of the RL are defined as follows:
The state of each traffic light agent includes the time of the light's last change, the traffic flow direction controlled by the light (0 indicates passing with a green light, and 1 indicates stopping with a red light), and the states of the other traffic lights in the same traffic flow direction. At an intersection, if the top-bottom traffic lights have a status of “0,” the left-right traffic lights must have a status of “1,” and vice versa. When a traffic light is green, it changes to yellow for 3 s before switching to red, for safety.
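The complementary status encoding and the 3 s yellow interval described above can be sketched as follows (the function names and the phase representation are illustrative, not those of the implementation):

```python
YELLOW_S = 3  # yellow interval inserted before a green light turns red

def complementary_status(ns_status):
    """Statuses at one intersection are complementary:
    0 = pass (green), 1 = stop (red)."""
    assert ns_status in (0, 1)
    return 1 - ns_status

def phase_sequence(current):
    """Green does not switch to red directly; it passes through
    a 3 s yellow phase first, per the safety rule above."""
    if current == "green":
        return [("yellow", YELLOW_S), ("red", None)]
    return [("green", None)]

# If the top-bottom lights read 0 (green), the left-right lights must read 1 (red).
lr_status = complementary_status(0)
```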
A stochastic policy
The average reward an agent receives at each time step while following a PPO policy is referred to as the average policy reward. The average policy reward, also defined as expected return of policy,
The advantage estimate function
The parameter λ impacts weights of potential rewards in the advantage estimation function
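As a reference for how λ weights future rewards, the following sketch implements generalized advantage estimation in its commonly used form for PPO, which we assume matches the form used here; the γ and λ values below are illustrative:

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation:
      delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
      A_t     = sum over l >= 0 of (gamma * lam)^l * delta_{t+l}
    `values` carries one extra entry for the state after the last reward."""
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):  # accumulate the discounted deltas backward
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

adv = gae_advantages([1.0, 1.0, 1.0], [0.5, 0.5, 0.5, 0.0], gamma=0.99, lam=0.95)
```

Setting λ = 0 reduces the estimate to the one-step temporal-difference error, while λ = 1 recovers the full discounted return minus the value baseline; intermediate values trade bias against variance.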
In the RL algorithm, the value function
Fuel energy consumption rate model
The function,
Other constant parameters in the model have been defined in Table 1.
FEC rate model parameters.
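Since the exact model equation and the Table 1 values are not reproduced here, the following sketch shows a generic power-based instantaneous fuel-rate calculation built from the quantities named in the introduction (mass, drag coefficient, rolling coefficient, driveline efficiency, and an idle term); all parameter values are illustrative, not those of the paper's model:

```python
RHO_AIR = 1.2  # air density, kg/m^3

def fec_rate(v, a, mass=1500.0, cd=0.30, area=2.2, cr=0.01,
             eta=0.85, idle_rate=2.0e-4, k=7.0e-5):
    """Illustrative instantaneous fuel rate (L/s) from tractive power.
    Tractive force = aerodynamic drag + rolling resistance + inertia;
    fuel scales with positive engine power plus an idle term."""
    g = 9.81
    force = 0.5 * RHO_AIR * cd * area * v**2 + cr * mass * g + mass * a
    power = max(force * v, 0.0) / eta      # engine power in W; no fuel cut modeled
    return idle_rate + k * power / 1000.0  # k: litres per kJ of engine work

idle = fec_rate(0.0, 0.0)    # idling: only the idle term contributes
cruise = fec_rate(15.0, 0.0)
```

The `max(..., 0.0)` clamp reflects that braking does not return fuel; during deceleration the rate falls back to the idle term.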
Computational framework
Two publicly available software packages, Flow and SUMO, are adopted in this study. Flow is a traffic control benchmarking framework developed in Python that integrates RL algorithms into different traffic control scenarios. 19 The SUMO simulator handles large-scale traffic networks based on physical-world data. The integration of the SUMO and Flow packages and the implementation of the RL algorithm are shown in Figure 4.

Process diagram describing the RL training process and the interactions between SUMO, Flow, and the RLlib library. The RL and sim car-following controllers are used to control the AV and HDV, respectively. The sim car-following controller's actions are entirely defined by the simulator, whereas the RL controller performs actions by following commands from the policy in RLlib.
Results
The RL algorithm was applied to regulate the traffic flow in the selected traffic network with four different penetration rates of AV, 0%, 5%, 10%, and 20% (Figure 5).

Illustration of the traffic network with three vertical and three horizontal roads in the SUMO simulator. An overview of all AV (red vehicles), observable HDV (green vehicles), and unobservable HDV (white vehicles) in the traffic network. (Bottom left) A close-up view of the traffic flow between intersections within the yellow box at the lower left of the traffic network.
Training was conducted on a machine with a 4-core Intel® Core™ i5-6600 CPU @ 3.30 GHz. The hyperparameters used in the RL algorithms are listed in Table 2.
RL algorithm hyperparameters.
SGD: stochastic gradient descent; GAE: generalized advantage estimation; KL: Kullback–Leibler.
The results of this study have been divided into four categories:
Rewards on total delay at different penetration levels; Rewards on FEC rates at different penetration levels; Performance of PPO policy at different penetration levels; Comparison of the selected 3×3 traffic network with other networks.
Rewards on total delay at different penetration levels
Various penetration rates of AV in the traffic flow are examined to optimize the traffic flow, considering information sharing among traffic lights and AV.
Figure 6 shows that the traffic flow containing 100% HDV (i.e. 0% AV) has the worst long-term rewards on total delay compared to the 5%, 10%, and 20% AV penetration rates. Penetration of AV at 5%, 10%, and 20% results in convergence of the rewards on total delay. The 20% penetration rate shows a more complicated learning process because safety is prioritized over optimizing traffic flow speed and FEC rate at the early stage of learning. Once the AV become familiar with the traffic flow patterns during training, the PPO algorithm improves the rewards on total delay owing to good prediction of other vehicles’ behavior. The 10% penetration rate also exhibits fluctuations during training, possibly due to fewer interactions between AV and HDV on the road.

Behavior of average rewards of total delay with respect to time steps for different AV penetration rates.
Table 3 shows the average delay for different penetration rates of AV in the selected traffic grid network. At a 10% penetration rate, the average reward converged in the fewest time steps (110 K) compared to the other penetration rates.
Convergence time and steady-state average rewards on total delay obtained with different penetration rates.
FEC rates at different penetration levels
The FEC rate of a small-engine vehicle usually falls within the range of 0.05–0.10 L/s at average driving velocities. The reward on FEC rate can reach zero when the vehicle achieves low levels of FEC at minimally varying speed and other performance parameters. Results of the average rewards on FEC rate obtained at different penetration levels are presented in Figure 7 and Table 4. The penetration of AV yields larger rewards on FEC rates, while the pure-HDV case illustrates the worst scenario, with a reward on FEC rate of about −1000. Interestingly, the 10% penetration rate regulates the FEC rate faster in the simulation than the 5% and 20% penetration rates. The time step at which the FEC rate converges and the steady-state average rewards on the FEC rate for the four penetration rates are presented in Table 4.
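A minimal sketch of a reward consistent with this description (non-positive, approaching zero as the summed FEC rate of all vehicles falls) is shown below; the actual reward definition and scaling used in this study may differ, and the scale factor here is illustrative:

```python
def fec_reward(fec_rates, scale=100.0):
    """Non-positive reward over all vehicles: zero when no fuel is
    consumed, increasingly negative as the summed FEC rate (L/s) grows."""
    return -scale * sum(fec_rates)

r_idle = fec_reward([0.0, 0.0])    # zero consumption gives the best reward, 0
r_flow = fec_reward([0.08, 0.06])  # nonzero consumption gives a negative reward
```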

Behavior of average rewards on FEC rates with respect to time steps, considering four different penetration rates of AV in the traffic flow.
FEC rate results at different penetration rates.
Performance of PPO policy
To assess the effectiveness of the PPO policy in optimizing the FEC rate, the average policy reward, average environment waiting time, policy loss, entropy, and value function loss were evaluated at different penetration rates of AV. Figure 8 shows that the average policy rewards for the 5%, 10%, and 20% penetration rates converge to zero, while the HDV-only case has the worst reward, with a value of −107.2. With the 10% penetration rate, the average policy rewards start to converge at about 300 K steps, faster than the 0%, 5%, and 20% penetration rates.

Behavior of average policy rewards with respect to time steps for four different penetration rates of AV.
The time an agent stays in a specific state before applying an action is considered the environment waiting time. As shown in Figure 9, the highest average environment waiting time is observed for the 20% penetration rate, at a time step of about 850 K. The average environment waiting time of the 5% case is slightly higher than that of the 10% case by the end of training, and its peak, at about 850 K steps, is also higher than that of the 10% case.

Average environment waiting time with respect to time steps for four different penetration rates of AV.
The entropy behavior for PPO is shown in Figure 10, with high values observed at the 0% penetration rate. The minimum entropy by the end of training was observed at the 10% penetration rate, indicating less uncertainty in the policy distribution compared to the 5% and 20% penetration rates.

Entropy of the policy to optimize FEC rates for four different penetration rates of AV.
The total loss, a combination of policy loss and value function loss, is depicted in Figure 11. The minimum total loss is observed at the 10% penetration rate. All these performance indices show that the penetration of AV can improve the rewards on FEC rates compared with a 100% HDV traffic flow.

Total loss, including policy loss and value function loss, with respect to time steps for four penetration rates of AV.
Table 5 shows the values of five policy measurements to optimize the FEC rate for four different penetration rates of AV.
FEC rate results at different penetration rates.
Comparison with other traffic grid environments
With a 10% penetration rate of AV, four traffic environments, comprising 1×1, 1×2, 2×2, and 3×3 traffic grids, were simulated with the proposed PPO algorithm. Figures 12 to 15 show the behavior of the average rewards on FEC rates with respect to time steps for each simulated environment. Table 6 presents the converged average rewards on FEC rates and the convergence time for each environment. Specifically, the average FEC reward in the 3×3 traffic grid converged at about 216 K steps, a shorter convergence time than those of the other environments.

Behavior of average rewards on FEC rate with respect to time steps for a 1×1 traffic grid.

Behavior of average rewards on FEC rate with respect to time steps for a 1×2 traffic grid.

Behavior of average rewards on FEC rate with respect to time steps for a 2×2 traffic grid.

Behavior of average rewards on FEC rate with respect to time steps for a 3×3 traffic grid.
Performance of the PPO algorithm for rewards on FEC rates for different traffic grid networks.
Discussion
As more cars run on fuels such as gasoline and emit air pollutants, the demand for eco-driving strategies is high.30,31 In this study, we employed an RL algorithm, PPO, to investigate the impact of introducing AV into a traffic flow of HDV on reducing traffic delay and minimizing FEC rates. This involved introducing a specific penetration rate of AV into a continuous traffic flow, coordinated with traffic light signals, within a large 3×3 traffic grid system. The Flow computational package, developed in Python, was utilized to integrate the publicly available microscopic traffic simulator, SUMO, and the RL library, RLlib. A previous study 32 compared different types of action spaces for different algorithms. Algorithms such as QL, DQN, and DDPG are considered reliable for specific types of action spaces, either continuous or discrete. For environments that feature both continuous and discrete action spaces, PPO-based RL algorithms are feasible owing to their computational and sample efficiency. Therefore, a PPO-based RL approach is used in this work to train agents in the selected traffic environment.
The environment setup consists of a traffic network with three horizontal and three vertical roads in the SUMO simulator. Different percentages of AV (0%, 5%, 10%, and 20%) were introduced to regulate the speed of HDV in the network. The average rewards on total delay and FEC rates were computed for the different penetration rates. The penetration of AV yielded better average rewards on both total delay and FEC rates. The 20% AV penetration initially results in more delays because of its prioritization of safety over speed and efficiency. Specifically, a 10% penetration rate of AV combined with HDV showed significant results for minimizing the FEC rate and total delay. The rewards on total delay for the 10% penetration rate converged to a minimum value of −53.34 in the fewest time steps (110 K) compared with the other cases. At a 0% penetration rate, an average reward on FEC rates of −964.3 was obtained by the end of training; for all other cases, the rewards on FEC rates approached zero. To assess the performance of the PPO policy in training agents to minimize the FEC rates, results for average policy reward, entropy, value function loss, and mean environment waiting time were obtained at various penetration rates. The 10% penetration rate demonstrated better performance than the 0%, 5%, and 20% rates. A comparison of four traffic light grids (1×1, 1×2, 2×2, and 3×3) was performed at a 10% penetration level in terms of the FEC rate. The results indicated that the average rewards on FEC rates converged in a shorter time for the 3×3 traffic network than for the other configurations.
We are well aware of the limitations of this study. PPO-based RL algorithms are hyperparameter-sensitive and must be precisely tuned to achieve the most effective learning results. Lane changing was not incorporated into this research; lane-changing behavior can be addressed by introducing a lane-change controller in the future. All HDV are assumed to share the same fuel economy model, whereas heavy-duty vehicles, buses, motorcycles, and passenger vehicles have different fuel economy models. To address this limitation, we can identify a fuel economy model for each vehicle type, determine the percentage of each type from public traffic flow datasets, and integrate this information into the energy calculation in future research. We also assume ideal communication, without delay or failure, in controlling AV. In the future, we can apply RL algorithms to scaled-down AV to examine the impact of communication delays.
A comparison of the proposed eco-driving strategy is performed with prior related research, as shown in Table 7, suggesting the effectiveness of the proposed PPO algorithm for an eco-driving strategy.
Comparison with prior research work.
Conclusions
Eco-driving positively impacts human health by reducing the pollution resulting from vehicle fuel consumption and emissions. This study explores a hybrid traffic network that combines AV and HDV through the coordination of traffic light signals to manage a large traffic flow. The approach addresses eco-driving challenges, including real-time traffic data collection and the intricate nature of the traffic network, which currently lacks a comprehensive mathematical model. The research employs model-free PPO-based RL algorithms to analyze the FEC rates of vehicles. It focuses on minimizing FEC by introducing specific penetration rates of AV (0%, 5%, 10%, and 20%) into a 3×3 traffic grid system, utilizing the Flow computational package to integrate the SUMO simulator and RLlib. The results indicate that a 10% penetration rate of AV alongside HDV yielded significant reductions in both fuel consumption and total traffic delay.
