Article

Maneuver Decision-Making through Automatic Curriculum Reinforcement Learning without Handcrafted Reward Functions

Yujie Wei, Hongpeng Zhang, Yuan Wang and Changqiang Huang
1 Aeronautics Engineering College, Air Force Engineering University, Xi’an 710038, China
2 Air Defence and Antimissile College, Air Force Engineering University, Xi’an 710051, China
3 Air Force Xi’an Flying College, Xi’an 710300, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(16), 9421; https://doi.org/10.3390/app13169421
Submission received: 17 May 2023 / Revised: 10 August 2023 / Accepted: 15 August 2023 / Published: 19 August 2023
(This article belongs to the Special Issue Intelligent Unmanned System Technology and Application)

Abstract

Maneuver decision-making is essential for autonomous air combat. However, previous methods usually make decisions to aim at the target rather than to hit it, and use discrete instead of continuous action spaces. While these simplifications make maneuver decision-making easier, they also make it less realistic. Meanwhile, previous studies usually rely on handcrafted reward functions, which are troublesome to design. Therefore, to solve these problems, we propose an automatic curriculum reinforcement learning method that enables agents to learn effective air combat maneuvers from scratch. On the basis of curriculum reinforcement learning, maneuver decision-making is divided into a series of sub-tasks from easy to difficult, so that agents can gradually learn to complete them without handcrafted reward functions. The ablation studies show that automatic curriculum learning is essential for reinforcement learning in this setting; namely, agents cannot make effective decisions without curriculum learning. Simulations show that, after training, agents are able to make effective decisions given different states, including tracking, attacking, and escaping, and these decisions are both rational and interpretable.

1. Introduction

Autonomous maneuver decision-making is a sequential decision-making problem. In air combat, the goal of an agent is to maneuver according to different states and launch missiles to defeat its opponent. Usually, such decisions are made by a human pilot. However, with the development of artificial intelligence, many studies have employed programs or algorithms [1,2] as virtual pilots that make decisions in simulation environments, with the aim of realizing autonomous decision-making for unmanned combat aerial vehicles in the future.
For example, Hu [3] proposed an improved deep Q-network [4] for air combat maneuver decision-making, verifying the feasibility of deep reinforcement learning (RL) for this task. Eloy et al. [5] applied game theory to the confrontation process of air combat and proposed a differential game method combined with the missile attack area [6] in order to attack static high-value targets. Dantas et al. [7] compared different methods of evaluating the most effective moment for launching missiles during air combat; they found that supervised learning on simulated data can improve flight quality in beyond-visual-range air combat and increase the likelihood of hitting the desired targets. Fan et al. [8] used the asynchronous advantage actor–critic algorithm [9] to address air combat maneuver decision-making and proposed a two-layer reward mechanism comprising internal rewards and sparse rewards; their simulation results indicated that asynchronous training can reduce the correlation between samples. Wang et al. [10] adopted deep deterministic policy gradient for beyond-visual-range air combat and validated the effectiveness of the method in simulations. Huang et al. [11] applied Bayesian inference and moving horizon optimization to air combat maneuver decision-making: the weights of the maneuver decision factors are adjusted by Bayesian inference, and the control quantities are then computed by moving horizon optimization. Ide et al. [12] proposed hierarchical maximum-entropy reinforcement learning with reward shaping based on expert knowledge; this approach placed second among eight competitors in the final event of DARPA's AlphaDogfight Trials.
Although these studies are valuable and lay a foundation for replacing human pilots with computers in real environments, they still have some shortcomings. First, to reduce the complexity of decision-making, many studies use discrete action spaces, whereas real action spaces are continuous. Second, hitting the target with a missile is usually simplified as aiming at the target; in reality, even if the target has been locked, it may not be hit. Finally, these studies use handcrafted reward functions, in which the agent obtains a non-zero reward at each time step. The disadvantage of handcrafted reward functions is that they may not be necessary, and designing them can cost considerable effort and time.
On the other hand, Graves et al. [13] introduced a method that automatically selects curricula based on the rate of increase in prediction accuracy and network complexity in order to improve learning efficiency. Matiisen et al. [14] introduced teacher–student curriculum learning, in which the teacher selects sub-tasks for the student and the student attempts to solve them. Automatic curriculum learning methods have also performed well on decimal-number addition [15] and in Minecraft [16]. A goal proposal module was introduced in [17]; this method prefers goals that decrease the certainty of the Q-function. The authors evaluated the method on thirteen robotic tasks and five navigation tasks and demonstrated strong performance.
Castells et al. [18] proposed a method that automatically prioritizes samples with low loss, realizing the core mechanism of curriculum learning without changing the training procedure; experiments on several computer vision tasks showed strong performance. A self-paced curriculum learning approach is introduced in [19], which takes into account both the prior knowledge available before training and the learning process itself. Stretcu et al. [20] proposed a curriculum learning method that decomposes challenging tasks into sequences of intermediate targets for pre-training the model; the results showed that the classification accuracy of the method on their data set improved by 7%. Sukhbaatar et al. [21] proposed an automatic curriculum learning method based on asymmetric self-play. The method assigns two different minds to an agent, Alice and Bob: Alice poses a task, and Bob attempts to fulfill it. The core idea is that, through self-play, Bob can understand the environment and fulfill the final task faster.
Rane used curriculum learning to solve tasks with sparse rewards [22]. The experimental results showed that curriculum learning can improve the performance of agents. Pascal et al. [23] interpreted the curricula as a task distribution sequence interpolated between the auxiliary task distribution and the target task distribution and framed the generation of a curriculum as a constrained optimal transport problem between task distributions. Wu et al. [24] proposed bootstrapped opportunistic adversarial curriculum learning, which opportunistically skips forward in the curriculum if the model of the current phase is already robust. Huang et al. [25] proposed GRADIENT, which formulates curriculum reinforcement learning as an optimal transport problem with a tailored distance metric between tasks.
Therefore, we propose an air combat maneuver decision-making method called automatic curriculum reinforcement learning (ACRL). The main contributions are as follows: 1. ACRL can make the missile hit the target instead of merely aiming at it. In most previous research, hitting the target with a missile is simplified as aiming at the target; however, even if the target has been locked, it may not be hit. 2. ACRL does not require any human data or handcrafted reward functions and only uses the result of air combat as the reward signal. Because no reward design is needed, ACRL saves considerable time and effort. 3. ACRL uses continuous action spaces for maneuver decision-making instead of unrealistic discrete action spaces. 4. ACRL is verified by ablation studies, and the decision-making ability of the trained agent is demonstrated by simulations.

2. Method

2.1. Aircraft Model and Missile Model

The aircraft model is listed as follows [26]:
$$
\begin{aligned}
\dot{x} &= v\cos\gamma\cos\psi \\
\dot{y} &= v\cos\gamma\sin\psi \\
\dot{z} &= v\sin\gamma \\
\dot{v} &= g\,(n_x - \sin\gamma) \\
\dot{\gamma} &= \frac{g}{v}\,(n_z\cos\mu - \cos\gamma) \\
\dot{\psi} &= \frac{g\, n_z\sin\mu}{v\cos\gamma}
\end{aligned}
\qquad
n_x \in [0.5,\, 1.5],\quad n_z \in [3,\, 9],\quad \mu \in [-\pi,\, \pi]
$$
where x, y, and z are three-dimensional coordinates of the aircraft. γ and ψ are pitch angle and yaw angle, respectively. v represents the aircraft speed. g is the gravitational acceleration. μ , n x , and n z are control signals. The missile model is [27]:
$$
\begin{aligned}
\dot{x}_m &= v_m\cos\gamma_m\cos\psi_m \\
\dot{y}_m &= v_m\cos\gamma_m\sin\psi_m \\
\dot{z}_m &= v_m\sin\gamma_m \\
\dot{v}_m &= \frac{(P_m - Q_m)\,g}{G_m} - g\sin\gamma_m \\
\dot{\psi}_m &= \frac{n_{mc}\, g}{v_m\cos\gamma_m} \\
\dot{\gamma}_m &= \frac{n_{mh}\, g - g\cos\gamma_m}{v_m}
\end{aligned}
$$
where xm, ym, and zm are three-dimensional coordinates of the missile. vm represents the missile speed. γ m and ψ m are pitch angle and yaw angle, respectively. nmc and nmh are control signals. Pm, Qm, and Gm are thrust, resistance, and mass, respectively:
$$
P_m = \begin{cases} P_0, & t \le t_w \\ 0, & t > t_w \end{cases}
$$
$$
Q_m = \frac{1}{2}\rho v_m^2 S_m C_{Dm}
$$
$$
G_m = \begin{cases} G_0 - G_t\, t, & t \le t_w \\ G_0 - G_t\, t_w, & t > t_w \end{cases}
$$
where tw = 12.0 s, ρ = 0.607, Sm = 0.0324, and CDm = 0.9. P0 is the average thrust, G0 is the initial mass, and Gt is the fuel flow rate. K is the guidance coefficient of the proportional guidance law.
$$
n_{mc} = \frac{K v_m \cos\gamma_t}{g}\left[\dot{\beta} + \tan\varepsilon\,\tan(\varepsilon+\beta)\,\dot{\varepsilon}\right], \qquad
n_{mh} = \frac{v_m K}{g}\cos(\varepsilon+\beta)\,\dot{\varepsilon}
$$
$$
\beta = \arctan\!\left(\frac{r_y}{r_x}\right), \qquad
\varepsilon = \arctan\!\left(\frac{r_z}{\sqrt{r_x^2 + r_y^2}}\right)
$$
$$
\dot{\beta} = \frac{\dot{r}_y r_x - r_y \dot{r}_x}{r_x^2 + r_y^2}, \qquad
\dot{\varepsilon} = \frac{(r_x^2 + r_y^2)\,\dot{r}_z - r_z(\dot{r}_x r_x + \dot{r}_y r_y)}{R^2\sqrt{r_x^2 + r_y^2}}
$$
where nmc and nmh are the control commands of the missile. β and ε are the yaw angle and pitch angle of the line of sight, respectively. The line-of-sight vector is the distance vector r, where $r_x = x_t - x_m$, $r_y = y_t - y_m$, $r_z = z_t - z_m$, and $R = \lVert r \rVert = \sqrt{r_x^2 + r_y^2 + r_z^2}$. If the missile has not hit the target after 27 s, the target is regarded as missed. If the azimuth angle of the target relative to the aircraft exceeds 60° (the off-axis angle) at launch time, the target is also regarded as missed. The conditions for ending a simulation are: 1. One of the two sides is hit by a missile (a hit is recorded when the miss distance is within 30 m); 2. The missiles of both sides miss their targets; 3. The simulation time reaches 100 s. Meanwhile, we do not use any handcrafted reward function; the reward obtained by the agent is 1 if it defeats the opponent, −1 if it is defeated by the opponent, and 0 in all other cases.
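For illustration, the aircraft dynamics above and the sparse outcome reward described in this subsection can be summarized in a short Python sketch. The function names, state layout, and integration step size are our own assumptions and are not taken from the paper; only the equations and the ±1/0 reward follow the text.

```python
import numpy as np

G = 9.81  # gravitational acceleration (m/s^2)

def aircraft_step(state, action, dt=0.1):
    """One Euler-integration step of the 3-DOF aircraft model above.
    state  = [x, y, z, v, gamma, psi]
    action = [nx, nz, mu]  (control signals)
    dt is an assumed integration step, not specified in the paper."""
    x, y, z, v, gamma, psi = state
    nx, nz, mu = action
    dx = v * np.cos(gamma) * np.cos(psi)
    dy = v * np.cos(gamma) * np.sin(psi)
    dz = v * np.sin(gamma)
    dv = G * (nx - np.sin(gamma))
    dgamma = G / v * (nz * np.cos(mu) - np.cos(gamma))
    dpsi = G * nz * np.sin(mu) / (v * np.cos(gamma))
    return np.array([x + dx * dt, y + dy * dt, z + dz * dt,
                     v + dv * dt, gamma + dgamma * dt, psi + dpsi * dt])

def sparse_reward(own_hit, opponent_hit):
    """Air-combat outcome used as the only reward signal:
    +1 for a win, -1 for a loss, 0 otherwise (including draws)."""
    if opponent_hit and not own_hit:
        return 1.0
    if own_hit and not opponent_hit:
        return -1.0
    return 0.0
```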

2.2. Proximal Policy Optimization

Policy gradient methods compute an estimate of the policy gradient and use it for stochastic gradient ascent [28]. Both trust region policy optimization (TRPO) [29] and proximal policy optimization (PPO) [30] are policy gradient methods; PPO builds on TRPO and is more effective. In TRPO, the objective is maximized subject to a constraint on the size of the policy update. After a linear approximation of the objective and a quadratic approximation of the constraint, the problem can be solved approximately by the conjugate gradient method. However, TRPO is complicated and is not compatible with architectures that include noise or parameter sharing.
PPO achieves the data efficiency and reliable performance of TRPO while using only first-order optimization. Schulman et al. proposed a new objective with a clipped probability ratio, which forms a pessimistic (lower-bound) estimate of policy performance. The standard policy gradient method performs one gradient update per data sample, whereas PPO alternates between sampling data with the current policy and performing several epochs of optimization on the sampled data. Schulman et al. compared different surrogate objectives, including the unclipped objective, the clipped objective, and KL penalties (with fixed or adaptive coefficients).
Specifically, PPO modifies the objective function to penalize policy changes that cause the probability ratio to deviate from 1. The objective function of PPO is:
$$
L^{CLIP}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta) A_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A_t\right)\right], \qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
$$
By clipping the probability ratio in the surrogate objective, PPO prevents excessively large policy updates and makes the optimization process simpler and more robust than that of TRPO.
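As an illustration of the clipped surrogate objective above, a minimal PyTorch sketch might look as follows. The tensor names and the clipping constant are illustrative; this is a sketch of the standard clipped loss, not the exact implementation used in this work.

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate loss L^CLIP (negated for gradient descent).
    log_probs_new : log pi_theta(a_t | s_t) under the current policy
    log_probs_old : log pi_theta_old(a_t | s_t), detached from the graph
    advantages    : advantage estimates A_t
    """
    ratio = torch.exp(log_probs_new - log_probs_old)          # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Maximizing the minimum of the two terms <=> minimizing its negation.
    return -torch.min(unclipped, clipped).mean()
```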
PPO is used in self-play to train the air combat agents. Figure 1 shows the self-play architecture used for training. An action, or maneuver, is output by the policy and applied to the air combat environment. A transition consisting of states, actions, and rewards is then obtained and stored in the experience pool. After several episodes, transitions are sampled from the experience pool and sent to the learner, where the policy is trained on them by PPO. After training, the new policy is used for self-play in the air combat environment to obtain better transitions than previous policies, and these better transitions can in turn be used for training to obtain better policies. Therefore, the self-play architecture gradually improves the decision-making ability.
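The self-play procedure described above can be outlined in pseudocode-like Python. Names such as `env`, `policy`, `buffer`, and `ppo_update` are placeholders for the components described in the text, and their interfaces are our own assumptions.

```python
import random

def self_play_training(env, policy, buffer, ppo_update,
                       num_iterations, episodes_per_iter):
    """Sketch of the self-play loop in Figure 1: both sides are controlled
    by (snapshots of) the same policy, transitions go to an experience pool,
    and PPO periodically trains the policy on sampled transitions."""
    past_policies = [policy.snapshot()]              # pool of earlier agents
    for _ in range(num_iterations):
        for _ in range(episodes_per_iter):
            opponent = random.choice(past_policies)  # fight a past version
            state = env.reset()                      # random initial state
            done = False
            while not done:
                action = policy.act(state.own_view())
                opp_action = opponent.act(state.opponent_view())
                next_state, reward, done = env.step(action, opp_action)
                buffer.store(state.own_view(), action, reward, done)
                state = next_state
        ppo_update(policy, buffer.sample())          # train on the experience pool
        past_policies.append(policy.snapshot())      # new policy joins the pool
    return policy
```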

2.3. Automatic Curriculum Reinforcement Learning

The goal of RL is to explore the environment to maximize returns [31,32,33]. Usually, the agent at the beginning makes decisions randomly because it has not yet been trained. The agent is trained by the RL algorithm with the samples it acquires, which enables it to learn how to obtain more returns. Then, the trained agent can acquire samples with more returns by means of actions it generates, and by repeating this process continuously, the agent can generate actions with more returns. On the other hand, human data can be used to pre-train agents to enable them to make rational decisions at the beginning [34,35] rather than completely random decisions. After that, RL algorithms are used to train agents for better decisions.
Without pre-training, the training process is more concise, because collecting effective expert data requires plenty of time and effort, and obtaining a well-trained agent through pre-training does as well. Owing to their randomness, untrained agents have a certain exploratory nature; that is, they may find more rewards in the environment through random behaviors. On the other hand, this randomness can be disadvantageous. For example, in an environment with sparse rewards, it is difficult for agents to obtain rewards through random behaviors, so it is difficult for them to complete the task.
The maneuver decision problem in air combat can be regarded as a task with sparse rewards. A schematic diagram of air combat maneuver decision-making is shown in Figure 2, which illustrates three possible outcomes of decisions. The green aircraft and the blue aircraft represent the two sides of the air combat, the solid lines represent the flight trajectories generated by the green aircraft, and the gray triangles represent the off-axis angle of the missile. As shown by the two red trajectories in Figure 2, because the azimuth angle of the target is larger than the off-axis angle, the missile will miss the target if it is launched. Only if the target azimuth angle is less than the off-axis angle can the missile possibly hit the target, as shown by the green trajectory in Figure 2.
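The launch constraint illustrated in Figure 2 amounts to comparing the target azimuth with the 60° off-axis angle. A minimal sketch of this geometric check is given below; the helper names are ours, and only the 60° limit comes from the text.

```python
import numpy as np

MAX_OFF_AXIS_DEG = 60.0  # off-axis angle of the missile

def can_launch(own_pos, own_vel, target_pos):
    """Return True if the target azimuth relative to the aircraft's
    velocity vector is within the missile's off-axis angle."""
    los = np.asarray(target_pos) - np.asarray(own_pos)   # line of sight
    cos_azimuth = np.dot(own_vel, los) / (
        np.linalg.norm(own_vel) * np.linalg.norm(los) + 1e-8)
    azimuth = np.degrees(np.arccos(np.clip(cos_azimuth, -1.0, 1.0)))
    return azimuth <= MAX_OFF_AXIS_DEG
```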
Therefore, it is difficult for agents to hit the target by random decisions. Meanwhile, in the process of training, we found that these agents cannot make effective decisions. To solve this problem, we propose ACRL to train agents. ACRL decomposes the original task into a task sequence from easy to difficult, and then uses PPO to train the agent to complete all tasks in the task sequence, ultimately enabling the agent to complete the original task. In air combat, the initial state is random, including the initial distance, initial velocity, and initial angle. When the initial azimuth angle of the opponent is larger, it is more difficult for the agent to overcome the opponent. Therefore, we propose a method to automatically generate a task sequence to improve the training efficiency.
Concretely, we first perform several simulations with random initial states, where the initial angles and distances are selected within [−1, 1] (the original intervals are [−180°, 180°] and [4000 m, 16,000 m], respectively; for simplicity, both are normalized to [−1, 1]), and record the initial angles and distances of the simulations in which the missiles hit their targets. The maximum and minimum values of these initial angles and distances are amax, amin, bmax, and bmin, respectively. These four values form two intervals, [amin, amax] and [bmin, bmax], which are proper subsets of [−1, 1]. The initial angle and initial distance are therefore first selected within these two intervals, which forms the first sub-task, easier than the original task; that is, the agent is trained within the two intervals. If the agent is able to make effective decisions within them, the interval length is increased, which corresponds to the second sub-task with intervals [amin − δ, amax + δ] and [bmin − δ, bmax + δ]; it is easier than the original task but more difficult than the first sub-task. After the second sub-task is completed, the intervals are changed to [amin − 2δ, amax + 2δ] and [bmin − 2δ, bmax + 2δ], and so on, until the intervals become [−1, 1]. δ is set to 0.1, and if the number of wins is greater than both 20 and the number of losses, the interval length is increased; otherwise, the intervals are not changed.
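The curriculum-generation rule described above can be written as a short sketch. The win/loss bookkeeping is simplified and the function names are ours; the expansion step δ = 0.1, the win threshold of 20, and the clipping to [−1, 1] follow the text.

```python
import random

DELTA = 0.1  # interval expansion step (delta in the text)

def update_curriculum(interval_angle, interval_dist, wins, losses):
    """Widen the sampling intervals for the initial angle and distance
    when the agent wins more than 20 episodes and more than it loses;
    both intervals are clipped to the full normalized range [-1, 1]."""
    if wins > 20 and wins > losses:
        a_min, a_max = interval_angle
        b_min, b_max = interval_dist
        interval_angle = (max(a_min - DELTA, -1.0), min(a_max + DELTA, 1.0))
        interval_dist = (max(b_min - DELTA, -1.0), min(b_max + DELTA, 1.0))
    return interval_angle, interval_dist

def sample_initial_condition(interval_angle, interval_dist):
    """Sample a normalized initial angle and distance for the next episode."""
    angle = random.uniform(*interval_angle)
    dist = random.uniform(*interval_dist)
    return angle, dist
```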

2.4. Air Combat State

As shown in Table 1, the input of the neural network is a one-dimensional vector with 11 elements: ψ, γ, v, z, d, f1, ψ1, γ1, d1, β, and f2. Min–max normalization is applied, and the hyperbolic tangent function is used as the activation function.
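A sketch of how the 11-element observation of Table 1 might be assembled and min–max normalized is given below. The normalization bounds marked as placeholders are our own assumptions for illustration; only the velocity range comes from Table 2.

```python
import numpy as np

def min_max_normalize(x, lo, hi):
    """Scale x from [lo, hi] to [-1, 1]."""
    return 2.0 * (x - lo) / (hi - lo) - 1.0

def build_state(psi, gamma, v, z, d, f1, psi1, gamma1, d1, beta, f2):
    """Assemble the 11-element observation of Table 1 (bounds partly assumed)."""
    return np.array([
        min_max_normalize(psi,    -np.pi, np.pi),
        min_max_normalize(gamma,  -np.pi / 2, np.pi / 2),
        min_max_normalize(v,       250.0, 400.0),      # velocity range from Table 2
        min_max_normalize(z,       0.0, 10000.0),      # placeholder altitude range
        min_max_normalize(d,       0.0, 16000.0),      # placeholder distance range
        f1,                                            # own missile launched: 0 or 1
        min_max_normalize(psi1,   -np.pi, np.pi),
        min_max_normalize(gamma1, -np.pi / 2, np.pi / 2),
        min_max_normalize(d1,      0.0, 16000.0),      # placeholder distance range
        min_max_normalize(beta,    0.0, np.pi),
        f2,                                            # opponent missile launched: 0 or 1
    ], dtype=np.float32)
```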

3. Experiments

The effectiveness of the proposed method is verified by ablation studies and simulations in this section. The ablation studies compare the training process of ACRL with that of the original RL algorithm, and the simulations verify the decision-making ability of agents trained by ACRL. For each method, five independent experiments of forty iterations each are conducted, and the results are recorded. In each test, 36 past agents are selected randomly to fight against the current agent. The initial distance and initial angle are randomly selected within the corresponding intervals. The hyperparameters are shown in Table 2. The simulation is implemented in Python: the aircraft model and missile model are written in Python 3.9.7, and the neural networks are trained with PyTorch 1.11.0. The neural networks are small and fast; therefore, they could be deployed in practical projects.
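Based on the architectures, activation, and learning rates listed in Table 2, the actor and critic networks can be sketched in PyTorch as follows. The 11-dimensional input follows Table 1; any further details (e.g., how the 4 actor outputs are mapped to controls) are not specified here and are left out.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: two hidden layers of 256 units with tanh activations
    and 4 outputs, matching the actor architecture (256, 256, 4) in Table 2."""
    def __init__(self, state_dim=11, action_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.Tanh(),
            nn.Linear(256, 256), nn.Tanh(),
            nn.Linear(256, action_dim),
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Value network: same hidden layers with a scalar output (256, 256, 1)."""
    def __init__(self, state_dim=11):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.Tanh(),
            nn.Linear(256, 256), nn.Tanh(),
            nn.Linear(256, 1),
        )

    def forward(self, state):
        return self.net(state)

# Optimizers with the learning rates from Table 2.
actor, critic = Actor(), Critic()
actor_opt = torch.optim.Adam(actor.parameters(), lr=2e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
```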

3.1. Ablation Studies

Figure 3 shows the changes in the number of wins, losses, and draws in the training process of ACRL, PPO without curriculum (PPO-WC), PPO with reversed curriculum (PPO-RC, from difficult to easy), and SAC [12] with curriculum (SAC-C). The solid line represents the mean of the number of wins, losses, or draws of the corresponding method, and the shaded part represents the standard deviation of the number of wins, losses, or draws.
As shown in Figure 3a, the number of wins of ACRL is always greater than that of PPO-WC during training. The number of wins of ACRL first decreases and then increases because the early curricula are easier: the missile is more likely to hit the target, resulting in more wins; as the difficulty of the curricula gradually increases, it becomes harder for the missile to hit the target, so the number of wins gradually decreases. In contrast, the number of wins of PPO-RC does not increase at the beginning: since it is difficult to win under the reversed curriculum, it is difficult for agents to acquire rewards and improve their abilities, and their number of wins remains below that of ACRL. On the other hand, the agent gradually learns how to make effective decisions, resulting in a gradual increase in the number of wins of ACRL, as shown in Figure 3a. At the same time, more losses also indicate better training, because both wins and losses result from the agent's self-play.
However, the numbers of wins of PPO-WC, PPO-RC, and SAC-C are much smaller than that of ACRL at the end of training, which means that: 1. ACRL is more effective; 2. Both the curriculum and PPO are essential. On the other hand, the number of draws of ACRL first increases and then decreases, which is consistent with the results in Figure 3a,b. This is because, in the early curricula, missiles are more likely to hit the target, resulting in fewer draws; as the difficulty of the curricula increases, it becomes more difficult to hit the target, resulting in more draws. The agent gradually learns effective decisions during training, which ultimately results in a decrease in the number of draws.

3.2. Simulation Results

In this section, the maneuver decision-making ability of the proposed method is verified. We categorize the simulations into four scenarios to demonstrate the effectiveness of the proposed method: 1. Two untrained agents. 2. The fortieth agent versus the first agent. 3. The fortieth agent versus the fifteenth agent. 4. The fortieth agent versus the thirtieth agent.
The results of untrained agents, i.e., the results of random decisions, are shown in Figure 4, which shows the three-dimensional trajectory and its top view. The initial distance range is (4000 m, 16,000 m), the initial angle range is (−180°, 180°), and the initial velocity range is (250 m/s, 400 m/s). In Figure 4, the solid line represents the flight trajectory of the aircraft, and the dashed line represents the flight trajectory of the missile. The blue and yellow dots represent the starting points of the two agents, respectively.
In simulation 1, both agents adopt random decisions. The initial yaw angle of Agent 1 is −138.0°, and the initial yaw angle of Agent 2 is −74.4°. From Figure 4b, it can be seen that due to the untrained nature of the agents, they do not make any effective decisions. Agent 1 launches the missile at the beginning of the simulation, while Agent 2 launches the missile after flying forward for some time. However, both missiles miss the target; thus, the simulation ends.
In simulation 2, the initial yaw angle of Agent 1 is 67.6°, and the initial yaw angle of Agent 2 is −60.0°. From Figure 4d, it can be seen that both Agent 1 and Agent 2 choose to launch missiles at the beginning of the simulation, and Missile 2 misses the target. Finally, Missile 1’s flight time exceeds 27 s; thus, the simulation ends. In simulation 3, the initial yaw angle of Agent 1 is −3.3°, and the initial yaw angle of Agent 2 is 36.1°. From Figure 4f, it can be seen that Missile 2 misses the target because of the off-axis angle, while Missile 1 misses the target because of the maximum flight time. As shown in the above scenario, without training, agents maneuver randomly; therefore, they cannot make any rational and effective decisions to hit the target.
Figure 5 shows the results of the fortieth agent and the first agent. Agent 1 represents the fortieth agent, and Agent 2 represents the first agent. In simulation 1, the initial yaw angle of Agent 1 is 56.5°, and the initial yaw angle of Agent 2 is −66.4°. Agent 1 launches the missile after flying for some time. Then, Agent 1 quickly moves away from Agent 2, as this decision can reduce the probability of being hit by Agent 2’s missile. After some time, Agent 2 launches the missile. Finally, Agent 2 is hit by Agent 1’s missile; thus, the simulation ends.
In simulation 2, the initial yaw angle of Agent 1 is 19.2°, and the initial yaw angle of Agent 2 is −50.2°. Agent 1 adopts similar decisions as in simulation 1: it first approaches and aims at the target, then launches the missile and moves away from the target. Meanwhile, Agent 2 is unable to make effective decisions; it launches the missile at the beginning of the simulation and consequently misses the target. In simulation 3, the initial yaw angle of Agent 1 is −92.0°, and the initial yaw angle of Agent 2 is 101.1°. Agent 1 first turns back and then tracks Agent 2. Because of the long distance between the two agents, Agent 1 does not launch its missile, whereas Agent 2 keeps flying forward instead of making any effective decisions and launches its missile without aiming at the target, thus missing it. As shown in this scenario, the decisions of the fortieth agent are obviously better than those of the first agent, which means that the proposed method is effective.
Figure 6 shows the results of the fortieth agent and the fifteenth agent. Agent 2 represents the fifteenth agent. In simulation 1, the initial yaw angle of Agent 1 is −99.1°, and the initial yaw angle of Agent 2 is −10.4°. Agent 2 does not launch the missile at the beginning, indicating that the agent gradually learns when to launch the missile during the training. Agent 1 first approaches Agent 2 from the side and then launches the missile. Finally, the missile of Agent 1 hits Agent 2.
In simulation 2, the initial yaw angle of Agent 1 is 123.3°, and the initial yaw angle of Agent 2 is 125.0°. Agent 1 launches its missile after targeting Agent 2, but Agent 2 fails to target Agent 1 when launching its missile, and therefore misses the target. Finally, because Missile 1's flight time exceeds 27 s, Missile 1 also misses the target. In simulation 3, the initial yaw angle of Agent 1 is 100.3°, and the initial yaw angle of Agent 2 is −80.7°. Agent 1 quickly flies away from Agent 2 to avoid being hit by the missile launched by Agent 2. After launching its missile, Agent 2 hardly changes its flight direction and only flies upwards to increase its altitude, so it is hit by Missile 1. As shown in this scenario, the decisions of the fifteenth agent are better than those of the first agent but still worse than those of the fortieth agent, which means that the agent's ability gradually improves during training.
Figure 7 shows the results of the fortieth agent and the thirtieth agent. Agent 2 represents the thirtieth agent. In simulation 1, the initial yaw angle of Agent 1 is −155.5°, and the initial yaw angle of Agent 2 is 8.6°. Agent 1 and Agent 2 are facing away from each other at the beginning. Both agents try to maneuver to make the other’s azimuth angle less than the off-axis angle before launching missiles. Agent 1 changes its flight direction after launching the missile to avoid being hit by Agent 2’s missile. Finally, Agent 2 is hit by the missile launched by Agent 1, so the simulation ends.
In simulation 2, the initial yaw angle of Agent 1 is 2.5°, and the initial yaw angle of Agent 2 is 66.1°. Agent 2 is being pursued by Agent 1; it flies forward and launches its missile after some time, but the missile misses the target. Agent 1 first tracks Agent 2 and flies away after launching its missile. Finally, Agent 2 is hit by the missile of Agent 1. In simulation 3, the initial yaw angle of Agent 1 is −63.7°, and the initial yaw angle of Agent 2 is 96.9°. Both agents first turn toward each other, launch missiles, and then move away from each other after launching. Finally, the missile of Agent 2 hits Agent 1. As shown in this scenario, the decisions of the thirtieth agent are much better than those of the first and fifteenth agents, and its behavior is very similar to that of the fortieth agent.

4. Discussion

In this article, agents use continuous action spaces and, starting from completely random decisions, gradually learn how to make effective maneuver decisions through ACRL, ultimately becoming able to cope with targets in different situations. Unlike previous research, ACRL uses continuous action spaces in maneuver decision-making and regards the miss distance as the criterion for the outcome of air combat, rather than whether the target has entered the missile attack zone. Therefore, this study is more in line with reality.
Designing reward functions is a time-consuming job, and unreasonable rewards can prevent agents from learning effectively. Unlike previous research, ACRL does not require any handcrafted reward function and only uses the results of air combat as reward signals. Meanwhile, according to Figure 3, ACRL increases the number of wins during training, which indicates that ACRL is a concise and efficient method and that the automatic curriculum learning proposed in this article is vital.
At the end of the training, the averages of wins of ACRL, PPO-WC, PPO-RC, and SAC-C are 28.5, 12.0, 21.2, and 15.4, respectively. The averages of draws of ACRL, PPO-WC, PPO-RC, and SAC-C are 63.3, 86.7, 80.0, and 78.9, respectively. These statistics indicate that ACRL is more effective than the other three methods.
As shown in Figure 5b, Figure 6b and Figure 7d,f, agents can make decisions to attack and then move away, which reflects that they have learned to use missiles after training. This ability is acquired through explorations, and there are no handcrafted reward functions throughout the training process to guide the agent in making such decisions, which indicates the effectiveness of ACRL. Meanwhile, these behaviors are interpretable: If one continues to approach the other after launching its missile, it may be hit by the missile launched by the other. Therefore, during the training process, agents learn to stay away from the others after launching missiles in order to reduce the probability of being defeated.
There are also some limitations: 1. The proposed method only applies to one-on-one air combat, whereas real air combat usually involves many aircraft; therefore, the method is not suitable for multi-agent systems. 2. The aircraft model is a three-degree-of-freedom model and is thus simplified and unrealistic, while more complex and realistic models are computationally more expensive. 3. This paper does not consider aircraft guns, although both missiles and guns are common weapons in air combat.

5. Conclusions

In this article, we propose ACRL to solve maneuver decision-making problems, with the aim of using missiles to hit targets in different situations and avoiding being hit in return. ACRL has several advantages that previous methods do not have: its action spaces are continuous rather than discrete, which makes it more realistic; it can make the missile hit the target instead of merely aiming at it; it does not need any handcrafted reward functions, which makes it more concise and efficient; and it enables the agent to cope with targets in different situations, which indicates its effectiveness, because targets with larger azimuth angles are more difficult to defeat. Moreover, the decisions made by the agents are rational and interpretable; for example, agents learn to attack first and then move away to reduce the probability of being defeated by opponents, without any handcrafted reward functions.
However, we only investigate one-on-one air combat in this article. In the future, we intend to develop algorithms for multi-agent systems, which may contain different kinds of models that are more complex and realistic than the three-degree-of-freedom model.

Author Contributions

Conceptualization, H.Z.; methodology, H.Z.; software, C.H.; validation, H.Z.; formal analysis, H.Z.; investigation, H.Z.; resources, H.Z.; data curation, H.Z.; writing—original draft, Y.W. (Yujie Wei); writing—review and editing, H.Z.; visualization, C.H.; supervision, H.Z.; project administration, H.Z.; funding acquisition, Y.W. (Yuan Wang). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Shaanxi Province (Grant No. 2022JQ-584).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

We thank Jung for his inspiration.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Mohammadzadeh, A.; Sabzalian, M.H.; Zhang, C.; Castillo, O.; Sakthivel, R.; El-Sousy, F.F. Modern Adaptive Fuzzy Control Systems; Springer Nature: Berlin, Germany, 2022; Volume 421. [Google Scholar]
  2. Mohammadazadeh, A.; Sabzalian, M.H.; Castillo, O.; Sakthivel, R.; El-Sousy, F.F.; Mobayen, S. Neural Networks and Learning Algorithms in MATLAB; Springer Nature: Berlin, Germany, 2022. [Google Scholar]
  3. Hu, D.; Yang, R.; Zuo, J.; Zhang, Z.; Wu, J.; Wang, Y. Application of deep reinforcement learning in maneuver planning of beyond-visual-range air combat. IEEE Access 2021, 9, 32282–32297. [Google Scholar] [CrossRef]
  4. Mnih, V.; Kavukcuoglu, K.; Silver, D. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  5. Eloy, G.; David, W.C.; Dzung, T.; Meir, P. A differential game approach for beyond visual range tactics. arXiv 2020, arXiv:2009.10640v1. [Google Scholar]
  6. Fang, X.; Liu, J.; Zhou, D. Background interpolation for on-line situation of capture zone of air-to-air missiles. J. Syst. Eng. Electron. 2019, 41, 1286–1293. [Google Scholar]
  7. Dantas, J.P.A.; Costa, A.N.; Medeiros, F.L.L.; Geraldo, D.; Marcos, R.O.A.M. Supervised Machine Learning for Effective Missile Launch Based on Beyond Visual Range Air Combat Simulations. arXiv 2022, arXiv:2207.04188v1. [Google Scholar]
  8. Fan, Z.; Xu, Y.; Kang, Y.; Luo, D. Air combat maneuver decision method based on A3C deep reinforcement learning. Machines 2022, 10, 1033. [Google Scholar] [CrossRef]
  9. Mnih, V.; Badia, A.P.; Mirza, M. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1928–1937. [Google Scholar]
  10. Wang, Y.; Zhang, X.W.; Zhou, R.; Tang, S.Q.; Zhou, H.; Ding, W. Research on UCAV maneuvering decision method based on heuristic reinforcement learning. Comput. Intell. Neurosci. 2022, 2022, 1477078. [Google Scholar]
  11. Huang, C.; Dong, K.; Huang, H.; Tang, S.; Zhang, Z. Autonomous air combat maneuver decision using Bayesian inference and moving horizon optimization. J. Syst. Eng. Electron. 2018, 29, 86–97. [Google Scholar] [CrossRef]
  12. Pope, A.P.; Ide, J.S.; Micovic, D.; Diaz, H.; Rosenbluth, D. Hierarchical Reinforcement Learning for Air-to-Air Combat. arXiv 2021, arXiv:2105.00990v2. [Google Scholar]
  13. Alex, G.; Marc, G.B.; Jacob, M.; Rémi, M.; Koray, K. Automated curriculum learning for neural networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1131–1320. [Google Scholar]
  14. Matiisen, T.; Oliver, A.; Cohen, T.; Schulman, J. Teacher–Student Curriculum Learning. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 3732–3740. [Google Scholar] [CrossRef]
  15. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  16. William, H.G.; Brandon, H.; Nicholay, T.; Phillip, W.; Cayden, C.; Manuela, V.; Ruslan, S. MineRL: A Large-Scale Dataset of Minecraft Demonstrations. arXiv 2019, arXiv:1907.13440. [Google Scholar]
  17. Zhang, Y.Z.; Abbeel, P.; Pinto, L. Automatic Curriculum Learning through Value Disagreement. In Proceedings of the Advances in Neural Information Processing Systems, Virtual Conference, 6–12 December 2020; pp. 1531–1538. [Google Scholar]
  18. Castells, T.; Weinzaepfel, P.; Revaud, J. SuperLoss: A Generic Loss for Robust Curriculum Learning. In Proceedings of the Advances in Neural Information Processing Systems, Virtual Conference, 6–12 December 2020; pp. 1162–1172. [Google Scholar]
  19. Jiang, L.; Meng, D.Y.; Zhao, Q.; Shan, S.G.; Hauptmann, A. Self-paced curriculum learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 2694–2700. [Google Scholar]
  20. Stretcu, O.; Platanios, E.A.; Mitchell, T.M.; Póczos, B. Coarse-to-fine curriculum learning. arXiv 2021, arXiv:2106.04072. [Google Scholar]
  21. Sukhbaatar, S.; Lin, Z.; Kostrikov, I.; Synnaeve, G.; Szlam, A.; Fergus, R. Intrinsic motivation and automatic curricula via asymmetric self-play. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 1–3 May 2018; pp. 1459–1466. [Google Scholar]
  22. Rane, S. Learning with Curricula for Sparse-Reward Tasks in Deep Reinforcement Learning. Master’s Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, May 2020. [Google Scholar]
  23. Pascal, K.; Yang, H.; Carlo, D.; Joni, P.; Jan, P. Curriculum reinforcement learning via constrained optimal transport. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 2535–2544. [Google Scholar]
  24. Wu, J.; Yevgeniy, V. Robust deep reinforcement learning through bootstrapped opportunistic curriculum. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 1–35. [Google Scholar]
  25. Huang, P.; Xu, M.; Zhu, J.; Shi, L.; Fang, F.; Zhao, D. Curriculum reinforcement learning using optimal transport via gradual domain adaptation. In Proceedings of the Advances in Neural Information Processing Systems, Virtual Conference, 1–9 December 2022; pp. 1182–1211. [Google Scholar]
  26. Williams, P. Three-dimensional aircraft terrain-following via real-time optimal control. J. Guid. Control Dyn. 1990, 13, 1146–1149. [Google Scholar] [CrossRef]
  27. Wang, J.; Ding, D.; Xu, M.; Han, B.; Lei, L. Air-to-air missile launchable area based on target escape maneuver estimation. J. Beijing Univ. Aeronaut. Astronaut. 2019, 45, 722–734. [Google Scholar]
  28. Sutton, R.S.; Mcallester, D.A.; Singh, S.P. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the Advances in Neural Information Processing Systems, Breckenridge, CO, USA, 1–2 December 2000; pp. 1057–1063. [Google Scholar]
  29. Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust region policy optimization. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1889–1897. [Google Scholar]
  30. Schulman, J.; Wolski, F.; Dhariwal, P. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
  31. Adrià, P.B.; Bilal, P.; Steven, K.; Pablo, S.; Alex, V.; Daniel, G.; Charles, B. Agent57: Outperforming the Atari human benchmark. arXiv 2020, arXiv:2003.13350v1. [Google Scholar]
  32. Feryal, B.; Edward, H. Human-timescale adaptation in an open-ended task space. arXiv 2023, arXiv:2301.07608v1. [Google Scholar]
  33. Jin, Y.; Liu, X.; Shao, Y.; Wang, H.; Yang, W. High-speed quadrupedal locomotion by imitation-relaxation reinforcement learning. Nat. Mach. Intell. 2022, 4, 1198–1208. [Google Scholar] [CrossRef]
  34. Silver, D.; Huang, A.; Maddison, C. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489. [Google Scholar] [CrossRef]
  35. Oriol, V.; Igor, B.; Silver, D. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 2019, 575, 350–354. [Google Scholar]
Figure 1. The self-play architecture.
Figure 2. The schematic diagram of air combat maneuver decision-making.
Figure 3. Wins, losses, and draws.
Figure 4. Air combat results of the two untrained agents. (a) The 3D trajectory of simulation 1; (b) Top view of the trajectory of simulation 1; (c) The 3D trajectory of simulation 2; (d) Top view of the trajectory of simulation 2; (e) The 3D trajectory of simulation 3; (f) Top view of the trajectory of simulation 3.
Figure 5. Air combat results of the fortieth agent and the first agent. (a) The 3D trajectory of simulation 1; (b) Top view of the trajectory of simulation 1; (c) The 3D trajectory of simulation 2; (d) Top view of the trajectory of simulation 2; (e) The 3D trajectory of simulation 3; (f) Top view of the trajectory of simulation 3.
Figure 6. Air combat results of the fortieth agent and the fifteenth agent. (a) The 3D trajectory of simulation 1; (b) Top view of the trajectory of simulation 1; (c) The 3D trajectory of simulation 2; (d) Top view of the trajectory of simulation 2; (e) The 3D trajectory of simulation 3; (f) Top view of the trajectory of simulation 3.
Figure 7. Air combat results of the fortieth agent and the thirtieth agent. (a) The 3D trajectory of simulation 1; (b) Top view of the trajectory of simulation 1; (c) The 3D trajectory of simulation 2; (d) Top view of the trajectory of simulation 2; (e) The 3D trajectory of simulation 3; (f) Top view of the trajectory of simulation 3.
Table 1. Air combat state.
State | Symbol | Formula
yaw angle | $\psi$ | $\psi = \psi_0 + \int \frac{g\, n_z \sin\mu}{v\cos\gamma}\, dt$
pitch angle | $\gamma$ | $\gamma = \gamma_0 + \int \frac{g}{v}(n_z\cos\mu - \cos\gamma)\, dt$
velocity | $v$ | $v = v_0 + \int g\,(n_x - \sin\gamma)\, dt$
altitude | $z$ | $z = z_0 + \int v\sin\gamma\, dt$
distance between the two sides | $d$ | $d = \lVert r_1 - r_2 \rVert$
launch missile | $f_1$ | 0 or 1
yaw angle of the missile | $\psi_1$ | $\psi_m = \psi_{m0} + \int \frac{n_{mc}\, g}{v_m\cos\gamma_m}\, dt$
pitch angle of the missile | $\gamma_1$ | $\gamma_m = \gamma_{m0} + \int \frac{n_{mh}\, g - g\cos\gamma_m}{v_m}\, dt$
distance between the missile and the other side | $d_1$ | $d_1 = \lVert r_{m1} - r_2 \rVert$
heading crossing angle | $\beta$ | $\beta = \arccos\!\left(\frac{v_1 \cdot v_2}{\lVert v_1 \rVert\, \lVert v_2 \rVert}\right)$
launch missile from the other side | $f_2$ | 0 or 1
Table 2. Hyperparameters.
Hyperparameter | Value
velocity | (250 m/s, 400 m/s)
batch size | 1024
optimizer | Adam
actor learning rate | 0.0002
critic learning rate | 0.001
actor architecture | (256, 256, 4)
critic architecture | (256, 256, 1)
activation function | tanh
epoch | 6
γ | 0.99
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
