1. Introduction
Air control has become increasingly important in modern warfare. As the latest development in this field, research progress on unmanned aerial vehicles (UAVs) has attracted worldwide attention [1,2]. In terms of attack, new unmanned attack aircraft and multipurpose UAVs employ technologies such as precision guidance, data transmission, and automatic control, making them accurate and powerful. UAVs can choose the time and place of takeoff freely and quickly penetrate to the target from multiple directions simultaneously, making it difficult for the target to mount effective air defense operations. In terms of tactics, UAV tactics are very flexible: a UAV can not only participate directly in an attack but also serve as a decoy in coordination with manned aircraft. With the development of modern artificial intelligence, it is necessary to rapidly build an intelligent air combat system that forms an intelligent, autonomous, integrated air control and air defense combat capability [3,4,5].
In the increasingly complex air combat environment, an advanced intelligent airborne system is needed to generate maneuver commands and guide UAVs through combat maneuvers. Traditional autonomous maneuver decision-making methods for UAV air combat mainly comprise game theory [6,7,8], differential game strategies [9,10], the influence graph method [11], and Bayesian theory [12]. Liu et al. [13] established a threat assessment model for UAV air combat and formulated a target allocation problem to search for the best strategy to accomplish the air combat mission. However, the matrix game method suffers from delayed rewards, and the resulting maneuver decisions cannot be guaranteed to be optimal over the entire air combat process. In [14], air combat was described as the mathematical model of a complete differential game, and a differential game strategy was used to solve it. However, because of real-time computing limitations, differential games cannot adapt to complex environments and can only be applied to models that describe the strategies accurately. In [15], a state prediction influence graph model was built for the maneuver decision-making problem and applied to short-range air combat. However, the influence graph method relies on prior knowledge, which is difficult to exploit in real-time, dynamic air combat.
Researchers have since linked artificial intelligence with the air combat maneuver decision-making process, using artificial intelligence systems to simulate pilots' air combat behavior and to extend their maneuver decision-making abilities. Artificial intelligence methods mainly comprise expert systems, genetic algorithms, artificial immune systems, and neural networks. Among them, establishing the rule base of an expert system is relatively complex, requires constant error correction, and struggles to deal with the complex and varying air combat environment [16,17]. Genetic algorithms can solve decision-making problems in unknown environments by optimizing the maneuvering process [18]. By imitating the biological immune system and evolutionary algorithms, the artificial immune method can automatically generate appropriate maneuvers to counter the threat of the target aircraft in different air combat situations, but its convergence is slow [19]. A neural network is an information processing system that imitates the structure and function of the human brain's neural network; it has excellent self-learning and storage abilities [20]. Given air combat situation inputs, it outputs the corresponding motion commands, but its dependence on learning samples is not conducive to real-time optimization [21].
Compared with other artificial intelligence algorithms, reinforcement learning is a learning method that interacts with the environment to obtain air combat superiority under different action commands [22,23]. By constructing a mapping between environment and action, it seeks the optimal solution through continuous trial and error. The reward obtained from interaction with the environment updates the constructed Q-function, yielding the reward of different actions under different air combat situations, and the best action is then selected in the subsequent decision-making process [24].
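To make this update scheme concrete, the following minimal sketch shows a tabular Q-learning update driven by environment interaction; the state/action discretization and the learning-rate, discount, and exploration parameters are illustrative assumptions, not settings from the cited works.

```python
import numpy as np

# Hypothetical discretization of the air combat situation and maneuver set.
N_STATES, N_ACTIONS = 1000, 7
ALPHA, GAMMA, EPS = 0.1, 0.99, 0.1  # assumed learning rate, discount, exploration

Q = np.zeros((N_STATES, N_ACTIONS))
rng = np.random.default_rng()

def select_action(state):
    """Epsilon-greedy: usually the best known action, occasionally explore."""
    if rng.random() < EPS:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(Q[state]))

def q_update(state, action, reward, next_state, done):
    """Update Q toward the reward plus the discounted best next-state value."""
    target = reward if done else reward + GAMMA * np.max(Q[next_state])
    Q[state, action] += ALPHA * (target - Q[state, action])
```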
In [25], expert experience was introduced to guide the search of the strategy space and to train the Q-function. Because the Q-function is difficult to apply to complex state spaces and is computationally expensive, deep learning and reinforcement learning were combined in [26,27,28]: a neural network replaced the Q-function, and its parameters were updated continuously during training, achieving the same effect as Q-function training.
In [29], motion models of aircraft and missiles were built; through interaction between the aircraft's maneuver decision-making model and the environment, the continuous state space and the reward of each state were obtained, improving the maneuver decision-making ability of the aircraft. However, the deep Q-network (DQN) algorithm cannot output continuous actions, so the agent cannot explore the environment freely. By combining the actor-critic method with the successful experience of DQN, the deep deterministic policy gradient (DDPG) algorithm was obtained [30,31]. Compared with traditional deep reinforcement learning algorithms, it solves continuous problems by outputting continuous actions, so that the behavior strategy of the UAV is continuous and the state space is sufficiently explored [32,33,34,35]. In [36], the disturbance of the agent's state observations was fully considered, and the DDPG algorithm was improved to achieve high robustness. In [37], mixed noise and transfer learning were introduced to improve the self-learning and generalization abilities of the system.
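As a point of reference for the improvements discussed later, a minimal DDPG update step is sketched below in PyTorch; the network sizes, learning rates, and soft-update rate are illustrative assumptions rather than the configurations used in the cited works.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 12, 4   # hypothetical UAV state/control dimensions
GAMMA, TAU = 0.99, 0.005        # assumed discount and target soft-update rate

def mlp(in_dim, out_dim, out_act=None):
    layers = [nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim)]
    return nn.Sequential(*(layers + ([out_act] if out_act else [])))

actor = mlp(STATE_DIM, ACTION_DIM, nn.Tanh())     # deterministic policy mu(s)
critic = mlp(STATE_DIM + ACTION_DIM, 1)           # action-value Q(s, a)
actor_t = mlp(STATE_DIM, ACTION_DIM, nn.Tanh())   # slowly tracking target nets
critic_t = mlp(STATE_DIM + ACTION_DIM, 1)
actor_t.load_state_dict(actor.state_dict())
critic_t.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s2, done):
    """One gradient step on a replayed mini-batch (s, a, r, s2, done)."""
    # Critic: regress Q(s, a) toward the bootstrapped target value.
    with torch.no_grad():
        target = r + GAMMA * (1 - done) * critic_t(torch.cat([s2, actor_t(s2)], 1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], 1)), target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend the critic's value of the actor's own continuous actions.
    actor_loss = -critic(torch.cat([s, actor(s)], 1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Softly move the target networks toward the online networks.
    with torch.no_grad():
        for t, o in zip(actor_t.parameters(), actor.parameters()):
            t.mul_(1 - TAU).add_(TAU * o)
        for t, o in zip(critic_t.parameters(), critic.parameters()):
            t.mul_(1 - TAU).add_(TAU * o)
```

Because the actor outputs a continuous action vector directly, exploration noise can be added to its output at acting time, which is what allows DDPG to handle continuous UAV control commands.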
However, these algorithms use three-degrees-of-freedom UAV models and do not consider the attitude characteristics of the UAV itself. Moreover, because most of the maneuvers used by the target during training are basic maneuvers, the particle swarm optimization radial basis function (PSO-RBF) algorithm [38,39] was used in this study to make the target aircraft generate commands that simulate manual operation in air combat, so that the trained air combat decision module achieves high air combat efficiency. At the same time, the existing DDPG algorithm was improved to accelerate its convergence. Compared with the existing DDPG methods in the literature, the main contributions of this study are as follows.
(1) Unlike studies based on three-degrees-of-freedom models, a six-degrees-of-freedom (6-DOF) UAV model is used, which is conducive to engineering practice.
(2) In constructing the advantage function, in addition to the effects of the angle, speed, height, and distance between the two sides, the stability of the nonlinear UAV and the effect of its orientation in the environment on the air combat situation were comprehensively considered, keeping the UAV stable throughout the training process.
(3) Unlike basic training methods, this study let the target aircraft establish simulated manual operation commands with the PSO-RBF method, so that the UAV fights against a target aircraft that simulates manual operation, which improves the effectiveness of the learning algorithm.
(4) The traditional DDPG algorithm was improved: by propagating the final reward value back to earlier reward functions in a proportion that decreases with time, the effect of each step on the final air combat result is reflected, improving the convergence and computational efficiency of the algorithm (see the illustrative sketch after this list).
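To illustrate the idea in contribution (4), the sketch below redistributes an episode's final reward back over earlier steps with a proportion that decays with distance from the terminal step; the exponential decay coefficient and the additive combination are assumptions made for illustration, not the paper's exact formulation.

```python
import numpy as np

def redistribute_final_reward(step_rewards, final_reward, decay=0.9):
    """Augment per-step rewards with a time-decaying share of the final reward.

    `decay` (assumed) controls how quickly a step's share of the final
    reward shrinks as the step moves further from the end of the episode."""
    rewards = np.asarray(step_rewards, dtype=float)
    n = len(rewards)
    # Step t is (n - 1 - t) steps from the end; later steps receive more.
    shares = decay ** np.arange(n - 1, -1, -1)
    return rewards + final_reward * shares

# Example: a 5-step episode that ends in a win with final reward +10.
print(redistribute_final_reward([0.1, 0.0, -0.2, 0.3, 0.5], 10.0))
```

Feeding these augmented rewards into training lets every transition carry information about the eventual air combat outcome, which is the intended effect on convergence.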
The rest of this manuscript is organized as follows. In Section 2, the problem statement is given; the 6-DOF model, guidance law, and missile model of the UAV are described; and the comprehensive advantage function of the two aircraft is presented. In Section 3, the PSO-RBF algorithm is introduced. In Section 4, the learning process of the FRV-DDPG algorithm in air combat is described. In Section 5, simulations verifying the effectiveness and efficiency of the proposed algorithm are presented. Section 6 concludes this article.