Article

A Dual Aircraft Maneuver Formation Controller for MAV/UAV Based on the Hybrid Intelligent Agent

Luodi Zhao, Yemo Liu, Qiangqiang Peng and Long Zhao
1 School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China
2 Digital Navigation Center, Beihang University, Beijing 100191, China
3 Science and Technology on Aircraft Control Laboratory, Beihang University, Beijing 100191, China
4 Beijing Aerospace Automatic Control Institute, Beijing 100854, China
* Author to whom correspondence should be addressed.
Drones 2023, 7(5), 282; https://doi.org/10.3390/drones7050282
Submission received: 28 February 2023 / Revised: 12 April 2023 / Accepted: 18 April 2023 / Published: 22 April 2023
(This article belongs to the Special Issue Intelligent Coordination of UAV Swarm Systems)

Abstract

This paper proposes a hybrid intelligent agent controller (HIAC) for manned aerial vehicle (MAV)/unmanned aerial vehicle (UAV) formations under the leader–follower control strategy. Based on a high-fidelity three-degrees-of-freedom (DOF) dynamic model of the UAV, the method decouples the multiple-input-multiple-output (MIMO) system into multiple single-input-single-output (SISO) systems. It then combines the deep deterministic policy gradient (DDPG) and the double deep Q network (DDQN) to construct a hybrid reinforcement learning agent model, which generates the onboard desired state commands. Finally, a dynamic inversion control law and a first-order lag filter are adopted to improve the actual flight-control process. Under the working condition of a continuous S-shaped large-overload maneuver of the MAV, simulations verify that the UAV can accurately track the complex trajectory of the MAV. Compared with the traditional linear quadratic regulator (LQR) and DDPG, the HIAC achieves better control efficiency and precision.

1. Introduction

Aiming at increasingly fast-paced and high-intensity air combat, using MAVs as combat leaders with a certain number of UAVs as wingmen to form a hybrid MAV/UAV formation has become the development trend for future air confrontations. Among these, the two-aircraft formation consisting of one MAV and one UAV is one of the most typical combat styles. In MAV/UAV formations, the unmanned system must be able to share information and carry out cooperative operations with the manned system across systematic boundaries [1]. The Fast Lightweight Autonomy (FLA) Program of the Defense Advanced Research Projects Agency (DARPA) has developed advanced algorithms that enable an MAV or a UAV to operate autonomously without a human operator, the Global Positioning System (GPS), or any data resources. DARPA's Lifelong Learning Machines (L2M) Project also aims to develop new machine learning methods that enable unmanned systems to continuously adapt to new environments and remember what they have learned [2]. Meanwhile, the U.S. Air Force's Loyal Wingman Program aims to enhance the autonomy of UAVs and improve their combat capabilities in complex war environments [3]. Moreover, the recently proposed Skyborg program is working on the combination of manned and unmanned combat aerial vehicles. Therefore, improving the capability of autonomous flight control has become an important direction for the development of future UAV technology.
One of the research hotspots of UAV autonomous control is the formation flight-control problem [4]. In terms of the traditional design of formation controllers, Ref. [5] proposed a sliding mode controller for MAV/UAV formation flights based on a layered architecture. However, it made extensive simplifications to the strongly nonlinear dynamic model of MAV/UAVs and was only validated by simulations for flat trajectories. Ref. [6] considered sensor noise and developed a leader–follower formation PID controller for multi-robots, which can achieve better performance in limiting position deviations. Furthermore, Ref. [7] proposed a parallel approach control law for fixed-wing UAV formations under the leader–follower strategy. Ref. [8] adopted a multi-channel decoupling idea that split the MIMO system into multiple SISO systems and used sliding mode control to track the reference trajectory, which can be further applied to formation-control problems. Refs. [9,10] proposed a consensus-based cooperative formation control method for multiple aircraft, but the consensus analysis was highly dependent on the linearized dynamic model, which limited its further application to complex nonlinear dynamic systems. Refs. [11,12] developed formation controllers whose commands were generated independently of the dynamic model, which decreases the control precision in extreme working conditions. Refs. [13,14] considered the confrontation situation and adopted pre-defined maneuver strategy collections, taking typical maneuvers as the basic units and building a collection of maneuver strategies from free combinations of these units. However, due to model uncertainty and the non-cooperative environment, this approach can hardly deal with complex working conditions. Therefore, the intelligent agent method has become a novel research trend because of its weak model dependence and strong capability for strategy exploration. Refs. [15,16] adopted deep neural networks to learn aircraft-maneuvering strategies and made progress in enhancing the autonomous maneuvering capability of UAVs. However, UAV formation control is a high-dimensional dynamical control problem with tightly coupled variables. When traditional neural networks learn such complex behaviors, they suffer from low training efficiency and difficulty in converging stably [17]. Among the novel networks, the double deep Q network (DDQN) algorithm has shown good performance in control problems with discrete action sets by fitting the state–action value functions through neural networks [18,19,20], but it cannot be applied to control problems with continuous variables. Based on the deterministic policy gradient (DPG) algorithm, DeepMind proposed the deep deterministic policy gradient (DDPG) algorithm, which has been shown to perform well on many kinds of continuous control problems [21,22,23]. However, in the field of aircraft control, large variations in the angle-of-attack commands increase the load on the attitude control loop [24]. Meanwhile, for complex tasks with multiple continuous control variables, DDPG suffers from unstable networks and low exploration efficiency [25,26,27,28]. To address this dilemma, some scholars have turned to hybrid reinforcement learning methods in recent years. By adding discrete "meta-actions" to continuous control problems, Ref. [29] partially avoided reinforcement learning traps and improved exploration efficiency.
The experiments verified its superiority over traditional continuous-strategy algorithms in some cases. Ref. [30] proposed the parametrized deep Q-network for hybrid action spaces without approximation or relaxation, which provides a reference for solving the hybrid control problem.
Based on the above analysis, it is obvious that the formation controller must be able to adapt to complex flight conditions in future confrontation situations, e.g., continuous large-overload maneuvers of the MAV. Therefore, inspired by [29], we propose a hybrid reinforcement learning agent controller based on multi-channel decoupling, which can effectively solve the formation-tracking problem under continuous maneuvering conditions. It should be emphasized that when controllers are designed with artificial intelligence methods, especially when a reinforcement learning controller directly generates the underlying flight-control commands, the lack of flight dynamic constraints can easily cause problems: the attitude control system cannot track the commands quickly enough, leading to flight instability. Therefore, this paper introduces a dynamic inversion controller and a first-order lag filter into the hybrid reinforcement learning agent to enhance the smoothness and executability of the control commands.
In summary, the main contributions of this paper are as follows:
(1) A hybrid intelligent agent was designed based on the novel concept of “meta-action” to further enhance formation control performance. The hybrid intelligent agent combined DDPG and DDQN according to the specific formation control targets;
(2) The framework of the HIAC was developed that combined the dynamic inversion controller and the first-order lag filter with the hybrid intelligent agent to effectively overcome the common drawbacks of reinforcement learning;
(3) The superiority of the HIAC method was validated with experiments of nominal conditions. Monte Carlo simulations with different initial conditions were then conducted to verify the adaptability of the HIAC.
The organization of this paper is as follows: Section 2 establishes the UAV dynamic model and the formation-control targets. Section 3 designs the novel formation controller, the HIAC, based on the DDPG/DDQN hybrid intelligent agent; the dynamic inversion controller and first-order lag filter are introduced into the framework of the HIAC as well. Section 4 presents the experiment under nominal conditions and 100 Monte Carlo simulations with varying initial conditions. Finally, Section 5 summarizes the conclusions of this paper.

2. Mathematical Modeling

2.1. UAV Dynamic Model

The main concern in dual aircraft formation flight is the real-time position, velocity, and attitude of the two aircraft, so it is necessary to establish a dynamic model of the UAV according to the forces acting on its center of mass, as shown in Figure 1. To simplify the problem, the flight envelope constraints are ignored.
In the ground inertial coordinate system $oxyz$, $V$ is the UAV flight velocity, and $\gamma$ and $\psi$ are the flight path angle and flight azimuth angle, respectively. The UAV adopts Bank-To-Turn (BTT) flight, which is considered to have no sideslip. $\alpha$ is the angle of attack, and $\sigma$ is the bank angle. The engine thrust and drag of the aircraft are denoted by $T$ and $D$, respectively. $n$ is the normal overload of the UAV in the velocity coordinate system $ox_V y_V z_V$. Ignoring wind disturbance during the flight, the three-degrees-of-freedom dynamic model of the UAV is established as follows [31,32,33]:
$$ H:\;\begin{cases} \dot{x} = V\cos\gamma\sin\psi \\ \dot{y} = V\cos\gamma\cos\psi \\ \dot{z} = V\sin\gamma \\ \dot{V} = (T - D)/m - g\sin\gamma \\ \dot{\gamma} = g\left(n\cos\sigma - \cos\gamma\right)/V \\ \dot{\psi} = g\,n\sin\sigma/\left(V\cos\gamma\right), \end{cases} \tag{1} $$
where $m$ is the mass of the aircraft, which is considered constant in this paper, and $g$ is the local gravitational acceleration.
The engine thrust T can be denoted by
$$ T = \eta\, T_{\max}, \tag{2} $$
where $\eta$ is the throttle setting, whose range is defined as $[0, 1]$, and $T_{\max}$ is the maximum thrust that the engine can achieve.
The air drag D consists of the parasite drag and the induced drag, which can be expressed as follows [31]:
$$ D = C_{DP}\,\rho V^2 S/2 + 2 C_{DI}\, n^2 m^2 g^2/\left(\rho V^2 S\right), \tag{3} $$
where $S$ is the reference area of the UAV. $C_{DP}$ is the parasite drag coefficient. $C_{DI}$ is the induced drag coefficient. $\rho$ is the atmospheric density, which varies with the altitude of the aircraft in the stratosphere. It is calculated by [34]
$$ \rho = \rho_0\, e^{-z/z_0}, \tag{4} $$
where $\rho_0 = 1.225\ \mathrm{kg/m^3}$ and $z_0 = 6700\ \mathrm{m}$.
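For readers who want to experiment with the model, the following Python sketch implements the point-mass dynamics of Equation (1) together with the thrust, drag, and density models of Equations (2)–(4). It is a minimal illustration rather than the authors' simulation code; the parameter values are taken from Table 1, and the unit conversions (lb to N, ft² to m²) are our own assumption, since the table mixes unit systems.

```python
import numpy as np

# Parameters from Table 1 (unit conversions to SI are our assumption).
T_MAX = 25600 * 4.448      # maximum thrust, lb -> N
S     = 400 * 0.0929       # reference area, ft^2 -> m^2
M     = 14470.0            # mass, kg
G     = 9.81               # gravity, m/s^2
C_DP, C_DI = 0.02, 0.1     # parasite / induced drag coefficients
RHO0, Z0 = 1.225, 6700.0   # atmosphere model, Eq. (4)

def air_density(z):
    """Exponential atmosphere, Eq. (4)."""
    return RHO0 * np.exp(-z / Z0)

def drag(z, v, n):
    """Parasite + induced drag, Eq. (3)."""
    rho = air_density(z)
    return 0.5 * C_DP * rho * v**2 * S + 2.0 * C_DI * n**2 * M**2 * G**2 / (rho * v**2 * S)

def dynamics(state, controls):
    """3-DOF point-mass model, Eq. (1). state = [x, y, z, V, gamma, psi], controls = (eta, n, sigma)."""
    x, y, z, v, gam, psi = state
    eta, n, sigma = controls
    thrust = eta * T_MAX                               # Eq. (2)
    d = drag(z, v, n)
    return np.array([
        v * np.cos(gam) * np.sin(psi),                 # x_dot
        v * np.cos(gam) * np.cos(psi),                 # y_dot
        v * np.sin(gam),                               # z_dot
        (thrust - d) / M - G * np.sin(gam),            # V_dot
        G * (n * np.cos(sigma) - np.cos(gam)) / v,     # gamma_dot
        G * n * np.sin(sigma) / (v * np.cos(gam)),     # psi_dot
    ])
```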

2.2. Formation Control Targets

In this paper, the formation control target of the UAV was determined based on the leader–follower formation strategy. Taking a typical dual aircraft formation flight as an example, the formation configuration of the MAV/UAV was designed as shown in Figure 2. Since the reference trajectory of the MAV as the leader aircraft is known, its flight velocity, attitude, and position can be obtained from the sensors mounted on the MAV. The wingman (the UAV) receives the real-time flight data of the MAV through the onboard data link and completes trajectory tracking and formation control autonomously. During the flight, the UAV and MAV are required to keep a specific formation throughout the whole flight, as shown in Figure 2.

2.2.1. Flight Velocity Control Targets

The MAV and UAV keep the same formation flight velocity. The reference velocity of the MAV is $V_L$ and the UAV velocity is $V_W$; then the velocity deviation $\Delta V$ is
$$ \Delta V = V_L - V_W. \tag{5} $$
The MAV and UAV keep the same flight path angle in formation flight. The MAV flight path angle is $\gamma_L$ and the UAV flight path angle is $\gamma_W$; then the flight path angle deviation $\Delta\gamma$ is
$$ \Delta\gamma = \gamma_L - \gamma_W. \tag{6} $$
The MAV and UAV keep the same flight azimuth angle in formation flight. The flight azimuth angle of the MAV is $\psi_L$ and that of the UAV is $\psi_W$; then the flight azimuth angle deviation $\Delta\psi$ is
$$ \Delta\psi = \psi_L - \psi_W. \tag{7} $$
The flight velocity and attitudes of the UAV should be consistent with those of the MAV within an allowable error:
$$ |\Delta V| \le V_{\Delta\max},\quad |\Delta\gamma| \le \gamma_{\Delta\max},\quad |\Delta\psi| \le \psi_{\Delta\max}, \tag{8} $$
where $V_{\Delta\max}$, $\gamma_{\Delta\max}$, and $\psi_{\Delta\max}$ represent the error thresholds of the velocity, flight path angle, and flight azimuth angle of the UAV, respectively.

2.2.2. Flight Distance Control Targets

The UAV is located around the MAV and maintains the specified formation distance. Let $\Delta D$ denote the distance between the MAV and the UAV in the ground inertial coordinate system, and let $\Delta D_x$, $\Delta D_y$, and $\Delta D_z$ denote its components along the three axes; then
$$ \Delta D = \sqrt{\Delta D_x^2 + \Delta D_y^2 + \Delta D_z^2}. \tag{9} $$
In summary, the UAV should keep its distance from the MAV within the safe flight range:
$$ D_{\Delta\min} \le \Delta D \le D_{\Delta\max}, \tag{10} $$
where $D_{\Delta\min}$ and $D_{\Delta\max}$ represent the lower and upper thresholds of the safe distance.
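As a concrete illustration of the control targets in Equations (5)–(10), the short sketch below computes the formation deviations and checks them against the thresholds listed in Table 1. The function and dictionary names are illustrative assumptions, not part of the paper.

```python
import numpy as np

# Thresholds from Table 1.
V_MAX_ERR, GAMMA_MAX_ERR, PSI_MAX_ERR = 50.0, 0.2, 0.2   # m/s, rad, rad
D_MIN, D_MAX = 100.0, 600.0                              # m

def formation_errors(leader, wingman):
    """leader/wingman: dicts with position p (3-vector), V, gamma, psi."""
    dV     = leader["V"] - wingman["V"]                  # Eq. (5)
    dgamma = leader["gamma"] - wingman["gamma"]          # Eq. (6)
    dpsi   = leader["psi"] - wingman["psi"]              # Eq. (7)
    dD     = np.linalg.norm(leader["p"] - wingman["p"])  # Eq. (9)
    return dV, dgamma, dpsi, dD

def targets_met(dV, dgamma, dpsi, dD):
    """Eqs. (8) and (10): state errors within thresholds, distance inside the safe band."""
    return (abs(dV) <= V_MAX_ERR and abs(dgamma) <= GAMMA_MAX_ERR
            and abs(dpsi) <= PSI_MAX_ERR and D_MIN <= dD <= D_MAX)
```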

3. Design of the HIAC

The HIAC first adopts a DDPG/DDQN hybrid reinforcement learning method to train the agent model that generates the tracking commands. A dynamic inversion controller and a first-order lag filter are then designed to construct an improved formation flight controller. Overall, the HIAC consists of three parts: the desired state command solver, the dynamic inversion controller, and the first-order lag filter. The framework of the HIAC is shown in Figure 3.
To track the MAV, the HIAC takes the current deviations between the states of the UAV and those of the MAV, i.e., $\Delta V$, $\Delta\gamma$, and $\Delta\psi$, as inputs, and outputs the control commands of the thrust, normal overload, and bank angle, i.e., $\eta$, $n$, and $\sigma$. The main difference between the HIAC and traditional controllers is that, to further enhance control accuracy, the HIAC adopts a hybrid intelligent agent as the desired command solver to generate the desired commands $V_c$, $\gamma_c$, and $\psi_c$. These commands are then sent to the dynamic inversion controller to generate the control commands $\eta_c$, $n_c$, and $\sigma_c$. Finally, the first-order lag filter smooths $\eta_c$, $n_c$, and $\sigma_c$ to improve the executability of these commands. The three parts are introduced in detail below in the order of the information flow.

3.1. Desired Command Solver

Learning from the idea of "meta-action", we partially discretized the control variables in the continuous control problem and developed a continuous–discrete mixed action space according to the characteristics of these control variables. Based on this process, we constructed a hybrid intelligent agent based on DDPG and DDQN to control $V$, $\gamma$, $\psi$, and $D$ of the UAV.

3.1.1. Framework of Hybrid Intelligent Agent Based on DDPG/DDQN

Based on the traditional Q-learning algorithm, DDQN uses a neural network to fit the value function. It adopts discrete action sets to define the strategy and evaluates the Q value of the generated strategy through the Critic network. Compared with the traditional DQN algorithm [18,19,20], DDQN decouples the action selection from the calculation of the Q value, which solves the Q-value overestimation problem of the traditional methods.
DDPG adopts an Actor–Critic architecture based on DQN and uses continuous action sets to define the control strategy. The model consists of the Actor–Critic network, where the Critic evaluates the actions generated by the Actor and feeds the evaluation back to the Actor for policy optimization [23]. More proofs and conclusions on DDQN and DDPG can be found in [18,23], respectively.
However, DDQN and DDPG suffer from different drawbacks when applied in practical engineering. Although DDQN converges more easily than DDPG, it can only deal with discrete and low-dimensional action spaces, whereas most practical targets, especially physical control targets, have continuous and high-dimensional action spaces. Moreover, even though a continuous space can be transformed into a discrete one, DDQN then has to handle a high-dimensional action space, which leads to quite low computational efficiency. Meanwhile, although DDPG can handle continuous and high-dimensional action spaces, it is more likely to diverge than DDQN. Therefore, learning from "meta-action", we propose a hybrid intelligent agent that combines DDQN and DDPG according to their complementary characteristics. Considering the value ranges and the control precision of $V$, $\gamma$, and $\psi$, we adopted the idea of multi-channel decoupling to perform partial discretization of the action space. For the velocity control agent $V_c$, DDPG is used to generate the set of continuous state commands, because the value range of $V_c$ is larger than those of $\gamma_c$ and $\psi_c$, and discretizing its action space with high precision would lead to a dimension explosion. Meanwhile, for the angle control agents $\gamma_c$ and $\psi_c$, DDQN is used to generate the sets of discretized state commands. Combining DDQN and DDPG improves the convergence capability when these agents are trained together.
The framework of the desired command solver was designed as shown in Figure 4. It includes three agents, which generate the corrections of the state commands $V_c$, $\gamma_c$, and $\psi_c$, respectively. Based on the decoupling between the channels, each agent calculates its action $A_V$, $A_\gamma$, or $A_\psi$ and updates the corresponding desired state command. The outputs are executed by the flight-control system of the UAV, which feeds back the reward of each agent. The total reward function $R$ is expressed by
$$ R = R_{D,V} + R_\gamma + R_\psi, \tag{11} $$
where $R_{D,V}$, $R_\gamma$, and $R_\psi$ are the components of $R$ for each agent.
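The interaction loop implied by Figure 4 and Equation (11) could look roughly as follows. The agent objects, the command dictionary, and the environment interface are illustrative assumptions; in the paper, the environment step corresponds to the dynamic inversion controller, lag filter, and UAV dynamics described in the rest of this section.

```python
def hiac_solver_step(env, state, commands, agents):
    """One step of the desired-command solver: each decoupled agent corrects its own channel."""
    v_agent, gamma_agent, psi_agent = agents
    a_v     = v_agent.act(state)        # continuous correction from DDPG
    a_gamma = gamma_agent.act(state)    # discrete correction from DDQN
    a_psi   = psi_agent.act(state)      # discrete correction from DDQN

    # Update the desired state commands, Eq. (18).
    commands["V_c"]     += a_v
    commands["gamma_c"] += a_gamma
    commands["psi_c"]   += a_psi

    # The flight-control system (dynamic inversion + lag filter + dynamics) executes the commands
    # and feeds back the per-agent rewards.
    next_state, (r_dv, r_gamma, r_psi) = env.step(commands)

    # Total reward, Eq. (11).
    total_reward = r_dv + r_gamma + r_psi
    return next_state, total_reward
```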
To construct an intelligent agent based on the DDPG/DDQN hybrid reinforcement learning network, it was necessary to transform the trajectory tracking problem into a Markov decision process, which mainly includes three parts, i.e., the state space, the action space, and the reward function.

3.1.2. State Space S

According to the targets of formation flight control, the state space S is designed as follows:
$$ S = \left[\Delta V,\ \Delta D,\ \Delta\dot{V},\ \Delta\gamma,\ \textstyle\int\Delta\gamma\,dt,\ \Delta\dot{\gamma},\ \Delta\psi,\ \textstyle\int\Delta\psi\,dt,\ \Delta\dot{\psi}\right], \tag{12} $$
where $\Delta D$ equals
$$ \Delta D = \Delta D_0 + \int\Delta V\,dt, \tag{13} $$
where $\Delta D_0$ is the flight distance deviation between the MAV and the UAV at the initial epoch. The integral terms $\int\Delta V\,dt$, $\int\Delta\gamma\,dt$, and $\int\Delta\psi\,dt$ are the cumulative deviations from the initial epoch to the current epoch. $\Delta\dot{V}$, $\Delta\dot{\gamma}$, and $\Delta\dot{\psi}$ are the deviation rates of the velocity, flight path angle, and flight azimuth angle.

3.1.3. Action Space A

The action space A is defined as follows:
$$ A = \left[A_V,\ A_\gamma,\ A_\psi\right], \tag{14} $$
where the action $A_V$ denotes the correction value of the UAV velocity command $\Delta V_c$, the action $A_\gamma$ denotes the correction value of the UAV flight path angle command $\Delta\gamma_c$, and the action $A_\psi$ denotes the correction value of the UAV flight azimuth angle command $\Delta\psi_c$, i.e.,
$$ \Delta V_c = A_V,\quad \Delta\gamma_c = A_\gamma,\quad \Delta\psi_c = A_\psi;\qquad |\Delta V_c| \le \lambda_{V_c\max},\quad |\Delta\gamma_c| \le \lambda_{\gamma_c\max},\quad |\Delta\psi_c| \le \lambda_{\psi_c\max}, \tag{15} $$
where $\lambda_{V_c\max}$, $\lambda_{\gamma_c\max}$, and $\lambda_{\psi_c\max}$ are the maximum corrections, respectively. $A_V$ is used to generate the set of continuous velocity commands, while $A_\gamma$ and $A_\psi$ are used to generate the sets of discretized angle commands. Specifically, the discretization can be further expressed as follows:
$$ \Omega_{\gamma_c} = 2\lambda_{\gamma_c\max}/\partial\gamma_c + 1,\qquad \Omega_{\psi_c} = 2\lambda_{\psi_c\max}/\partial\psi_c + 1, \tag{16} $$
$$ A_\gamma \in \left\{0,\ \pm\partial\gamma_c,\ \pm 2\partial\gamma_c,\ \ldots,\ \pm\left(\Omega_{\gamma_c}-1\right)\partial\gamma_c/2 = \pm\lambda_{\gamma_c\max}\right\},\qquad A_\psi \in \left\{0,\ \pm\partial\psi_c,\ \pm 2\partial\psi_c,\ \ldots,\ \pm\left(\Omega_{\psi_c}-1\right)\partial\psi_c/2 = \pm\lambda_{\psi_c\max}\right\}, \tag{17} $$
where $\partial\gamma_c$ and $\partial\psi_c$ are the discretization steps of the angle commands (see Table 1), and $\Omega_{\gamma_c}$ and $\Omega_{\psi_c}$ are the resulting numbers of discrete actions.
Then, the update of desired state commands is
$$ V_c \leftarrow V_c + \Delta V_c,\quad \gamma_c \leftarrow \gamma_c + \Delta\gamma_c,\quad \psi_c \leftarrow \psi_c + \Delta\psi_c. \tag{18} $$
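A minimal sketch of how the bounded continuous action of Equation (15) and the discretized angle action sets of Equations (16) and (17) could be constructed, using the correction limits and the π/180 rad step from Table 1; the helper names are ours, and the exact discretization used by the authors may differ in detail.

```python
import numpy as np

LAMBDA_V_MAX = 50.0                 # max velocity correction, m/s (Table 1)
LAMBDA_ANGLE_MAX = np.pi / 2        # max angle correction, rad (Table 1)
ANGLE_STEP = np.pi / 180            # angle discretization step, rad (Table 1)

def discrete_angle_actions(step=ANGLE_STEP, limit=LAMBDA_ANGLE_MAX):
    """Discrete action set for the DDQN angle agents, Eqs. (16)-(17)."""
    levels = int(round(2 * limit / step)) + 1          # Omega in Eq. (16); 181 here
    k = np.arange(-(levels - 1) // 2, (levels - 1) // 2 + 1)
    return k * step                                    # {0, ±step, ±2·step, ..., ±limit}

def clip_velocity_action(a_v, limit=LAMBDA_V_MAX):
    """Continuous DDPG output bounded as in Eq. (15)."""
    return float(np.clip(a_v, -limit, limit))

A_gamma = discrete_angle_actions()   # candidate flight-path-angle corrections
A_psi = discrete_angle_actions()     # candidate flight-azimuth-angle corrections
```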

3.1.4. Reward Function R

According to the formation control targets of UAVs, the reward function R was designed as follows:
$$ R = R_P + R_N + R_C, \tag{19} $$
where $R_P$ is the reward sub-function, which gives a positive response when the flight state of the UAV meets the control targets; $R_N$ is the penalty sub-function, which gives a negative response when the flight states exceed the allowable error of the control targets; and $R_C$ is the command-limiting function, which limits the values of the control commands $\eta_c$, $n_c$, $\sigma_c$. More specifically, $R_C$ smooths the variation of the control commands and thereby reduces energy consumption.
R P is calculated by
$$ R_P = 10\times\left(\varepsilon_D^2 + \varepsilon_V^2 + \varepsilon_\gamma^2 + \varepsilon_\psi^2\right), \tag{20} $$
where $\varepsilon_D$, $\varepsilon_V$, $\varepsilon_\gamma$, and $\varepsilon_\psi$ are reward coefficients, which are defined as follows:
$$ \varepsilon_D = \begin{cases} 1, & D_{\Delta\min} \le \Delta D \le D_{\Delta\max} \\ 0, & \Delta D < D_{\Delta\min}\ \text{or}\ \Delta D > D_{\Delta\max} \end{cases},\qquad \varepsilon_V = \begin{cases} 1 - |\Delta V|/V_{\Delta\max}, & |\Delta V| \le V_{\Delta\max} \\ 0, & |\Delta V| > V_{\Delta\max} \end{cases}, $$
$$ \varepsilon_\gamma = \begin{cases} 1 - |\Delta\gamma|/\gamma_{\Delta\max}, & |\Delta\gamma| < \gamma_{\Delta\max} \\ 0, & |\Delta\gamma| \ge \gamma_{\Delta\max} \end{cases},\qquad \varepsilon_\psi = \begin{cases} 1 - |\Delta\psi|/\psi_{\Delta\max}, & |\Delta\psi| < \psi_{\Delta\max} \\ 0, & |\Delta\psi| \ge \psi_{\Delta\max} \end{cases}. \tag{21} $$
R N is calculated by
$$ R_N = -100\times\left(e_D^2 + e_V^2 + e_\gamma^2 + e_\psi^2\right), \tag{22} $$
where $e_D$, $e_V$, $e_\gamma$, and $e_\psi$ are penalty coefficients, which are defined as follows:
$$ e_D = \begin{cases} 1, & \Delta D < D_{\Delta\min}\ \text{or}\ \Delta D > D_{\Delta\max} \\ 0, & D_{\Delta\min} \le \Delta D \le D_{\Delta\max} \end{cases},\qquad e_V = \begin{cases} 1, & |\Delta V| > 2V_{\Delta\max} \\ |\Delta V|/V_{\Delta\max} - 1, & V_{\Delta\max} \le |\Delta V| \le 2V_{\Delta\max} \\ 0, & |\Delta V| < V_{\Delta\max} \end{cases}, $$
$$ e_\gamma = \begin{cases} 1, & |\Delta\gamma| > 2\gamma_{\Delta\max} \\ |\Delta\gamma|/\gamma_{\Delta\max} - 1, & \gamma_{\Delta\max} \le |\Delta\gamma| \le 2\gamma_{\Delta\max} \\ 0, & |\Delta\gamma| < \gamma_{\Delta\max} \end{cases},\qquad e_\psi = \begin{cases} 1, & |\Delta\psi| > 2\psi_{\Delta\max} \\ |\Delta\psi|/\psi_{\Delta\max} - 1, & \psi_{\Delta\max} \le |\Delta\psi| \le 2\psi_{\Delta\max} \\ 0, & |\Delta\psi| < \psi_{\Delta\max} \end{cases}. \tag{23} $$
R C is calculated by
$$ R_C = -0.2\times\left(|\eta_c|/\eta_{\max} + |n_c|/n_{\max} + |\sigma_c|/\sigma_{\max}\right). \tag{24} $$
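The composite reward of Equations (19)–(24) can be coded as in the sketch below. We write the penalty term R_N and the command-limiting term R_C with negative signs, since the text describes them as a negative response and a limiter; this sign convention, the absolute values, and all helper names are our reading of the source rather than a verbatim reproduction.

```python
import numpy as np

def ramp(err, thresh):
    """Shared shape of the reward/penalty coefficients in Eqs. (21) and (23)."""
    e = abs(err) / thresh
    reward = max(0.0, 1.0 - e)                  # 1 at zero error, 0 beyond one threshold
    penalty = min(1.0, max(0.0, e - 1.0))       # 0 within one threshold, 1 beyond two thresholds
    return reward, penalty

def hiac_reward(dD, dV, dgamma, dpsi, eta_c, n_c, sigma_c,
                d_min=100.0, d_max=600.0, v_max=50.0, g_max=0.2, p_max=0.2,
                eta_max=1.0, n_max=6.0, sigma_max=np.pi / 2):
    eps_d = 1.0 if d_min <= dD <= d_max else 0.0
    e_d = 1.0 - eps_d
    eps_v, e_v = ramp(dV, v_max)
    eps_g, e_g = ramp(dgamma, g_max)
    eps_p, e_p = ramp(dpsi, p_max)

    r_pos = 10.0 * (eps_d**2 + eps_v**2 + eps_g**2 + eps_p**2)       # Eq. (20)
    r_neg = -100.0 * (e_d**2 + e_v**2 + e_g**2 + e_p**2)             # Eq. (22), negative response
    r_cmd = -0.2 * (abs(eta_c) / eta_max + abs(n_c) / n_max
                    + abs(sigma_c) / sigma_max)                       # Eq. (24), command limiter
    return r_pos + r_neg + r_cmd                                      # Eq. (19)
```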

3.2. Dynamic Inversion Controller

To realize tracking of the commands of a given flight trajectory, the dynamic inversion control law was designed as follows [35]
$$ \dot{V}_c = \varpi_V\left(V_c - V\right),\quad \dot{\gamma}_c = \varpi_\gamma\left(\gamma_c - \gamma\right),\quad \dot{\psi}_c = \varpi_\psi\left(\psi_c - \psi\right), \tag{25} $$
where $\varpi_V$, $\varpi_\gamma$, and $\varpi_\psi$ denote the bandwidths of the controller channels, respectively, and $V_c$, $\gamma_c$, $\psi_c$ denote the desired state commands of the flight velocity, the flight path angle, and the flight azimuth angle, respectively.
Since the UAV must obey the dynamic constraints given by Equation (1), combining Equations (1), (2), and (25) yields
$$ T_c = \eta_c T_{\max} = D + m\varpi_V\left(V_c - V\right) + mg\sin\gamma,\qquad N_\gamma = \varpi_\gamma V\left(\gamma_c - \gamma\right)/g + \cos\gamma,\qquad N_\psi = \varpi_\psi V\left(\psi_c - \psi\right)\cos\gamma/g, \tag{26} $$
where $N_\gamma$ and $N_\psi$ denote the normal overload and lateral overload, respectively. The throttle $\eta_c$, normal overload $n_c$, and bank angle $\sigma_c$ are selected as the control commands. Then, the UAV control commands were designed as follows:
$$ F:\;\begin{cases} \eta_c = \left[D + m\varpi_V\left(V_c - V\right) + mg\sin\gamma\right]/T_{\max} \\ n_c = \sqrt{N_\gamma^2 + N_\psi^2} \\ \sigma_c = \arctan\left(N_\psi/N_\gamma\right). \end{cases} \tag{27} $$
Moreover, the control command must satisfy the constraints:
$$ \eta_{\min} \le \eta_c \le \eta_{\max},\quad 0 \le n_c \le n_{\max},\quad |\sigma_c| \le \sigma_{\max}, \tag{28} $$
where $\eta_{\min}$ and $\eta_{\max}$ are the minimum and maximum values of the throttle command, respectively, $n_{\max}$ is the maximum value of the normal overload, and $\sigma_{\max}$ is the maximum value of the bank angle.
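A sketch of the dynamic inversion law of Equations (26)–(28). The bandwidths and command limits default to the values in Table 1; drag_fn stands for the drag model of Equation (3) (for example, the drag function from the dynamics sketch in Section 2.1), and the mass, gravity, and maximum-thrust constants repeat the Table 1 values with our assumed unit conversion.

```python
import numpy as np

M, G, T_MAX = 14470.0, 9.81, 25600 * 4.448   # mass (kg), gravity (m/s^2), max thrust (lb -> N, our conversion)

def dynamic_inversion(desired_cmd, state, n_current, drag_fn,
                      w_v=0.3, w_gamma=0.2, w_psi=0.2,
                      eta_lim=(0.0, 1.0), n_max=6.0, sigma_max=np.pi / 2):
    """Map desired state commands (V_c, gamma_c, psi_c) to control commands (eta_c, n_c, sigma_c)."""
    v_c, gamma_c, psi_c = desired_cmd
    _, _, z, v, gam, psi = state   # state = [x, y, z, V, gamma, psi]

    # Required thrust and overloads, Eq. (26); drag_fn(z, V, n) is the drag model of Eq. (3).
    t_c = drag_fn(z, v, n_current) + M * w_v * (v_c - v) + M * G * np.sin(gam)
    n_gamma = w_gamma * v * (gamma_c - gam) / G + np.cos(gam)
    n_psi = w_psi * v * (psi_c - psi) * np.cos(gam) / G

    # Control commands, Eq. (27), saturated according to Eq. (28).
    eta_c = float(np.clip(t_c / T_MAX, *eta_lim))
    n_c = float(np.clip(np.hypot(n_gamma, n_psi), 0.0, n_max))
    sigma_c = float(np.clip(np.arctan2(n_psi, n_gamma), -sigma_max, sigma_max))
    return eta_c, n_c, sigma_c
```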

3.3. First-Order Lag Filter

Considering the fact that the UAV cannot instantly change the engine thrust, normal overload, and bank angle, a first-order lag filter model was constructed to simulate the delayed variation of these three variables:
$$ G:\;\begin{cases} \dot{\eta} = \left(\eta_c - \eta\right)/\tau_\delta \\ \dot{n} = \left(n_c - n\right)/\tau_n \\ \dot{\sigma} = \left(\sigma_c - \sigma\right)/\tau_\sigma, \end{cases} \tag{29} $$
where $\eta_c$, $n_c$, $\sigma_c$ represent the control commands of the throttle, normal overload, and bank angle, respectively, and $\tau_\delta$, $\tau_n$, and $\tau_\sigma$ represent the corresponding response times of the UAV control system.
In summary, considering Equations (1), (27), and (29), the UAV flight process can be represented by the following control equations:
$$ \begin{cases} F\left(\left[V_c,\ \gamma_c,\ \psi_c\right]^T\right) = \left[\eta_c,\ n_c,\ \sigma_c\right]^T \\ G\left(\left[\eta_c,\ n_c,\ \sigma_c\right]^T\right) = \left[\dot{\eta},\ \dot{n},\ \dot{\sigma}\right]^T \\ H\left(\left[\eta,\ n,\ \sigma\right]^T\right) = \left[V,\ \gamma,\ \psi\right]^T. \end{cases} \tag{30} $$
Equation (30) reveals the calculation process from the desired state commands to the actual flight states. It is clear that the premise for realizing the formation flight is to acquire the desired state commands of the UAV, $V_c$, $\gamma_c$, $\psi_c$, under the specific formation strategy. Then, the ultimate flight trajectory can be obtained by the Runge–Kutta method.
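The first-order lag of Equation (29) and the command chain of Equation (30) can then be simulated step by step, as sketched below. The functions dynamic_inversion and dynamics (and the drag model drag_fn) refer to the earlier sketches; a forward Euler step replaces the Runge–Kutta integration used in the paper purely for brevity.

```python
def lag_filter_step(actual, command, tau, dt=0.1):
    """One first-order lag update, Eq. (29): the actual value chases the command with time constant tau."""
    return actual + (command - actual) / tau * dt

def closed_loop_step(state, controls, desired_cmd, drag_fn, dt=0.1, taus=(0.6, 0.5, 0.5)):
    """One pass through Eq. (30): F (dynamic inversion) -> G (lag filter) -> H (dynamics)."""
    eta, n, sigma = controls
    eta_c, n_c, sigma_c = dynamic_inversion(desired_cmd, state, n, drag_fn)   # F, Eq. (27)
    eta = lag_filter_step(eta, eta_c, taus[0], dt)                            # G, Eq. (29)
    n = lag_filter_step(n, n_c, taus[1], dt)
    sigma = lag_filter_step(sigma, sigma_c, taus[2], dt)
    state = state + dt * dynamics(state, (eta, n, sigma))                     # H, Eq. (1); Euler step for brevity
    return state, (eta, n, sigma)
```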

4. Simulation Validation

4.1. Simulation Design

Based on the 3-DOF dynamic model in this paper, the MAV was designed to perform a complex maneuver and provide the reference trajectory and control commands accordingly. Under the leader–follower formation strategy, the UAV adopts the HIAC, DDPG, and LQR, respectively, to track the MAV and keep the dual aircraft formation. LQR is a commonly used guidance method for tracking multi-state trajectories in aerospace engineering and has been validated by extensive flight tests [36,37]. Therefore, we compare the proposed method with LQR and DDPG to verify its superiority in Section 4.3 and Section 4.4. The design of the DDPG baseline is described in Section 3.1.
First, the experiment under nominal conditions was conducted to analyze the superiority of the proposed method in detail. Meanwhile, since the initial values greatly affect the performance of reinforcement learning models, the generalization ability of the model needed to be fully verified. Therefore, 100 Monte Carlo experiments were then conducted to verify the adaptability of this method to different initial conditions.
The simulations were conducted in MATLAB 2021a and the 3-DOF dynamic model was built in Simulink. The total simulation time was $T$, the simulation interval was $\Delta T$, and the specific experimental parameters are shown in Table 1.
The training methods of DDPG and DDQN follow [23] and [18], respectively. The learning rate, maximum episode number, discount factor, and experience buffer length were set to the same values for both DDPG and DDQN. In addition, the batch size of DDPG was set to 256, and the batch size of DDQN was set to 64. The specific parameters are shown in Table 2.

4.2. Basic Principles of LQR

The implementation of LQR mainly includes three parts: linearization of the motion model, design of the tracking controller for the reference trajectory, and solution of the feedback gain matrix.
By linearizing the dynamic model of the UAV in Equation (1) with small deviations, the linear system can be obtained as follows:
$$ \dot{X} = AX + Bu. \tag{31} $$
Equation (31) can be expressed by
$$ \begin{bmatrix} \delta\dot{x} \\ \delta\dot{y} \\ \delta\dot{z} \\ \delta\dot{V} \\ \delta\dot{\gamma} \\ \delta\dot{\psi} \end{bmatrix} = \begin{bmatrix} A_{11} & A_{12} & A_{13} & A_{14} & A_{15} & A_{16} \\ A_{21} & A_{22} & A_{23} & A_{24} & A_{25} & A_{26} \\ A_{31} & A_{32} & A_{33} & A_{34} & A_{35} & A_{36} \\ A_{41} & A_{42} & A_{43} & A_{44} & A_{45} & A_{46} \\ A_{51} & A_{52} & A_{53} & A_{54} & A_{55} & A_{56} \\ A_{61} & A_{62} & A_{63} & A_{64} & A_{65} & A_{66} \end{bmatrix} \begin{bmatrix} \delta x \\ \delta y \\ \delta z \\ \delta V \\ \delta\gamma \\ \delta\psi \end{bmatrix} + \begin{bmatrix} B_{11} & B_{12} & B_{13} \\ B_{21} & B_{22} & B_{23} \\ B_{31} & B_{32} & B_{33} \\ B_{41} & B_{42} & B_{43} \\ B_{51} & B_{52} & B_{53} \\ B_{61} & B_{62} & B_{63} \end{bmatrix} \begin{bmatrix} \delta\eta \\ \delta n \\ \delta\sigma \end{bmatrix}. \tag{32} $$
Taking the given MAV trajectory as the reference, the state variables are defined as follows:
$$ \delta x = x_W - x_L,\quad \delta y = y_W - y_L,\quad \delta z = z_W - z_L,\quad \delta V = V_W - V_L,\quad \delta\gamma = \gamma_W - \gamma_L,\quad \delta\psi = \psi_W - \psi_L. \tag{33} $$
The control commands are defined as follows:
$$ \delta\eta = \eta - \eta_L,\quad \delta n = n - n_L,\quad \delta\sigma = \sigma - \sigma_L, \tag{34} $$
where $A$ and $B$ are the partial derivative coefficient matrices calculated from the motion differential equations at the feature points of the reference trajectory. The calculation results are as follows:
$$ \begin{aligned} & A_{11} = A_{12} = A_{13} = 0,\quad A_{14} = \cos\gamma\sin\psi,\quad A_{15} = -V\sin\gamma\sin\psi,\quad A_{16} = V\cos\gamma\cos\psi, \\ & A_{21} = A_{22} = A_{23} = 0,\quad A_{24} = \cos\gamma\cos\psi,\quad A_{25} = -V\sin\gamma\cos\psi,\quad A_{26} = -V\cos\gamma\sin\psi, \\ & A_{31} = A_{32} = A_{33} = A_{36} = 0,\quad A_{34} = \sin\gamma,\quad A_{35} = V\cos\gamma, \\ & A_{41} = A_{42} = A_{46} = 0,\quad A_{43} = -D_z/m,\quad A_{44} = -D_V/m,\quad A_{45} = -g\cos\gamma, \\ & A_{51} = A_{52} = A_{53} = A_{56} = 0,\quad A_{54} = -g\left(n\cos\sigma - \cos\gamma\right)/V^2,\quad A_{55} = g\sin\gamma/V, \\ & A_{61} = A_{62} = A_{63} = A_{66} = 0,\quad A_{64} = -g\,n\sin\sigma/\left(V^2\cos\gamma\right),\quad A_{65} = g\,n\sin\sigma\sin\gamma/\left(V\cos^2\gamma\right), \\ & B_{11} = B_{12} = B_{13} = B_{21} = B_{22} = B_{23} = B_{31} = B_{32} = B_{33} = 0, \\ & B_{41} = T_{\max}/m,\quad B_{42} = B_{43} = 0,\quad B_{51} = -D_n/m,\quad B_{52} = g\cos\sigma/V,\quad B_{53} = -g\,n\sin\sigma, \\ & B_{61} = 0,\quad B_{62} = g\sin\sigma/\left(V\cos\gamma\right),\quad B_{63} = g\,n\cos\sigma/\left(V\cos\gamma\right), \end{aligned} \tag{35} $$
where $D_z$, $D_V$, and $D_n$ are the partial derivatives of the drag $D$ with respect to the flight height $z$, velocity $V$, and normal overload $n$, respectively, evaluated at the feature points of the reference trajectory. Define the optimal control performance index from $t_0$ to $t_f$ as follows:
$$ J = 0.5\int_{t_0}^{t_f}\left[X^T(t)\,Q\,X(t) + u^T(t)\,R\,u(t)\right]dt, \tag{36} $$
where $Q$ and $R$ are the weight matrices of the state and the control, respectively; $Q$ is positive semi-definite and $R$ is positive-definite. Then, there exists an optimal control law $u^* = -K^*X$ that minimizes the above performance index, and the feedback gain matrix $K^*$ is
$$ K^* = \begin{bmatrix} K_{\eta 1} & K_{\eta 2} & K_{\eta 3} & K_{\eta 4} & K_{\eta 5} & K_{\eta 6} \\ K_{n1} & K_{n2} & K_{n3} & K_{n4} & K_{n5} & K_{n6} \\ K_{\sigma 1} & K_{\sigma 2} & K_{\sigma 3} & K_{\sigma 4} & K_{\sigma 5} & K_{\sigma 6} \end{bmatrix}, \tag{37} $$
$$ K^* = R^{-1}B^T P, \tag{38} $$
where P is the solution of the Riccati equation. It is calculated by
$$ PA + A^T P - PBR^{-1}B^T P + Q = 0. \tag{39} $$
Define Q and R as follows:
$$ Q = \mathrm{diag}\left(Q_1, Q_2, Q_3, Q_4, Q_5, Q_6\right),\qquad R = \mathrm{diag}\left(R_1, R_2, R_3\right). \tag{40} $$
To reflect the impact of the relative flight distance in the dual aircraft formation flight, set $Q_1 = Q_2 = Q_3$ and define $\delta D^2 = \Delta D^2 = \delta x^2 + \delta y^2 + \delta z^2$; then
$$ J = 0.5\int_{t_0}^{t_f}\left(Q_1\delta D^2 + Q_4\delta V^2 + Q_5\delta\gamma^2 + Q_6\delta\psi^2 + R_1\delta\eta^2 + R_2\delta n^2 + R_3\delta\sigma^2\right)dt. \tag{41} $$
According to Bryson's rule [38], $Q$ and $R$ are set as follows:
$$ Q_1 D_{\Delta\max}^2 = Q_4 V_{\Delta\max}^2 = Q_5\gamma_{\Delta\max}^2 = Q_6\psi_{\Delta\max}^2 = R_1\eta_{\max}^2 = R_2 n_{\max}^2 = R_3\sigma_{\max}^2. \tag{42} $$
Setting $Q_1 = 1$, the other weights can be obtained. According to $u^*$, the control commands are obtained as:
$$ \begin{aligned} \eta &= \eta_L - \left(K_{\eta 1}\delta x + K_{\eta 2}\delta y + K_{\eta 3}\delta z + K_{\eta 4}\delta V + K_{\eta 5}\delta\gamma + K_{\eta 6}\delta\psi\right), \\ n &= n_L - \left(K_{n1}\delta x + K_{n2}\delta y + K_{n3}\delta z + K_{n4}\delta V + K_{n5}\delta\gamma + K_{n6}\delta\psi\right), \\ \sigma &= \sigma_L - \left(K_{\sigma 1}\delta x + K_{\sigma 2}\delta y + K_{\sigma 3}\delta z + K_{\sigma 4}\delta V + K_{\sigma 5}\delta\gamma + K_{\sigma 6}\delta\psi\right). \end{aligned} \tag{43} $$
Since the feedback gains obtained at different feature points of the reference trajectory are different, a monotonically varying flight state can be selected as the independent variable, and the offline-designed feedback gain coefficients can be interpolated online to obtain the corresponding control commands.
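For reference, the LQR gains described above can be computed numerically with SciPy's continuous-time Riccati solver. The sketch builds the Bryson-rule weights of Equations (40)–(42) from the thresholds in Table 1 and returns the gain matrix of Equation (38); the Jacobians A and B of Equation (35) must be supplied for each feature point of the reference trajectory.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def bryson_weights(d_max=600.0, v_max=50.0, g_max=0.2, p_max=0.2,
                   eta_max=1.0, n_max=6.0, sigma_max=np.pi / 2, q1=1.0):
    """Diagonal Q, R from Bryson's rule, Eqs. (40)-(42), anchored at Q1 = 1."""
    scale = q1 * d_max**2
    Q = np.diag([q1, q1, q1, scale / v_max**2, scale / g_max**2, scale / p_max**2])
    R = np.diag([scale / eta_max**2, scale / n_max**2, scale / sigma_max**2])
    return Q, R

def lqr_gain(A, B, Q, R):
    """Feedback gain K* = R^{-1} B^T P, with P solving the Riccati equation, Eqs. (38)-(39)."""
    P = solve_continuous_are(A, B, Q, R)
    return np.linalg.solve(R, B.T @ P)

# Usage: u = u_ref - K @ delta_x, where delta_x is the deviation from the reference trajectory, Eq. (43).
```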

4.3. Experiment of Nominal Conditions

In the experiment under nominal conditions, the initial state of the MAV was $x_{L0} = 0$ m, $y_{L0} = 0$ m, $z_{L0} = 10{,}000$ m, $V_{L0} = 400$ m/s, $\gamma_{L0} = \pi/6$, $\psi_{L0} = 0$. The initial state of the UAV was $x_{W0} = 100$ m, $y_{W0} = 100$ m, $z_{W0} = 10{,}000$ m, $V_{W0} = 400$ m/s, $\gamma_{W0} = \pi/6$, $\psi_{W0} = 0$.
The formation flight trajectories of the MAV and UAV under the three methods are shown in Figure 5. The MAV was designed to make a continuous S-shaped large-overload maneuver, with a maximum overload of about 4 g at 1 s, 11 s, 29 s, and 41 s, respectively. Figure 5 indicates that the LQR, DDPG, and HIAC can all realize stable tracking of the given trajectory of the MAV under large-overload maneuvers and reach the designed formation target.
Figure 6a–c shows the control commands of the UAV, i.e., the thrust, the normal overload, and the bank angle, generated by the LQR, DDPG, and HIAC, respectively, together with the reference commands of the MAV. Figure 7a–c shows the errors between the control commands of the LQR, DDPG, and HIAC and the reference commands of the MAV. Figure 6 illustrates that there are four peaks in the curves of the control commands due to the four large-overload maneuvers. Moreover, compared with the LQR and DDPG, the trend of the control commands of the HIAC is more consistent with that of the MAV in thrust, normal overload, and bank angle. Especially in the control of the normal overload, the HIAC mitigates, to a certain extent, the sharp command changes typically generated by a reinforcement learning controller, and can provide smoother and more executable control commands under large maneuvers. However, during the large maneuvers of the MAV, all three methods inevitably generate a certain amount of extra adjustment in thrust, overload, and bank angle in order to track the reference commands.
Figure 8 shows the changes of the three controlled states of the UAV, i.e., the velocity, the flight path angle, and the flight azimuth angle, under the LQR, DDPG, and HIAC. Figure 9 shows the deviations of the three controlled states and the relative distance. It can be seen from Figure 8 that the trend of the controlled states of the HIAC is basically the same as that of the MAV, and the formation-keeping performance is obviously better than that of the LQR and DDPG. In particular, the HIAC can keep up with most of the fluctuations of the MAV in flight velocity and flight azimuth angle. Moreover, Figure 9 shows that, compared with the LQR and DDPG, the control precision of the HIAC is significantly improved, and the control deviation rapidly decreases to nearly 0 under the large maneuvers. Figure 9d indicates that the HIAC successfully limits the formation distance within the safe range between 100 m and 600 m, while the LQR and DDPG fail. The LQR continuously accumulates distance deviation due to the velocity deviation during the flight, and ultimately the formation distance shows a divergent trend. Meanwhile, although the relative distance under the DDPG gradually converges, it still extends beyond the safe distance at the end of the flight.
Table 3 presents the root mean square (RMS) errors and maximum errors of the four controlled states of the LQR, DDPG and HIAC. It is clear that both the RMS error and maximum error of the HIAC are smaller than those of the LQR and DDPG. Moreover, the HIAC has a reduction of 5.81%, 70.44%, and 64.95%, respectively, in the RMS error of the velocity, flight path angle and flight azimuth angle compared with the LQR, and has a reduction of 60.35%, 55.32% and 69.47% in the maximum error of velocity, flight path angle and flight azimuth angle, respectively, compared with the LQR. The HIAC has a reduction of 36.10%, 35.85% and 51.61%, respectively, in the RMS error of velocity, flight path angle and flight azimuth angle compared with the DDPG, and has a reduction of 54.43%, 31.57% and 55.01% in the maximum error of velocity, flight path angle and flight azimuth angle, respectively, compared with the DDPG.
In summary, the proposed HIAC significantly improves the state control performance and guarantees that the flight distance stays within a safe distance as well.

4.4. Monte Carlo Experiments

In order to further test how the HIAC adapts to various initial conditions, 100 Monte Carlo simulations were carried out by adding random deviations to the nominal conditions.
The initial state of the MAV is $x_{L0} = 0$ m, $y_{L0} = 0$ m, $z_{L0} = 10{,}000$ m, $V_{L0} = 400$ m/s, $\gamma_{L0} = \pi/6$, $\psi_{L0} = 0$. The baseline of the initial values of the UAV is $x_{W0} = 100$ m, $y_{W0} = 100$ m, $z_{W0} = 10{,}000$ m, $V_{W0} = 400$ m/s, $\gamma_{W0} = \pi/6$, $\psi_{W0} = 0$. Random deviations following uniform distributions were then added to these six baseline values, respectively. The specific ranges of the deviations are presented in Table 4.
Figure 10 shows the scatterplots of the Monte Carlo simulation results of the velocity error, flight path angle error, flight azimuth angle error, and relative distance for the LQR, DDPG, and HIAC. For each evaluation index, the horizontal axis is the RMS error and the vertical axis is the maximum error. It can be seen that the HIAC fulfills the control target for the velocity. Meanwhile, because the training thresholds were set quite strictly in order to achieve better control performance, the maximum error and RMS error of the flight path angle and flight azimuth angle may exceed the thresholds when extreme deviations are added to the initial values. However, the HIAC still presents satisfactory angle-control accuracy compared with the DDPG and LQR. Moreover, in terms of the safe distance, the HIAC stays within the safe range of 100 m to 600 m from the MAV, which reaches the distance control target, whereas the DDPG and LQR gradually drift out of the safe range as the initial values vary. Statistically, compared with the LQR and DDPG, the HIAC has smaller values in both the RMS error and the maximum error of these four evaluation indices. In summary, the performance of the HIAC in formation control is better than that of the other two methods, which is consistent with the simulation results under nominal conditions. It is believed that the HIAC has significant adaptability to varying initial conditions.

5. Conclusions

In this study, a novel HIAC method was proposed, which is able to enhance the smoothness and executability of control commands and improve the control performance of the MAV/UAV flight formation. First, based on the idea of "meta-action" in hybrid reinforcement learning, the formation control was modeled as a continuous–discrete space control problem. Then, we proposed the framework of the HIAC, and the hybrid intelligent agent model based on the DDPG/DDQN was designed through multi-channel decoupling. Finally, we carried out simulations under nominal conditions and 100 Monte Carlo simulations with varying initial conditions. The simulation results showed that, compared with the traditional LQR and DDPG, the HIAC achieves higher control precision and faster convergence. Meanwhile, the adaptability of the HIAC to varying initial conditions was verified as well.
For further practical applications, the HIAC can gradually support scenarios such as military formation operations and terrain surveys. In particular, two aspects should be considered when applying the HIAC. The first is the reliability of the method: the HIAC should be preliminarily trained with a large number of ground tests before real flights, so that intelligent control only gradually takes authority over traditional flight-control methods. The second is the portability of the method: at present, the method supports the deployment of reinforcement learning on hardware such as DSPs and FPGAs, enabling airborne portability and online training of the agent models.

Author Contributions

Conceptualization, methodology, software, validation, formal analysis, resources, data curation, writing—original draft preparation, L.Z. (Luodi Zhao); writing—review and editing, Y.L. and Q.P.; visualization, investigation, Y.L.; supervision, project administration, funding acquisition, L.Z (Long Zhao). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science Foundation of China, grant numbers 42274037 and 41874034, the National Key Research and Development Program of China, grant number 2020YFB0505804, and the Beijing Natural Science Foundation, grant number 4202041.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Lei, L.; Wang, T.; Jiang, Q. Key Technology Develop Trends of Unmanned Systems Viewed from Unmanned Systems Integrated Roadmap 2017–2042. Unmanned Syst. Technol. 2018, 1, 79–84. [Google Scholar]
  2. Mishory, J. DARPA Solicits Information for New Lifelong Machine Learning Program. Inside Pentagon 2017, 33, 10. [Google Scholar]
  3. Pittaway, N. Loyal Wingman. Air Int. 2019, 96, 12–13. [Google Scholar]
  4. Oh, K.; Park, M.; Ahn, H. A survey of multi-agent formation control. Automatica 2015, 53, 424–440. [Google Scholar] [CrossRef]
  5. Wang, H.; Liu, S.; Lv, M.; Zhang, B. Two-Level Hierarchical-Interaction-Based Group Formation Control for MAV/UAVs. Aerospace 2022, 9, 510. [Google Scholar] [CrossRef]
  6. Choi, I.S.; Choi, J.S. Leader-Follower formation control using PID controller. In Proceedings of the International Conference on Intelligent Robotics & Applications, Montreal, QC, Canada, 3–5 October 2012. [Google Scholar]
  7. Gong, Z.; Zhou, Z.; Wang, Z.; Lv, Q.; Xu, Q.; Jiang, Y. Coordinated Formation Guidance Law for Fixed-Wing UAVs Based on Missile Parallel Approach Method. Aerospace 2022, 9, 272. [Google Scholar] [CrossRef]
  8. Liang, Z.; Ren, Z.; Shao, X. Decoupling trajectory tracking for gliding reentry vehicles. IEEE/CAA J. Autom. Sin. 2015, 2, 115–120. [Google Scholar]
  9. Kuriki, Y.; Namerikawa, T. Formation Control of UAVs with a Fourth-Order Flight Dynamics. J. Control. Meas. Syst. Integr. 2014, 7, 74–81. [Google Scholar] [CrossRef]
  10. Kuriki, Y.; Namerikawa, T. Consensus-based cooperative formation control with collision avoidance for a multi-UAV system. In Proceedings of the American Control Conference, Portland, OR, USA, 4–6 June 2014. [Google Scholar]
  11. Atn, G.M.; Stipanovi, D.M.; Voulgaris, P.G. Collision-free trajectory tracking while preserving connectivity in unicycle multi-agent systems. In Proceedings of the American Control Conference, Washington, DC, USA, 17–19 June 2013. [Google Scholar]
  12. Tsankova, D.D.; Isapov, N. Potential field-based formation control in trajectory tracking and obstacle avoidance tasks. In Proceedings of the Intelligent Systems, Sofia, Bulgaria, 6–8 September 2012. [Google Scholar]
  13. Hu, J.; Wang, L.; Hu, T. Autonomous Maneuver Decision Making of Dual-UAV Cooperative Air Combat Based on Deep Reinforcement Learning. Electronics 2022, 11, 467. [Google Scholar] [CrossRef]
  14. Luo, Y.; Meng, G. Research on UAV Maneuver Decision-making Method Based on Markov Network. J. Syst. Simul. 2017, 29, 106–112. [Google Scholar]
  15. Yang, Q.; Zhang, J.; Shi, G. Maneuver Decision of UAV in Short-Range Air Combat Based on Deep Reinforcement Learning. IEEE Access 2020, 8, 363–378. [Google Scholar] [CrossRef]
  16. Li, Y.; Han, W.; Wang, Y. Deep Reinforcement Learning with Application to Air Confrontation Intelligent Decision-Making of Manned/Unmanned Aerial Vehicle Cooperative System. IEEE Access 2020, 99, 67887–67898. [Google Scholar] [CrossRef]
  17. Wang, X.; Gu, Y.; Cheng, Y. Approximate Policy-Based Accelerated Deep Reinforcement Learning. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 1820–1830. [Google Scholar] [CrossRef] [PubMed]
  18. Hasselt, H.V.; Guez, A.; Silver, D. Deep Reinforcement Learning with Double Q-Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Canberra, Australia, 30 November–5 December 2015. [Google Scholar]
  19. Mnih, V.; Kavukcuoglu, K.; Silver, D. Playing Atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602. [Google Scholar]
  20. Silver, D.; Huang, A.; Maddison, C.J. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484. [Google Scholar] [CrossRef]
  21. Silver, D.; Lever, G.; Heess, N. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 21–26 June 2014. [Google Scholar]
  22. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv 2015, arXiv:1502.03167. [Google Scholar]
  23. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. [Google Scholar]
  24. Wada, D.; Araujo-Estrada, S.A.; Windsor, S. Unmanned Aerial Vehicle Pitch Control Using Deep Reinforcement Learning with Discrete Actions in Wind Tunnel Test. Aerospace 2021, 8, 18. [Google Scholar] [CrossRef]
  25. Haarnoja, T.; Zhou, A.; Abbeel, P. Soft Actor-Critic Algorithms and Applications. arXiv 2018, arXiv:1812.05905. [Google Scholar]
  26. Heess, N.; Silver, D.; Teh, Y.W. Actor-critic reinforcement learning with energy-based policies. In Proceedings of the Tenth European Workshop on Reinforcement Learning, Edinburgh, UK, 30 June–1 July 2012. [Google Scholar]
  27. Schaul, T.; Quan, J.; Antonoglou, I. Prioritized experience replay. arXiv 2015, arXiv:1511.05952. [Google Scholar]
  28. Hu, Z.; Wan, K.; Gao, X.; Zhai, Y.; Wang, Q. Deep Reinforcement Learning Approach with Multiple Experience Pools for UAV’s Autonomous Motion Planning in Complex Unknown Environments. Sensors 2020, 20, 1890. [Google Scholar] [CrossRef] [PubMed]
  29. Neunert, M.; Abdolmaleki, A.; Wulfmeier, M. Continuous-Discrete Reinforcement Learning for Hybrid Control in Robotics. In Proceedings of the Conference on Robot Learning, Virtual Event, 30 October–1 November 2020. [Google Scholar]
  30. Xiong, J.; Wang, Q.; Yang, Z. Parametrized Deep Q-Networks Learning: Reinforcement Learning with Discrete-Continuous Hybrid Action Space. arXiv 2018, arXiv:1810.06394. [Google Scholar]
  31. Anderson, M.R.; Robbins, A.C. Formation flight as a cooperative game. In Proceedings of the AIAA Guidance, Navigation, and Control Conference and Exhibit, Boston, MA, USA, 10–12 August 1998. [Google Scholar]
  32. Kelley, H.J. Reduced-order modeling in aircraft mission analysis. AIAA J. 2015, 9, 349–350. [Google Scholar] [CrossRef]
  33. Williams, P. Real-time computation of optimal three-dimensional aircraft trajectories including terrain-following. In Proceedings of the AIAA Guidance, Navigation, and Control Conference and Exhibit, Keystone, CO, USA, 24–26 August 2006. [Google Scholar]
  34. Wang, X.; Guo, J.; Tang, S. Entry trajectory planning with terminal full states constraints and multiple geographic constraints. Aerosp. Sci. Technol. 2019, 84, 620–631. [Google Scholar] [CrossRef]
  35. Snell, S.A.; Enns, D.F.; Garrard, W.L. Nonlinear inversion flight control for a supermaneuverable aircraft. In Proceedings of the AIAA Guidance, Navigation, and Control Conference and Exhibit, Portland, OR, USA, 20–22 August 1990. [Google Scholar]
  36. Dukeman, G. Profile-Following Entry Guidance Using Linear Quadratic Regulator Theory. In Proceedings of the AIAA Guidance, Navigation, and Control Conference and Exhibit, Monterey, CA, USA, 5–8 August 2002. [Google Scholar]
  37. Wen, Z.; Shu, T.; Hong, C. A simple reentry trajectory generation and tracking scheme for common aero vehicle. In Proceedings of the AIAA Guidance, Navigation, and Control Conference, Minneapolis, MN, USA, 13–16 August 2012. [Google Scholar]
  38. Bryson, A.E.; Ho, Y. Applied Optimal Control. Technometrics 1979, 21, 3. [Google Scholar]
Figure 1. The forces on the center of gravity of the aircraft.
Figure 2. Dual aircraft formation for MAV/UAV.
Figure 3. The framework of the HIAC.
Figure 4. The framework of the desired commands solver.
Figure 5. The formation flight trajectories of the MAV and UAV under the three methods.
Figure 6. The control commands of the UAV generated by the LQR, DDPG, and HIAC, respectively, with the reference commands of the MAV. The results of the thrust, the normal overload, and the bank angle are presented in (a–c), respectively.
Figure 7. The errors between the control commands of the LQR, DDPG, and HIAC and the reference commands of the MAV. The errors of the thrust, the normal overload, and the bank angle are presented in (a–c), respectively.
Figure 8. The changes of the three controlled states of the UAV generated by the LQR, DDPG, and HIAC. The results of the velocity, the flight path angle, and the flight azimuth angle are presented in (a–c), respectively.
Figure 9. The deviations of the velocity, the flight path angle, the flight azimuth angle, and the relative distance are presented in (a–d), respectively.
Figure 10. Monte Carlo simulation results of the LQR, DDPG, and HIAC.
Table 1. The experimental parameter settings.

Parameter | Setting | Parameter | Setting
$T$ (s) | 50 | $\eta_{\min}$ | 0
$\Delta T$ (s) | 0.1 | $\eta_{\max}$ | 1
$T_{\max}$ (lb) | 25,600 | $n_{\max}$ | 6
$m$ (kg) | 14,470 | $\sigma_{\max}$ (rad) | $\pi/2$
$g$ (m/s²) | 9.81 | $D_{\Delta\max}$ (m) | 600
$S$ (ft²) | 400 | $D_{\Delta\min}$ (m) | 100
$C_{DP}$ | 0.02 | $V_{\Delta\max}$ (m/s) | 50
$C_{DI}$ | 0.1 | $\psi_{\Delta\max}$ (rad) | 0.2
$\tau_\delta$ (s) | 0.6 | $\gamma_{\Delta\max}$ (rad) | 0.2
$\tau_n$ (s) | 0.5 | $\lambda_{V_c\max}$ (m/s) | 50
$\tau_\sigma$ (s) | 0.5 | $\lambda_{\gamma_c\max}$ (rad) | $\pi/2$
$\varpi_V$ (s) | 0.3 | $\lambda_{\psi_c\max}$ (rad) | $\pi/2$
$\varpi_\gamma$ (s) | 0.2 | $\partial\gamma_c$ (rad) | $\pi/180$
$\varpi_\psi$ (s) | 0.2 | $\partial\psi_c$ (rad) | $\pi/180$
Table 2. The training parameters of DDPG/DDQN.

Parameter | Setting
Learning Rate | 0.0001
Max Episode | 25,000
Batch Size (DDPG) | 256
Batch Size (DDQN) | 64
Discount Factor | 0.99
Experience Buffer Length | 1 × 10^6
Table 3. RMS and maximum errors of the four states of the LQR, DDPG, and HIAC in nominal conditions.

Controller | Error | Velocity (m/s) | Flight Path Angle (rad) | Flight Azimuth Angle (rad) | Relative Distance (m) (Safe Distance [100, 600])
LQR | RMS | 5.6957 | 0.4737 | 0.6202 | 516.7072
LQR | Max. | 15.3307 | 0.8027 | 0.7833 | 710.2799
DDPG | RMS | 8.3953 | 0.2183 | 0.4493 | 444.1190
DDPG | Max. | 13.3379 | 0.5241 | 0.5315 | 610.6078
HIAC | RMS | 5.3647 | 0.1401 | 0.2174 | 460.0709
HIAC | Max. | 6.0780 | 0.3586 | 0.2391 | 552.1845
Table 4. Uniform distribution of deviations for the six initial values.

Number of Monte Carlo Simulations | X (m) | Y (m) | Z (m) | Velocity (m/s) | Flight Path Angle (rad) | Flight Azimuth Angle (rad)
100 | [−50, 550] | [−50, 550] | [−1000, 1000] | [−100, 100] | [−π/18, π/18] | [−π/18, π/18]
