Article

Study on the Glider Soaring Strategy in Random Location Thermal Updraft via Reinforcement Learning

School of Aeronautic Science and Engineering, Beihang University, Beijing 100191, China
* Author to whom correspondence should be addressed.
Aerospace 2023, 10(10), 834; https://doi.org/10.3390/aerospace10100834
Submission received: 31 July 2023 / Revised: 7 September 2023 / Accepted: 22 September 2023 / Published: 25 September 2023
(This article belongs to the Section Aeronautics)

Abstract

Soaring birds can use thermal updrafts in natural environments to fly for long periods or over long distances. Their flight strategy can be applied to gliders to increase flight time. Current studies on soaring flight strategies focus on the turbulent nature of updrafts while neglecting the random characteristics of their generation and disappearance. In addition, most flight strategies only address how to use an updraft while neglecting how to find one. Therefore, this paper develops a complete flight strategy that seeks and uses thermal updrafts appearing at random locations. Through the derivation of the flight dynamics and related formulas, the principle by which gliders acquire energy from thermal updrafts is explained in terms of energy, laying a theoretical foundation for research on soaring flight strategies. Furthermore, the reinforcement learning method is adopted, and a perception strategy suitable for gliders is developed that takes the vertical ground speed, vertical ground speed change rate, heading angle, and heading angle change as the main perception factors. Meanwhile, an area exploration strategy is trained by reinforcement learning, and the two strategies are combined into a complete flight strategy that seeks and uses updrafts. Finally, the flight of the glider in the simulation environment is tested under the guidance of the soaring strategy, and the strategy is verified to significantly improve the flight time of gliders.

1. Introduction

With the continuous expansion of unmanned aerial vehicle (UAV) task requirements and the development of materials, computers, energy equipment, and other technologies, UAV performance is continuously improving in all aspects. UAVs are gradually being developed toward longer endurance and greater intelligence and autonomy.
Flight time is one of the most important performance characteristics of UAVs. In observation, survey, aerial photography, and other flight missions, a longer flight time means that a UAV can complete reconnaissance and surveillance tasks over a wider range and with more complete information.
Energy is one of the most important factors related to the endurance of UAVs. Due to the size and load limitations of UAVs, the onboard energy is always limited; thus, replenishing energy during flight is an important research direction for long-endurance UAVs. As the main external energy supply for UAVs, solar energy has been studied for a long time. In the daytime, such UAVs use solar cells to convert solar radiation into electrical energy to power the propulsion system, avionics, and payload, while some of the energy is stored in batteries. At night, they fly on the energy stored in the battery [1]. In theory, solar aircraft can achieve an energy balance and perform long-endurance or even permanent flight; however, the conversion efficiency of solar cells is low, uncertain weather can lead to an unstable solar energy supply, and solar equipment occupies a share of the UAV payload. Thus, the practical results have not been ideal.
In recent years, the long-endurance flight mode of soaring birds using wind energy has attracted the attention of researchers [2,3,4,5]. Different types of birds have different flight strategies, and some large birds do not fly mainly by flapping, but by intermittent flapping or soaring [6]. By constantly searching for and utilizing thermal updrafts, these birds can fly for long periods without even flapping their wings. This concept inspired a new design idea for UAVs to achieve long flights with low energy consumption. Similar to solar energy, wind energy is a kind of natural energy that widely exists in the troposphere of the atmosphere; however, due to the randomness and complexity of the wind energy intensity and location, using wind energy is also difficult. NASA has conducted research on soaring UAVs. According to theoretical calculations, UAVs with a 2 h flight time can fly for up to 14 h by using thermal updrafts under good weather conditions [7,8].
In addition, the intelligence and autonomy of UAVs are also important for future development. With the development of computer algorithms, processing capabilities, and other technologies, researchers have made great progress in mission planning, strategy decisions, trajectory control, and other aspects of UAVs [9,10]. The control mode of UAVs has also progressed from semi-autonomous forms, such as pre-programming and pre-planning, toward fully autonomous and intelligent forms. At present, however, UAVs cannot operate entirely without human decision-making; they can only perform deterministic tasks and cannot deal with ill-defined tasks well.
To achieve the intelligence and autonomy of UAVs, machine learning can be used. Machine learning can complete some tasks that cannot be completed by conventional methods or are extremely complex [11,12,13,14,15,16,17]. Reinforcement learning is a learning method that takes environmental feedback as input and dynamic programming techniques as guidance; it is a kind of dynamic learning and a training method similar to human learning. In reinforcement learning, the agent trained to complete the target task interacts with the task simulation environment continuously, accumulates experience, and is guided to iterate and update in a better direction by an appropriately designed reward function, so that an optimal agent that can complete the target task is finally obtained through dynamic interaction [18,19].
There are many reinforcement learning algorithms [20,21,22,23]. The soaring strategy studied in this paper involves random factors and continuous processes, for which actor-critic reinforcement learning algorithms are well suited; thus, the soft actor-critic (SAC) algorithm, a stochastic-policy reinforcement learning algorithm, is adopted in this paper to learn the soaring strategy. This algorithm does not directly output an action but outputs a probability distribution and samples the action from it [24]. The SAC algorithm performs well in continuous control, strategy research, and other aspects and is suitable for the problems studied in this paper [25,26,27].
Reddy et al. [28,29] used the reinforcement learning method to study the soaring characteristics of gliders, established a complex atmospheric model, and studied the strategy of gliders climbing in complex airflow environments using thermal updrafts. Moreover, they studied and inferred how birds perceive thermal updrafts by training with different state spaces. In actual flight, however, thermal updrafts are not always available in the environment, and a complete flight strategy in which the glider both seeks and uses updrafts was not considered in their work. The effects of the key parameters of the reward functions on the training results were also not discussed.
The soaring flight strategy studied in this paper is developed to perform autonomous, long-endurance flight. This soaring strategy considers the whole flight process from seeking updrafts to using updrafts. The basic principle of UAVs that use thermal updrafts to obtain energy is described through the derivation of the energy expression of UAVs. Afterward, a flight simulation environment with a random location thermal updraft is built. Furthermore, in this paper, the reinforcement learning method is used to study the perception strategy and exploration strategy of UAVs soaring with thermal updrafts based on the existing conditions. Finally, the soaring strategy is obtained through training so that the glider can carry out autonomous, long-endurance flight guided by this soaring strategy.

2. UAV Energy Acquisition Principle in Thermal Updrafts

UAVs can fly without power for long periods in environments with thermal updrafts. This principle can be derived from an energy perspective and combined with the flight dynamics equations to show how thermal updrafts import energy to gliders.
From the perspective of energy, the total mechanical energy of UAVs includes the gravitational potential energy and kinetic energy of relative airflow:
$$E = \frac{1}{2} m V_a^2 + m g h, \qquad (1)$$
where $E$ represents the energy, $m$ is the mass, $V_a$ is the airspeed of the aircraft, $g$ is the gravitational acceleration, and $h$ is the flight altitude.
The acquisition or dissipation of the total energy can be determined by the change rate of energy. By taking the derivative of Equation (1), we can obtain the change rate of energy with time as follows:
$$\dot{E} = m V_a \dot{V}_a + m g \dot{h}. \qquad (2)$$
In actual flight, to maintain or increase the total energy, the glider must satisfy $\dot{E} \geq 0$. Dynamic modeling of the aircraft was carried out to make the variables in this expression more explicit. First, we established a ground coordinate system $o_g x_g y_g z_g$ and an airflow coordinate system $o_a x_a y_a z_a$. The transformation between these two coordinate systems is determined by three Euler angles, namely, $\psi_a$, $\theta_a$, and $\phi_a$.
According to Newton’s second law, the mechanical equation in the ground coordinate system can be obtained from the force analysis in Figure 1:
$$\begin{bmatrix} m\ddot{x}_g \\ m\ddot{y}_g \\ m\ddot{z}_g \end{bmatrix} = L_{ga} \begin{bmatrix} -D \\ C \\ -L \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ G \end{bmatrix}, \qquad (3)$$
where $\ddot{x}_g$, $\ddot{y}_g$, and $\ddot{z}_g$ represent the three components of the acceleration in the ground coordinate system, $L_{ga}$ is the coordinate transformation matrix between the ground coordinate system and the airflow coordinate system, $D$ is the drag, $C$ is the lateral force, $L$ is the lift, and $G$ represents the gravity.
The expression for L g a (Equation (4)) is as follows:
$$L_{ga} = \begin{bmatrix} \cos\theta_a \cos\psi_a & \sin\theta_a \sin\phi_a \cos\psi_a - \cos\phi_a \sin\psi_a & \sin\theta_a \cos\phi_a \cos\psi_a + \sin\phi_a \sin\psi_a \\ \cos\theta_a \sin\psi_a & \sin\theta_a \sin\phi_a \sin\psi_a + \cos\phi_a \cos\psi_a & \sin\theta_a \cos\phi_a \sin\psi_a - \sin\phi_a \cos\psi_a \\ -\sin\theta_a & \sin\phi_a \cos\theta_a & \cos\phi_a \cos\theta_a \end{bmatrix}, \qquad (4)$$
where $\psi_a$, $\theta_a$, and $\phi_a$ represent the three Euler angles of the transformation between the ground coordinate system and the airflow coordinate system.
In a windy environment, the relationship between the ground speed, airspeed, and wind speed of the glider is as follows:
$$\mathbf{V}_g = \mathbf{V}_a + \mathbf{V}_w, \qquad (5)$$
where $\mathbf{V}_g$ is the ground speed, $\mathbf{V}_a$ is the airspeed, and $\mathbf{V}_w$ is the wind speed.
Resolving this relation along the three axes of the ground coordinate system, we obtain:
$$\begin{bmatrix} \dot{x} \\ \dot{y} \\ \dot{z} \end{bmatrix} = L_{ga} \begin{bmatrix} V_a \\ 0 \\ 0 \end{bmatrix} + \begin{bmatrix} V_{wx} \\ V_{wy} \\ V_{wz} \end{bmatrix}, \qquad (6)$$
where $\dot{x}$, $\dot{y}$, and $\dot{z}$ are the projections of the ground speed on the three axes of the ground coordinate system, and $V_{wx}$, $V_{wy}$, and $V_{wz}$ are the projections of the wind speed on the three axes of the ground coordinate system.
Equation (7) can then be obtained from the equation set above. Substituting $\dot{V}_a$ and $\dot{z}$ into the energy change rate formula, with $\dot{z} = -\dot{h}$, yields the energy change rate (Equation (8)):
$$\left\{\begin{aligned} \ddot{x} &= \frac{1}{m}\left[-\cos\theta_a \cos\psi_a D + (\sin\theta_a \sin\phi_a \cos\psi_a - \cos\phi_a \sin\psi_a) C - (\sin\theta_a \cos\phi_a \cos\psi_a + \sin\phi_a \sin\psi_a) L\right] \\ \ddot{y} &= \frac{1}{m}\left[-\cos\theta_a \sin\psi_a D + (\sin\theta_a \sin\phi_a \sin\psi_a + \cos\phi_a \cos\psi_a) C + (\sin\phi_a \cos\psi_a - \cos\phi_a \sin\theta_a \sin\psi_a) L\right] \\ \ddot{z} &= \frac{1}{m}\left(m g + \sin\theta_a D + \cos\theta_a \sin\phi_a C - \cos\theta_a \cos\phi_a L\right) \\ \dot{\theta}_a &= \frac{1}{m V_a}\left(-m g \cos\theta_a + m \cos\psi_a \sin\theta_a \dot{V}_{wx} + m \sin\psi_a \sin\theta_a \dot{V}_{wy} + m \cos\theta_a \dot{V}_{wz} + \cos\phi_a L - \sin\phi_a C\right) \\ \dot{\psi}_a &= \frac{1}{m V_a \cos\theta_a}\left(m \sin\psi_a \dot{V}_{wx} - m \cos\psi_a \dot{V}_{wy} + \cos\phi_a C + \sin\phi_a L\right) \\ \dot{V}_a &= \frac{1}{m}\left(-m g \sin\theta_a - m \cos\theta_a \cos\psi_a \dot{V}_{wx} - m \cos\theta_a \sin\psi_a \dot{V}_{wy} + m \sin\theta_a \dot{V}_{wz} - D\right) \end{aligned}\right. \qquad (7)$$
$$\dot{E} = -m \cos\theta_a \cos\psi_a \dot{V}_{wx} V_a - m \cos\theta_a \sin\psi_a \dot{V}_{wy} V_a + m \sin\theta_a \dot{V}_{wz} V_a - V_{wz} m g - D V_a. \qquad (8)$$
In the thermal updraft model, $V_{wx}$ and $V_{wy}$ can be simplified to zero, and $V_{wz}$ changes with the location. The energy change rate formula can be simplified as:
$$\dot{E} = m \sin\theta_a \dot{V}_{wz} V_a - V_{wz} m g - D V_a. \qquad (9)$$
To express the effects of thermal updrafts on energy changes more intuitively, we changed the positive direction of the vertical wind speed to upward, and the expression can be changed into the following form:
$$\dot{E} = -m \sin\theta_a \dot{V}_{wz} V_a + V_{wz} m g - D V_a. \qquad (10)$$
Equation (10) shows that drag is the main source of energy dissipation and that the thermal updraft imports energy to the glider: the higher the vertical wind speed of the updraft, the greater the energy gain. The remaining term depends on the flight attitude and the rate of change of the vertical wind speed; because the sine of the flight path angle and this rate of change are both small, this term has little impact on the overall energy change.
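As a quick numerical check of Equation (10), the sketch below evaluates the energy change rate for an assumed flight condition; the mass, airspeed, drag, and wind values are illustrative only and are not taken from the glider of this paper.

```python
import math

def energy_rate(m, g, V_a, D, theta_a, V_wz, V_wz_dot):
    """Energy change rate of Equation (10); vertical wind speed V_wz is positive upward."""
    return -m * math.sin(theta_a) * V_wz_dot * V_a + V_wz * m * g - D * V_a

# Hypothetical example: a 3 kg glider descending slightly at 12 m/s with 1.7 N of drag
# inside a 3 m/s updraft still gains total energy (E_dot > 0).
print(energy_rate(m=3.0, g=9.81, V_a=12.0, D=1.7,
                  theta_a=math.radians(-3.0), V_wz=3.0, V_wz_dot=0.0))
```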

3. Simulation Environment with Random Thermal Updrafts

To verify the advantages of the soaring strategy, a simulation environment with thermal updrafts needs to be established. This environment mainly comprises the UAV flight dynamics model and the thermal updraft model.
A thermal updraft is a rising-air phenomenon caused by uneven heating of the surface by solar radiation. Its height is generally less than 1000 m, its speed is generally 3~5 m/s, and in areas of high intensity it can reach 6~8 m/s. The thermal updraft model used in this paper is a Gaussian model: on a horizontal plane, the vertical wind speed follows a Gaussian distribution, and the wind speed is the same on concentric circles centered on the thermal. The vertical wind speed at point (x, y) on a horizontal plane is expressed as follows:
$$v = \frac{1}{2\pi\sigma} \exp\left(-\frac{(x-\mu_1)^2 + (y-\mu_2)^2}{2\sigma^2}\right), \qquad (11)$$
where $v$ is the vertical wind speed of point (x, y), $\mu_1$ and $\mu_2$ are the mean values of the two dimensions, and $\sigma$ is the standard deviation.
On a horizontal plane, the closer to the center of the updraft $(\mu_1, \mu_2)$, the greater the vertical wind speed is. The radius of the thermal updraft is approximately $2\sigma$.
The form of the vertical wind speed curve in a horizontal plane is shown in Figure 2.
In this paper, the vertical wind speed at the thermal updraft center is 8 m/s, and the radius of the thermal updraft is approximately 200 m. In addition, the height of a thermal updraft is generally below 1000 m; therefore, below 650 m, the wind speed distribution in the horizontal plane is the same at all heights, while above 650 m the wind speed gradually decreases and disappears.
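A minimal sketch of this updraft model is shown below, with the Gaussian of Equation (11) rescaled so that the speed at the thermal center equals the 8 m/s peak used here ($\sigma$ = 100 m gives the roughly 200 m radius); the explicit peak scaling and the linear fade above 650 m are our assumptions.

```python
import numpy as np

def vertical_wind(x, y, mu1, mu2, sigma=100.0, w_peak=8.0, z=0.0, z_top=650.0, z_max=1000.0):
    """Vertical wind speed (m/s) at point (x, y) for a thermal centered at (mu1, mu2).
    Gaussian in the horizontal plane; assumed uniform below z_top and fading to zero at z_max."""
    r2 = (x - mu1) ** 2 + (y - mu2) ** 2
    w = w_peak * np.exp(-r2 / (2.0 * sigma ** 2))
    if z <= z_top:
        return w
    if z >= z_max:
        return 0.0
    return w * (z_max - z) / (z_max - z_top)

# At 2*sigma = 200 m from the center the updraft is about 8*exp(-2) ~ 1.1 m/s,
# close to the 1 m/s perception threshold discussed in Section 5.
print(vertical_wind(200.0, 0.0, 0.0, 0.0))
```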
The flight dynamics model used in this paper is based on a glider, since UAVs with small wing loading are better able to use thermal updrafts for long-endurance, low-energy flight. The glider has a wingspan of approximately 2 m, a wing area of approximately 0.5 m², and a weight of approximately 3 kg. Its small wing loading and large lift-to-drag ratio are conducive to the use of updrafts. The model is shown below (Figure 3).
The high lift-to-drag ratio and small wing loading of this glider make it easier to soar in thermal updrafts. In this paper, the lift coefficient (Cl), drag coefficient (Cd), and lift-to-drag ratio (λ) are calculated for angles of attack (α) from −6° to 10°, and the resulting curves are shown in Figure 4.
The maximum lift-to-drag ratio of this glider is approximately 21, and no stall occurs up to an angle of attack of 10°. Within this range, the lift coefficient can be considered to vary linearly with the angle of attack; after linear fitting, the square of the correlation coefficient (r²) is 0.9999, indicating an excellent fit. Because the flight angle of attack in the subsequent calculations of this paper never exceeds this range, the lift coefficient is computed from the linear relationship. Over the same range, the relationship between the drag coefficient and the angle of attack is close to a quadratic function; after quadratic fitting, the correlation coefficient is likewise 0.9999, again indicating an excellent fit. Therefore, the drag coefficient of the glider is also calculated from the fitted quadratic function.
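The sketch below illustrates this fitting procedure; the (α, Cl, Cd) samples are hypothetical placeholders standing in for the computed polars of Figure 4, and only the form of the fits (linear Cl, quadratic Cd over −6° to 10°) follows the paper.

```python
import numpy as np

# Hypothetical aerodynamic samples used only to demonstrate the fitting step.
alpha = np.linspace(-6.0, 10.0, 9)                 # angle of attack, deg
cl = 0.38 + 0.095 * alpha                          # placeholder lift-coefficient data
cd = 0.022 + 0.0008 * alpha + 0.0005 * alpha ** 2  # placeholder drag-coefficient data

cl_fit = np.polyfit(alpha, cl, 1)   # Cl(alpha): linear fit
cd_fit = np.polyfit(alpha, cd, 2)   # Cd(alpha): quadratic fit

def lift_coeff(a_deg):
    return np.polyval(cl_fit, a_deg)

def drag_coeff(a_deg):
    return np.polyval(cd_fit, a_deg)

print(lift_coeff(4.0) / drag_coeff(4.0))  # lift-to-drag ratio at 4 deg angle of attack
```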
In the simulation flight in this paper, the glider does not have power, and the force diagram is as follows (Figure 5).
The flight mechanics model of the glider in the stable gliding state is as follows:
$$\left\{\begin{aligned} G &= D(\alpha)\sin\gamma + L(\alpha)\cos\mu\cos\gamma \\ D(\alpha)\cos\gamma &= L(\alpha)\cos\mu\sin\gamma \\ m a_y &= L(\alpha)\sin\mu \end{aligned}\right. \qquad (12)$$
The glider is mainly controlled through the inclination (bank) angle μ. The lateral component of the lift changes with μ, generating a lateral acceleration that turns the glider. The longitudinal motion of the glider is mainly controlled by the angle of attack α; in this paper, α is adjusted according to the change in μ so that the glider flies at a stable glide speed.
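A minimal sketch of this control idea is given below: for a commanded bank angle μ, the angle of attack is chosen so that the vertical lift component still balances the weight in a shallow glide. The wing area, air density, and linear Cl fit coefficients are assumed placeholder values, not the paper's identified parameters.

```python
import math

def alpha_for_bank(mu_deg, V_a, m=3.0, S=0.5, rho=1.225, cl0=0.38, cl_alpha=0.095):
    """Angle of attack (deg) keeping L*cos(mu) ~ m*g in a shallow steady glide.
    cl0 and cl_alpha (per degree) are assumed values for the linear Cl fit."""
    qS = 0.5 * rho * V_a ** 2 * S                              # dynamic pressure times wing area
    cl_req = m * 9.81 / (qS * math.cos(math.radians(mu_deg)))  # required lift coefficient
    return (cl_req - cl0) / cl_alpha

# Banking to 30 deg at 12 m/s demands a slightly larger angle of attack than wings-level flight.
print(alpha_for_bank(0.0, 12.0), alpha_for_bank(30.0, 12.0))
```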
Through these two models, a simulation environment was finally built to verify the soaring strategy. When UAVs perform observation, survey, aerial photography, and other missions, they do not migrate over long distances like birds but fly within a certain range; therefore, in this study, a test space with a 1500 m × 1500 m footprint and a height of 1000 m was built. In this space, a thermal updraft appears at a random location and disappears after a certain period. The total simulation time was set to 1 h, during which three thermal updrafts were generated; each appeared at a random location in the area and lasted for 20 min. This simulation environment was then used to verify the soaring strategy.
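One way the random thermal schedule described above could be generated is sketched below (1500 m square area, 1 h episode, three updrafts of 20 min each); the uniform placement and the back-to-back timing of the updrafts are our assumptions, since the paper only states that location and timing are random.

```python
import random

AREA_SIDE = 1500.0        # m, side length of the square test area
THERMAL_LIFETIME = 1200   # s, each updraft lasts 20 min
N_THERMALS = 3            # three updrafts within the 1 h episode

def sample_thermal_schedule(seed=None):
    """Return a list of (t_start, t_end, x_center, y_center) tuples for one episode.
    Updrafts are assumed to appear back to back at uniformly random centers."""
    rng = random.Random(seed)
    schedule = []
    for i in range(N_THERMALS):
        t0 = i * THERMAL_LIFETIME
        x, y = rng.uniform(0.0, AREA_SIDE), rng.uniform(0.0, AREA_SIDE)
        schedule.append((t0, t0 + THERMAL_LIFETIME, x, y))
    return schedule

print(sample_thermal_schedule(seed=0))
```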

4. Perceptual Hovering Strategy in Thermal Updrafts

Given existing technology and hardware, the ability of small UAVs to perceive the environment and process data is far inferior to that of soaring birds. Therefore, the reinforcement learning method is used in this paper to study which factors allow the aircraft to perceive thermal updrafts.
The soft actor-critic (SAC) algorithm, a stochastic-policy reinforcement learning method, is adopted in this paper. The main aim is to develop a soaring strategy and train an agent that can sense and use thermal updrafts for autonomous long-endurance flight by setting an appropriate state space, action space, and reward function.
First, we need to set some necessary parameters in the reinforcement learning algorithm. To train the reinforcement learning algorithm, the reward discount rate, smoothing constant, neural network learning rate, and other parameters are considered. In this paper, the reward discount rate is set to 0.99. The closer this parameter is to 1, the more emphasis is placed on long-term rewards. The training in this paper is a sustained flight strategy, so we place more emphasis on long-term rewards. The smoothing constant is set to 0.003. This parameter represents the speed of the parameter updates for the neural network. The closer the value is to 1, the slower the parameter updates for the neural network. The training may tend to be more stable, but the convergence time will greatly increase. After comparative training, we selected a smaller factor that provides sufficient training stability and a greatly reduced convergence time. In the SAC algorithm, there are three kinds of networks: strategic neural networks, action value neural networks, and state value neural networks. They all have their own neural network learning rates. The learning rate of the neural network is equivalent to the update step size. A small learning rate requires many updates before reaching convergence. Too large of a learning rate causes drastic updates that lead to divergent behaviors. Based on experience, we set the learning rate of the action value neural network and state value neural network as 9 × 10−4 and the learning rate of the strategic neural network as 1 × 10−4.
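For reference, the training hyperparameters stated above are collected in the configuration sketch below; the field names and the dataclass structure are ours, not from the paper or any specific RL library.

```python
from dataclasses import dataclass

@dataclass
class SACConfig:
    """SAC training hyperparameters listed in this section."""
    gamma: float = 0.99       # reward discount rate (emphasizes long-term rewards)
    tau: float = 0.003        # smoothing constant for network parameter updates
    lr_q: float = 9e-4        # learning rate of the action-value network
    lr_v: float = 9e-4        # learning rate of the state-value network
    lr_policy: float = 1e-4   # learning rate of the strategic (policy) network

print(SACConfig())
```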
At present, the data that gliders can perceive independently without relying on external input include the position of the aircraft in the ground axis coordinate system, the ground speed vector of the aircraft, the component of the airspeed vector in the aircraft fuselage axis direction, the aircraft attitude angle, and the aircraft attitude angular velocity.
The state space cannot, however, include all perceivable quantities, because some state quantities do not contribute to the aircraft's sensing of thermal updrafts or even play the opposite role. In the simplified dynamic model, the thermal updraft mainly influences the glider's position and ground speed vector. Through training studies, we found that when the state space contains the position information of the glider, the trained agent does not generalize: when the position of the thermal updraft or the initial position of the glider changes, it is difficult for the glider to find the thermal updraft and hover in it. We therefore take the vertical ground speed, vertical ground speed change rate, heading angle, and heading angle change as the state space in this paper.
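A sketch of this four-dimensional state, built by finite differencing between control steps, is shown below; the step time and the angle-wrapping choice are assumptions for illustration.

```python
import numpy as np

def build_observation(vz, vz_prev, heading, heading_prev, dt=1.0):
    """State vector: vertical ground speed, its change rate, heading angle, heading change.
    dt is an assumed control-step interval in seconds."""
    vz_rate = (vz - vz_prev) / dt
    d_heading = np.arctan2(np.sin(heading - heading_prev),
                           np.cos(heading - heading_prev))  # wrap the change to (-pi, pi]
    return np.array([vz, vz_rate, heading, d_heading], dtype=np.float32)

print(build_observation(vz=1.2, vz_prev=0.8, heading=0.6, heading_prev=0.5))
```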
According to the flight dynamics model in this paper, the left and right turns of the glider are mainly controlled by the angle of inclination, and the longitudinal direction of the glider is mainly controlled by adjusting the angle of attack to maintain the aircraft in balance. Therefore, the action space adopted in this paper is the angle of inclination.
Because the reward function guides the training direction of the agent, it must provide appropriate guidance so that training converges in the correct direction; at the same time, it should not embed too many subjective factors that bias the training results. Soaring birds mainly use thermal updrafts to fly for long periods: they gain altitude in a thermal, use the gained altitude to search for the next thermal, climb again, and continue flying. Therefore, we can take the flight altitude as a reward, guiding the glider toward high altitude and thus long-endurance flight. Alternatively, we can directly use the flight time as the reward target. Since flight altitude and flight time are both relevant reward factors, three kinds of reward function can be set: the flight time alone, the flight altitude alone, or the two merged in a certain proportion.
We trained with these three reward functions and compare the results in this paper. The three reward functions are set as follows:
$$\mathrm{Reward} = \begin{cases} -500, & h \leq 3 \\ 1, & h > 3 \end{cases} \qquad (13)$$
$$\mathrm{Reward} = \begin{cases} -500, & h \leq 3 \\ h/1000, & h > 3 \end{cases} \qquad (14)$$
$$\mathrm{Reward} = \begin{cases} -500, & h \leq 3 \\ 1 + h/1000, & h > 3 \end{cases} \qquad (15)$$
where h is the flight altitude. When h ≤ 3, this round of flight is over, and the agent will receive a significantly negative reward. When h > 3, for each step taken, the reward for the agent will be increased.
In Equation (13), the reward increases by one point for each step; thus, the more steps the agent survives, the longer its flight time and the larger its accumulated reward.
In Equation (14), the agent receives a positive reward proportional to its altitude at each step.
In Equation (15), the agent receives one point plus an altitude-based reward at each step. To keep the flight altitude and flight time rewards on the same order of magnitude, the altitude term is the flight altitude divided by 1000.
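The three reward functions of Equations (13)-(15) can be written directly as below; the termination handling simply mirrors the text (the episode ends with a large negative reward once h ≤ 3 m).

```python
def reward_time(h):
    """Equation (13): a constant reward per step while airborne."""
    return -500.0 if h <= 3.0 else 1.0

def reward_altitude(h):
    """Equation (14): a per-step reward proportional to altitude."""
    return -500.0 if h <= 3.0 else h / 1000.0

def reward_hybrid(h):
    """Equation (15): one point per step plus the altitude term."""
    return -500.0 if h <= 3.0 else 1.0 + h / 1000.0

for r in (reward_time, reward_altitude, reward_hybrid):
    print(r.__name__, r(2.0), r(500.0))
```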
In the result figures, the black dot represents the initial position of the glider, and the colored curve represents the flight trajectory. The colder (bluer) the color of the curve, the faster the glider is descending at that moment; the warmer (redder) the color, the faster it is rising. The gray concentric circles on the lower plane represent the position of the thermal updraft.
Some typical training results are as follows (Figure 6).
Result (a) is obtained by training with a reward function based on the flight time. In this result, after entering the thermal updraft, the glider circles around the center of the updraft with a large radius, and the flight altitude changes very little. With the flight time reward, the circling radius changes very little, and the circling is performed around the center of the thermal updraft rather than through it. If the agent does not land within the training steps, it already maximizes its reward. In other words, when the flight time is set as the reward, the glider does not actively gain altitude; the trajectory of the trained agent therefore does not rise as sharply as that of the agents rewarded with altitude. In nature, however, thermal updrafts are constantly generated and dissipated at different locations. A trajectory that does not significantly increase the glider's altitude shortens the time available to seek the next updraft and is not conducive to sustained flight. Moreover, many kinds of training results can be obtained with the flight time reward, with the glider sometimes spiraling up, sometimes spiraling down, and sometimes hovering at a fixed height. This diversity reflects the uncertainty of the training: although satisfactory results can sometimes be obtained, the flight time is not a suitable reward function.
The results of the training based on the flight altitude reward or hybrid reward can be either represented by result (b) or result (c). The reward values for these two results are similar, but there are some differences in the flight trajectory characteristics.
Training result (b) shows that after entering the thermal updraft, the glider quickly gains altitude by continuously circling through the center of the updraft. Finally, the glider settles at a fixed altitude and circles at the height where the thermal updraft gradually weakens and disappears. This flight trajectory has a small circling radius, and there is a certain distance between the hovering center and the center of the thermal updraft. The main reason for this behavior is that, when altitude is included in the reward, the glider prioritizes gaining altitude to obtain a greater reward. According to the Gaussian thermal updraft model, the vertical wind speed is largest at the center of the updraft, so the glider tends both to fly toward the center and to hover in it; as shown in Figure 6, a flight mode of continuously circling through the center of the thermal updraft is finally formed. During the climb, the offset between the hovering center of the glider and the center of the thermal updraft poses a potential risk of detaching from the updraft.
In training result (c), the glider enters an updraft and flies around the center of the updraft, continuously increasing in the altitude until it hovers at a fixed altitude where the updraft gradually weakens and disappears. The hovering center of this flight trajectory basically coincides with the center of the thermal updraft. Although the glider did not fly through the center of the thermal updraft, it circled around the center with a small radius. The vertical wind speed in this area is also very high, which can enable the glider to quickly achieve higher altitudes. This flight trajectory not only increases the flight altitude and provides more time to find the next updraft, but also avoids the risk of flying to the edge of the updraft and causing detachment from the updraft.
The test results show that agents trained with all three reward functions can make gliders perform long-term hovering flights; however, the results obtained with the flight time reward are not satisfactory. A reward function that includes the flight altitude is more conducive to sustained flight in an environment where thermal updrafts are generated and disappear at random positions. Therefore, we retained an agent that guides the glider to hover with a small radius, with the hovering center basically coinciding with the center of the thermal updraft. Its flight trajectory is as follows (Figure 7):

5. Regional Exploration Strategy in Certain Areas

Because the time and place at which thermal updrafts occur are random, there will sometimes be no updraft anywhere in the space; thus, this randomness must be considered in the soaring strategy. When the glider is not at the edge of a thermal updraft, it can neither locate the updraft nor predict when and where the next one will appear. The only thing the glider can do is explore a certain area; therefore, the final soaring strategy of this paper also needs to include a random regional exploration strategy for periods when no updraft is present.
The SAC reinforcement learning algorithm was still used to train regional random exploration agents. We set up a training environment for the agent, selected the state space and action space, and set up appropriate reward functions.
The main task of the area exploration agent is to randomly explore the area within a certain range to find a thermal updraft when the glider is not in contact with one. The main purpose is to enable the glider to fly through more of the area, explore every corner of it, and thus make it easier to encounter thermal updrafts. The exploration range of the glider is defined as a circular area centered on the glider with a certain radius, and the total exploration area is the total area swept by this circle along the flight path. Our training goal is to maximize the exploration coverage in a limited time. The radius of the glider's exploration range should be neither too large nor too small; in this paper, it is set to 200 m because the standard deviation of the Gaussian thermal updraft model used for the perceptual hovering agent is 100 m, and beyond 200 m from the center the vertical wind speed is less than about 1 m/s, which is difficult for the glider to perceive.
According to the thermal updraft model adopted in this paper, the updraft distribution does not change with height below 650 m; therefore, the training environment of the exploration agent can be set as a two-dimensional plane. The training environment is a square plane with a side length of 1500 m, and the glider is simplified as a particle.
Regarding some necessary parameters of the reinforcement learning algorithm, we used the same settings as the agents of the hovering strategy previously trained. The reward discount rate was set to 0.99. The smoothing constant was set to 0.003. The learning rate of the action value neural network and state value neural network was set to 9 × 10−4, and the learning rate of the strategic neural network was set to 1 × 10−4.
The state space as input includes the speed in the x and y directions in the rectangular plane coordinate system, the distance of the glider to the four boundaries, and a user-defined exploration area increment. The exploration area increment represents the increment of the total exploration area of the glider’s current step relative to the previous step. In such a setting, the information obtained by the glider is independent of the shape and size of the exploration area; therefore, the strategy obtained from the training can adapt to a variety of regional shapes. By adding the exploration area increment, the glider can determine whether the current step has increased the exploration area and determine the advantages and disadvantages of the current action to decide which action should be taken in the next step.
In the exploration training, the glider adopts the same flight dynamics model as the perceptual hovering strategy training; therefore, the action space is the same as before. The glider inclination angle is taken during the action, and its turning movement is controlled.
In one training round, the glider starts flying at a random point in the plane to avoid a single exploration track and increase the randomness of exploration. The end condition of the training round is that the glider flies out of the set boundary, or the total area of exploration reaches 85% or more. The agent training reward is divided into a step reward and a termination reward. In each step, the exploration area increment is set as a reward. If the glider flies out of the boundary and the flight is terminated, the termination reward is −1000. If the total exploration area of the glider reaches the standard and the flight is terminated, the termination reward is +1000. This reward setting can guide the glider to explore a larger area in a shorter time within a certain range.
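A grid-based sketch of this exploration-area bookkeeping is given below: the swept area is approximated by marking grid cells within 200 m of the glider, the per-step reward is the increment of newly covered area, and the ±1000 terminal rewards follow the text. The grid resolution and the use of an area fraction (rather than square metres) for the step reward are our assumptions.

```python
import numpy as np

SIDE, RADIUS, CELL = 1500.0, 200.0, 25.0   # area side, sensing radius, grid cell size (m)
N = int(SIDE / CELL)
xs = (np.arange(N) + 0.5) * CELL           # cell-centre coordinates
XX, YY = np.meshgrid(xs, xs)

def step_reward(covered, x, y, out_of_bounds=False):
    """Mark cells within RADIUS of (x, y); reward = newly covered fraction of the area,
    plus the terminal rewards described above."""
    newly = (~covered) & ((XX - x) ** 2 + (YY - y) ** 2 <= RADIUS ** 2)
    covered |= newly
    reward = newly.sum() / covered.size    # exploration-area increment (fraction of area)
    if out_of_bounds:
        reward -= 1000.0                   # flight terminated outside the boundary
    elif covered.mean() >= 0.85:
        reward += 1000.0                   # coverage target reached
    return reward, covered

covered = np.zeros((N, N), dtype=bool)
r, covered = step_reward(covered, 750.0, 750.0)
print(r, covered.mean())
```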
In summary, in the training of the exploration strategy, the state space does not directly contain the position information of the glider, which enhances the random characteristics of its exploration. The reward is set to be related to the exploration area, ensuring the wide area characteristics of exploration. The agent trained in this way can randomly find thermal updrafts more easily during the limited period when the glider is descending.
The performances of the trained regional exploration agents are as follows (Figure 8):
To demonstrate the effectiveness of the exploration agent more convincingly, 1000 tests were conducted on the trained agent, and the statistics of the percentage of its exploration area in the total area are as follows (Figure 9).
The results show that the agent can conduct random and diverse area exploration. The exploration area in all the results can reach more than 60%. The average exploration area is 80.1%, and the median exploration area is 80.8%. In 98.3% of the results, the exploration area accounts for more than 70% of the area. A total of 88.8% of the test results showed that the exploration area accounted for more than 75% of the area. The results with exploration areas accounting for 75–85% were the most common, accounting for 81.2% of the results.
Therefore, this agent has the characteristics of random and wide exploration, which is conducive to coping with random thermal updrafts.

6. Glider Soaring Strategy Test in the Simulation Environment

After the simulation environment was constructed and the glider agent training was performed, the glider soaring strategy, which is composed of a perceptual hovering agent and a regional exploration agent, was finally obtained. Guided by the soaring strategy, the flight time of the glider in the simulation environment with random thermal updrafts was tested.
First, we used the glider to randomly explore an area without thermal updrafts. The total flight time of unpowered flight was approximately 2.5 min.
Then, we tested the flight of the glider in an environment with random thermal updrafts and selected the flight trajectory of one flight time up to 1 h as follows (Figure 10).
In the total flight time of 1 h, the glider used thermal updrafts to fly for 3464 s (approximately 57.7 min), and the glider flew for 136 s (approximately 2.3 min) in the area without thermal updrafts.
According to the energy change expression derived previously, we calculated the energy changes in this flight simulation, including the contribution of each term, the overall energy change rate, and the total energy change during flight. The resulting curves are described below.
The curves in Figure 11 show the influence of the vertical wind speed, the rate of change of the vertical wind speed, and the drag on the energy change during flight. According to the energy change expression derived previously, these three factors are the main factors that affect the energy change; however, the green dotted line shows that the influence of the rate of change of the vertical wind speed is negligible compared with the other two factors, and its contribution to the energy change rate stays close to zero. The blue dash-dotted line shows that drag is the main source of energy consumption; however, the drag changes little during the whole flight, mainly because the airspeed changes little. The yellow curve shows the influence of the vertical wind speed on the energy change (the dark-yellow line is a smoothed version of the light-yellow line). This curve fluctuates strongly, indicating that the vertical wind speed has the greatest impact on the energy change. The vertical wind speed is thus the dominant factor throughout the soaring process, so the thermal updraft plays the crucial role in soaring with low energy consumption.
The curve in Figure 12 represents the total energy change rate during the whole flight. Because the influence of the rate of the vertical wind speed and drag on the total energy change rate does not change much during flight, the trend of the curve of the total energy change rate and the curve of the influence of the vertical wind speed on the energy rate are basically the same.
According to the formula of the energy change rate and the curve in Figure 11, the energy change rate of the glider is less than zero during gliding, greater than zero during spiraling up, and equal to zero while hovering at a constant altitude.
The flight trajectory in Figure 10 shows that during the flight simulation, the glider made use of three thermal updrafts, between which there were three gliding phases. No thermal updraft was present at the initial position of the glider; therefore, the glider first searches for a thermal updraft while gliding, and in this short period the energy change rate is less than zero. When the glider encounters the first randomly located thermal updraft generated in the simulation space, it starts to spiral up, and the energy change rate becomes greater than zero. The glider then hovers at a fixed height after reaching the altitude where the thermal updraft weakens, and the energy change rate is equal to zero. With the disappearance of the first thermal updraft, the glider glides down again and searches for the next one; this process repeats for each of the three thermal updrafts.
The three dotted circles in Figure 12 represent the process of searching for the thermal updraft three times, corresponding to the flight trajectory in Figure 10. The energy change rate in the spiraling up phase and the hovering phase is also consistent with the simulation flight situation. It can be concluded that the energy change rate formula derived previously is consistent with the flight.
The curve in Figure 13 shows the change in the total energy of the glider during the whole simulation process. The three dotted circles in the figure correspond to the process of searching for the thermal updraft three times. In the process of searching for a thermal updraft, the gliding altitude of the glider is continuously reduced, the gravitational potential energy decreases, the speed is basically unchanged, and the kinetic energy is basically unchanged; thus, the total energy decreases. When the glider encounters a thermal updraft, the glider spirals up, the altitude increases, the gravitational potential energy increases, and the flight speed initially increases and then becomes stable; thus, the total energy increases continuously. When the glider flies to the altitude where the updraft gradually weakens, its altitude and speed are stable, so the total energy basically remains unchanged. The curve is consistent with the curve of the energy change rate and the simulation flight situation.
To verify the effectiveness of the strategy more convincingly, we tested the glider in an environment with a random location thermal updraft 1000 times using the soaring strategy with the perceptual hovering agent and the regional exploration agent.
In the test results, the flight time is mainly distributed across four levels. The first level is 0–5 min, when the glider has not entered a thermal updraft or has entered one but failed to use it. The second level is 20–27.5 min, when the glider uses one thermal updraft. The third level is 40–47.5 min, when the glider uses two thermal updrafts. The fourth level is 60 min, when the glider uses all three thermal updrafts randomly generated within the set time of one hour. This distribution follows directly from the appearance and dissipation times of the thermal updrafts set in the simulation environment described above.
The test statistics results are as follows (Figure 14).
The average flight time is 32.4 min, and the median flight time is 42.7 min, indicating that the glider can make good use of thermal updrafts in most tests. The number of tests in which the flight time is more than 2.5 min, which means that the glider has entered at least one thermal updraft, accounts for approximately 99.7%; however, in 41.5% of the tests, a thermal updraft has not been effectively used. The number of tests that use one to three thermal updrafts accounted for 58.2%. In 42.6% of the tests, the glider used all three randomly generated thermal updrafts successively, and the flight time reached one hour, as set by the simulation.
The glider can significantly improve the flight time performance through the guidance of the perceptual hover strategy combined with the regional exploration strategy.

7. Conclusions

Through the research and simulation of the soaring strategy, the following conclusions can be drawn:
  • According to the energy formula derivation, the vertical wind speed in the thermal updraft can import energy to the aircraft so that it can rely on the wind energy to perform long flights. Drag is the main factor of aircraft energy dissipation. Other factors, including the flight attitude and the change rate of the vertical wind speed, have little influence on the change in energy.
  • During the training of the hovering agents by reinforcement learning, location information is not added. Adding the location information will easily cause the glider to spiral in a fixed position, and the glider cannot cope with thermal updrafts that are generated at random locations. We trained a hovering agent that can guide the glider to determine the next flight trajectory by sensing the surrounding thermal updrafts through the vertical ground speed, vertical ground speed change rate, heading angle, and heading angle change rate.
  • In the training of gliders by reinforcement learning, the flight altitude is a more important factor than the flight time in the reward function. The training results that were obtained when the flight time was used as a reward function are very unstable. A reward function that includes the flight altitude can obtain better agents that can guide glider soaring in an environment where thermal updrafts generate and disappear at random positions.
  • The soaring strategy, which is composed of regional exploration and perceptual hovering, can guide the glider to carry out autonomous, long-endurance flights within a certain range with a random location thermal updraft. In an environment with a random location thermal updraft, the flight time is significantly improved by using this strategy.

Author Contributions

Conceptualization, Y.C., D.Y. and Z.W.; methodology, Y.C.; software, Y.C.; validation, Y.C., D.Y. and Z.W.; formal analysis, Y.C., D.Y. and Z.W.; investigation, Y.C. and D.Y.; resources, Z.W.; data curation, Y.C.; writing—original draft preparation, Y.C.; writing—review and editing, D.Y. and Z.W.; visualization, Y.C.; supervision, D.Y. and Z.W.; All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Rajendran, P.; Smith, H. Development of design methodology for a small solar-powered unmanned aerial vehicle. Int. J. Aerosp. Eng. 2018, 2018, 2820717. [Google Scholar] [CrossRef]
  2. Clarke, J.H.A.; Chen, W.H. Trajectory generation for autonomous soaring UAS. Int. J. Autom. Comput. 2012, 9, 248–256. [Google Scholar] [CrossRef]
  3. Doncieux, S.; Mouret, J.B.; Meyer, J.A. Soaring behaviors in UAVs: ‘Animat’ design methodology and current results. In Proceedings of the 7th European Micro Air Vehicle Conference (MAV07), Toulouse, France, 17–21 September 2007. [Google Scholar]
  4. Edwards, D.J.; Silverberg, L.M. Autonomous soaring: The Montague cross-country challenge. J. Aircr. 2010, 47, 1763–1769. [Google Scholar] [CrossRef]
  5. Edwards, D. Implementation details and flight test results of an autonomous soaring controller. In Proceedings of the AIAA Guidance, Navigation and Control Conference and Exhibit, Honolulu, HI, USA, 18–21 August 2008. [Google Scholar]
  6. Han, J.H.; Han, Y.J.; Yang, H.H.; Lee, S.G.; Lee, E.H. A review of flapping mechanisms for avian-inspired flapping-wing air vehicles. Aerospace 2023, 10, 554. [Google Scholar] [CrossRef]
  7. Allen, M. Autonomous soaring for improved endurance of a small uninhabitated air vehicle. In Proceedings of the 43rd AIAA Aerospace Sciences Meeting and Exhibit, Reno, NV, USA, 10–13 January 2005. [Google Scholar]
  8. Allen, M.; Lin, V. Guidance and control of an autonomous soaring vehicle with flight test results. In Proceedings of the 45th AIAA Aerospace Sciences Meeting and Exhibit, Reno, NV, USA, 8–11 January 2007. [Google Scholar]
  9. Li, S.; Wang, Y.; Zhou, Y.; Jia, Y.; Shi, H.; Yang, F.; Zhang, C. Multi-UAV cooperative air combat decision-making based on multi-agent double-soft actor-critic. Aerospace 2023, 10, 574. [Google Scholar] [CrossRef]
  10. Zhu, H.; Chen, M.; Han, Z.; Lungu, M. Inverse reinforcement learning-based fire-control command calculation of an unmanned autonomous helicopter using swarm intelligence demonstration. Aerospace 2023, 10, 309. [Google Scholar] [CrossRef]
  11. Li, D.; Zhao, D.; Zhang, Q.; Chen, Y. Reinforcement learning and deep learning based lateral control for autonomous driving [Application notes]. IEEE Comput. Intell. Mag. 2019, 14, 83–98. [Google Scholar] [CrossRef]
  12. Chen, L.; Chang, C.; Chen, Z.; Tan, B.; Gašić, M.; Yu, K. Policy adaptation for deep reinforcement learning-based dialogue management. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018. [Google Scholar]
  13. Eslami, S.M.A.; Rezende, D.; Besse, F.; Viola, F.; Morcos, A.S.; Garnelo, M.; Ruderman, A.; Rusu, A.A.; Danihelka, I.; Gregor, K.; et al. Neural scene representation and rendering. Science 2018, 360, 1204–1210. [Google Scholar] [CrossRef] [PubMed]
  14. Hwangbo, J.; Lee, J.; Dosovitskiy, A.; Bellicoso, D.; Tsounis, V.; Koltun, V.; Hutter, M. Learning agile and dynamic motor skills for legged robots. Sci. Robot. 2019, 4, eaau5872. [Google Scholar] [CrossRef] [PubMed]
  15. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of go with deep neural networks and tree search. Nature 2016, 529, 484–489. [Google Scholar] [CrossRef] [PubMed]
  16. Berner, C.; Brockman, G.; Chan, B.; Cheung, V.; Dębiak, P.; Dennison, C.; Farhi, D.; Fischer, Q.; Hashme, S.; Hesse, C. Dota 2 with large scale deep reinforcement learning. arXiv 2019, arXiv:1912.06680. [Google Scholar]
  17. Vinyals, O.; Babuschkin, I.; Czarnecki, W.M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D.H.; Powell, R.; Ewalds, T.; Georgiev, P.; et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 2019, 575, 350–354. [Google Scholar] [CrossRef] [PubMed]
  18. Kaelbling, L.P.; Littman, M.L.; Moore, A.W. Reinforcement learning: A survey. J. Artif. Intell. Res. 1996, 4, 237–285. [Google Scholar] [CrossRef]
  19. Sutton, R.; Barto, A. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998. [Google Scholar]
  20. Watkins, C.J.C.H.; Dayan, P. Technical note: Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
  21. Singh, S.; Jaakkola, T.; Littman, M.L.; Szepesvári, C. Convergence results for single-step on-policy reinforcement-learning algorithms. Mach. Learn. 2000, 38, 287–308. [Google Scholar] [CrossRef]
  22. Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 21–26 June 2014. [Google Scholar]
  23. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.M.O.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D.P. Continuous Control with Deep Reinforcement Learning. United States Patents US20170024643A1, 26 January 2017. [Google Scholar]
  24. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018. [Google Scholar]
  25. Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft Actor-Critic Algorithms and Applications. arXiv 2018, arXiv:1812.05905. [Google Scholar] [CrossRef]
  26. Yu, X.; Fan, Y.; Xu, S.; Ou, L. A self-adaptive SAC-PID control approach based on reinforcement learning for mobile robots. Int. J. Robust Nonlinear Control. 2022, 32, 9625–9643. [Google Scholar] [CrossRef]
  27. Chi, H.; Zhou, M. Trajectory Planning for Hypersonic Vehicles with Reinforcement Learning. In Proceedings of the 40th Chinese Control Conference (CCC), Shanghai, China, 26–28 July 2021. [Google Scholar] [CrossRef]
  28. Reddy, G.; Wong-Ng, J.; Celani, A.; Sejnowski, T.J.; Vergassola, M. Glider soaring via reinforcement learning in the field. Nature 2018, 562, 236–239. [Google Scholar] [CrossRef]
  29. Reddy, G.; Celani, A.; Sejnowski, T.J.; Vergassola, M. Learning to soar in turbulent environments. Proc. Natl. Acad. Sci. USA 2016, 113, E4877–E4884. [Google Scholar] [CrossRef]
Figure 1. Flight mechanics modeling coordinate system.
Figure 2. Vertical wind speed distribution on a horizontal plane. The curved surface in the figure represents the vertical wind speed at each horizontal position at a certain height. Overall, its speed changes with the position in a Gaussian distribution. The bluer the surface color is, the smaller the vertical wind speed at that location, and the greener the color is, the greater the vertical wind speed at that location.
Figure 3. Diagrammatic sketch of the glider.
Figure 4. Aerodynamic parameter curve with variations in the angles of attack. (a) shows the curve of lift coefficient with variations in the angle of attack, (b) shows the curve of drag coefficient with variations in the angle of attack, and (c) shows the curve of lift-drag ratio with variations in the angle of attack.
Figure 5. Force analysis and attitude of the glider, where L is the lift, D is the drag, G represents the gravity, γ is the angle of climb, α is the angle of attack, and μ is the inclination angle.
Figure 6. Glider flight trajectories with different rewards. The black dot represents the initial position of the glider. The three images represent three typical flight trajectories: (a) is obtained by training with a reward function based on the flight time; (b,c) are obtained by training with reward functions based on the flight altitude or on a combination of flight time and altitude.
Figure 7. Glider flight trajectory of the selected agent. The black dot represents the random initial position of the glider. The colorful curve represents the flight trajectory. If the color of the curve is close to the cold tone (blue), it means that the flight drops sharply at this moment. In contrast, the closer the color is to the warm tone (red), the faster the flight rises at this moment. The gray concentric circle on the lower plane represents the position of the thermal updraft.
Figure 8. Test results of the regional exploration agent. The green ‘⋆’ represents the starting point, the yellow ‘⋆’ represents the ending point, the black curve represents the flight path, the arrows represent the direction of flight, and the red dot matrix represents the exploration area. Intuitively, the exploration tracks of gliders are different, and the total exploration area accounts for a large proportion, reflecting the characteristics of the randomness and wide area of exploration. (ad) are four randomly selected test results from numerous well performing test results.
Figure 9. Test statistics of the regional exploration agent.
Figure 10. Flight trajectory of 1 h based on the soaring strategy. The black dot represents the random initial position of the glider. The colorful curve represents the flight trajectory. If the color of the curve is close to the cold tone (blue), the flight drops sharply. In contrast, the closer the color is to the warm tone (red), the faster the flight rises at this moment. The three gray concentric circles with different depths in the figure represent three positions of thermal updrafts randomly generated at different times. The glider began to glide down under the guidance of the area exploration strategy. After contacting the first thermal updraft, the glider started to hover under the guidance of the perceptual hovering strategy. After the first updraft dissipates, the glider switches to the exploration mode to continue searching for the next thermal updraft.
Figure 11. Influence of various factors on the energy change rate of flights. The dark-yellow curve is the result of smoothing the light-yellow curve.
Figure 12. Total energy change rate during the flight. The dark rose red curve is the result of smoothing the light rose red curve. The three dotted circles mark the flight process where the total energy change rate is less than zero, representing the process of searching for the thermal updraft three times, corresponding to the flight trajectory in Figure 10.
Figure 13. Total energy change in the flight. The red curve shows the change in the total energy of the glider during the whole simulation process. The three dotted circles mark the flight process of total energy reduction, representing the process of searching for the thermal updraft three times, corresponding to the flight trajectory in Figure 10.
Figure 14. Flight time statistics results of 1000 tests.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
