Article

Data-Driven Robust Control Using Reinforcement Learning

Phuong D. Ngo, Miguel Tejedor and Fred Godtliebsen
1 Norwegian Centre for E-Health Research, 9019 Tromsø, Norway
2 Department of Mathematics and Statistics, Faculty of Science and Technology, UiT The Arctic University of Norway, 9019 Tromsø, Norway
* Authors to whom correspondence should be addressed.
Appl. Sci. 2022, 12(4), 2262; https://doi.org/10.3390/app12042262
Submission received: 11 January 2022 / Revised: 17 February 2022 / Accepted: 18 February 2022 / Published: 21 February 2022
(This article belongs to the Special Issue Advances in Intelligent Control and Image Processing)

Abstract

This paper proposes a robust control design method that uses reinforcement learning for controlling partially-unknown dynamical systems under uncertain conditions. The method extends the optimal reinforcement learning algorithm with a new learning technique based on robust control theory. By learning from the data, the algorithm proposes actions that guarantee the stability of the closed-loop system within uncertainty bounds that are themselves estimated from the data. Control policies are calculated by solving a set of linear matrix inequalities. The controller was evaluated using simulations on a blood glucose model for patients with Type 1 diabetes. Simulation results show that the proposed methodology is capable of safely regulating the blood glucose within a healthy range under the influence of measurement and process noises. The controller also significantly reduced post-meal fluctuations of the blood glucose. A comparison between the proposed algorithm and the existing optimal reinforcement learning algorithm shows the improved robustness of the closed-loop system using our method.

1. Introduction

Control of unknown dynamic systems with uncertainties is challenging because exact mathematical models are often required. Since many processes are complicated, nonlinear, and time-varying, a control algorithm that does not depend on a mathematical model and can adapt to time-varying conditions is needed. A popular approach is to develop a universal approximator for predicting the output of unknown systems [1]. Control algorithms can then be designed based on the parameters of the approximator. Based on this approach, many control techniques have been proposed using machine learning models such as neural networks and fuzzy logic. For example, Goyal et al. [2] proposed a robust sliding mode controller that can be designed from Chebyshev neural networks. Chadli and Guerra [3] introduced a robust static output feedback controller for Takagi-Sugeno fuzzy models. Ngo and Shin [4] proposed a method for modelling unstructured uncertainties and a new Takagi-Sugeno fuzzy controller using type-2 fuzzy neural networks.
However, obtaining a good approximator requires a significant amount of training data, especially for a complicated model with high-dimensional state spaces or with many inputs and outputs. The data-driven model must also be updated frequently for time-varying systems. In addition, many control design techniques assume that uncertainties are functions of system parameters. In many cases, however, the causes of uncertainties are unknown and unstructured. With the development of data science and machine learning, model-free approaches such as reinforcement learning (RL) have emerged as an effective way to control unknown nonlinear systems [5,6,7,8]. The principle of RL is based on the interaction between a decision-making agent and its environment [9], and the actor–critic method is often used as the RL framework for many control algorithms. In the actor–critic framework, the critic agent uses the current state information of the environment to update the value or action-value function. The actor agent then uses the value or action-value function to calculate the optimal action.
It can be seen that many data-driven algorithms lack stability analysis of the closed-loop systems. Among recent techniques focusing on the robustness of control algorithms, Yang et al. [10] presented an off-policy reinforcement learning (RL) solution to solve robust control problems for a certain class of unknown systems with structured uncertainties. In [11], a robust data-driven controller was proposed based on the frequency response of multivariable systems and convex optimization. Based on data-driven tuning, Takabe et al. [12] introduced a detection algorithm suitable for massive overloaded multiple-input multiple-output systems. In more recent works, Na et al. [13] proposed an approach to address the output-feedback robust control for continuous-time uncertain systems using online data-driven learning, while Makarem et al. [14] used data-driven techniques for iterative feedback tuning of a proportional-integral-derivative controller’s parameters. However, in many cases, stability can only be ensured for specific systems where uncertainties are structured. In addition, the value function must be estimated accurately, which is difficult to achieve, especially at the beginning of the control process when the agent has just started interacting with the environment. Additionally, in many applications, the state space is either continuous or high-dimensional. In these cases, the value function approximation is often inaccurate, potentially leading to instability. Therefore, new RL approaches for which stability can be guaranteed under uncertain conditions are essential if algorithms are to be used in critical and safety-demanding systems.
Type 1 diabetes is a disease caused by the lack of insulin secretion. The condition results in an uncontrolled increase of the blood glucose level if the patient is not provided with insulin doses. A high blood glucose level can lead to both acute and chronic complications and can eventually result in the failure of various organs. One of the major challenges in controlling blood glucose is that the biochemical and physiological kinetics of insulin and glucose are complicated, nonlinear, and only approximately known [15]. Additionally, the stability of the control system is essential in this application, since an unstable control effort will lead to life-threatening conditions for the patient.
This paper proposes a novel method to capture the uncertainty in estimating the value function in reinforcement learning based on observation data. Using this uncertainty information, the paper also presents a new technique to improve the policy while guaranteeing the stability of the closed-loop system under uncertain conditions for partially-unknown dynamical systems. The proposed methodology is applied to a blood glucose model to test its effectiveness in controlling the blood glucose level in patients with Type 1 diabetes.

Structure of Paper

The content of the paper is organized as follows. Section 2 describes the proposed robust RL algorithm. Section 3 shows the simulation results of the methodology. The conclusions are given in Section 4.

2. Materials and Methods

In this section we present the robust RL method and the simulation setup used for evaluation of the algorithm.

2.1. Robust Control Using Reinforcement Learning

In this paper, a class of dynamical systems is considered, which can be described by the following linear state-space equation:
$$\dot{x}(t) = A x(t) + B u(t),$$
where $x \in \mathbb{R}^{n}$ is the vector of $n$ state variables, $u \in \mathbb{R}^{m}$ is the vector of $m$ control inputs, $A \in \mathbb{R}^{n \times n}$ is the state matrix, and $B \in \mathbb{R}^{n \times m}$ is the input matrix. It is assumed that the square $n \times n$ matrix $A$ is unknown and that the pair $(A, B)$ is stabilizable. Our target is to derive a control law $u(t)$ that can regulate the state variables contained in $x(t)$ based on input and output data, without knowing the matrix $A$.
As an RL framework, the proposed robust control algorithm consists of an agent that takes actions and learns the consequences of its actions in an unknown environment. The environment is defined by a state vector x ( t ) that describes its states at time t. The action at time t is represented by u ( t ) . As a consequence of the action, a cost r ( t ) is incurred and accumulated. The cost function r ( t ) is assumed to be known and predefined as a function of the current state and action. The objective of the learning process is to minimize the total cost accumulation in the future.
At each decision time point, the agent receives information about the state of the environment and chooses an action. The environment reacts to this action and transitions to a new state, which determines whether the agent receives a positive or negative reinforcement. Current RL techniques propose optimal actions by minimizing the predicted cost accumulation. However, uncertainties due to noises in the data or inaccurate estimation of the cost accumulation can lead to suboptimal actions and even unstable responses. Our target is to provide the agent with a robust and safe action that can guarantee the reduction of the future cost accumulation in the presence of uncertainties. The action calculated by the proposed algorithm may not be the optimal action that reduces the cost in the fastest way, but it can always guarantee the stability of the system, which is imperative in many critical applications.

2.1.1. Estimation of the Value Function by the Critics

In the RL context, the accumulation of cost over time, when starting in the state x ( t ) and following policy π , is defined as the value function of policy π , i.e.,
$$V^{\pi}(x(t)) = \mathbb{E}_{\pi}\left[ \int_{t}^{\infty} \gamma^{\tau - t} r(\tau)\, d\tau \right],$$
where γ is the discount factor. The cost r ( t ) is assumed to be a quadratic function of the states:
$$r(t) = x^{T}(t) Q x(t),$$
where the matrix $Q \in \mathbb{R}^{n \times n}$ is symmetric and positive semidefinite (since the cost is assumed to be non-negative) and contains the weighting factors of the variables that are minimized.
In order to facilitate the formulation of the stability condition in the form of linear matrix inequalities (LMI), the value function V ( x ( t ) ) is approximated by a quadratic function of the states:
$$V^{\pi}(x(t)) \approx x^{T}(t) P x(t),$$
where the kernel matrix $P \in \mathbb{R}^{n \times n}$ is symmetric and positive semidefinite (since the matrix $Q$ in the cost function is symmetric and positive semidefinite).
By using the Kronecker operation, the approximated value function can be expressed as a linear combination of the basis function $\phi(x(t)) = x(t) \otimes x(t)$:
$$V^{\pi}(x(t)) \approx x^{T}(t) P x(t) = \mathrm{vec}(P)^{T} \left( x(t) \otimes x(t) \right) = w^{T} \left( x(t) \otimes x(t) \right) = w^{T} \phi(x(t)),$$
where $w$ is the parameter vector, $\phi(x(t))$ is the vector of basis functions, and $\otimes$ is the Kronecker product. The transformation between $w$ and $P$ can be performed as follows:
$$w = \mathrm{vec}(P) = [P_{11}, P_{21}, \ldots, P_{n1}, P_{12}, \ldots, P_{n2}, \ldots, P_{1n}, \ldots, P_{nn}]^{T},$$
where $P_{ij}$ is the element of matrix $P$ in the $i$th row and $j$th column. With $T$ as the interval time for data sampling, the integral RL Bellman equation can be used to update the value function [8]:
$$V^{\pi}(x(t)) = \int_{t}^{t+T} \gamma^{\tau - t} r(\tau)\, d\tau + V^{\pi}(x(t+T)).$$
By using the quadratic cost function (Equation (3)) and the approximated value function (Equation (5)), the integral RL Bellman equation can be written as follows:
$$x^{T}(t) P x(t) = \int_{t}^{t+T} x^{T}(\tau) Q x(\tau)\, d\tau + x^{T}(t+T) P x(t+T)$$
or
$$w^{T} \phi(x(t)) = \int_{t}^{t+T} x^{T}(\tau) Q x(\tau)\, d\tau + w^{T} \phi(x(t+T)).$$
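To make the Kronecker-product parameterization concrete, the following short numerical check (a minimal sketch using NumPy; the 2 × 2 matrix is an arbitrary example, not from the paper) verifies that $w^{T}\phi(x) = x^{T} P x$ when $w = \mathrm{vec}(P)$ and $\phi(x) = x \otimes x$.

```python
import numpy as np

# Arbitrary symmetric kernel matrix P (2x2 example)
P = np.array([[2.0, 0.5],
              [0.5, 1.0]])

# vec(P): stack the columns of P into the parameter vector w
w = P.flatten(order="F")        # [P11, P21, P12, P22]

# Basis vector phi(x) = x (Kronecker product) x
x = np.array([1.0, -2.0])
phi = np.kron(x, x)

# The quadratic form x^T P x equals the linear parameterization w^T phi(x)
assert np.isclose(x @ P @ x, w @ phi)

# Inverse transformation: reshape w back into the kernel matrix column-wise
P_back = w.reshape(2, 2, order="F")
assert np.allclose(P, P_back)
```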
At each iteration, $N$ samples along the state trajectory are collected, $x_1(t), x_2(t), \ldots, x_N(t)$. The mean value of $w$ can be obtained by using the least-squares technique:
$$\hat{w} = (X X^{T})^{-1} X Y,$$
where
$$X = [\phi_{\Delta 1} \;\; \phi_{\Delta 2} \;\; \cdots \;\; \phi_{\Delta N}],$$
$$\phi_{\Delta i} = \phi(x_i(t)) - \phi(x_i(t+T)),$$
$$Y = [d(x_1(t)) \;\; d(x_2(t)) \;\; \cdots \;\; d(x_N(t))]^{T}$$
and
$$d(x_i(t)) = \int_{t}^{t+T} x_i^{T}(\tau) Q x_i(\tau)\, d\tau$$
with $i = 1, 2, \ldots, N$.
The confidence interval for the coefficient $w(j)$ is given by
$$w(j) \in \left[ \hat{w}(j) - q_{1-\frac{\theta}{2}} \sqrt{\tau_j \hat{\sigma}^2},\;\; \hat{w}(j) + q_{1-\frac{\theta}{2}} \sqrt{\tau_j \hat{\sigma}^2} \right],$$
where $1 - \theta$ is the confidence level, $q_{1-\frac{\theta}{2}}$ is the quantile function of the standard normal distribution, $\tau_j$ is the $j$th element on the diagonal of $(X X^{T})^{-1}$, and $\hat{\sigma}^2 = \frac{\hat{\epsilon}^{T} \hat{\epsilon}}{N - p}$, with $\hat{\epsilon} = Y - X^{T}\hat{w}$ and $p$ the number of estimated parameters. From that, the uncertainty $\Delta w$ is defined as the deviation interval around the nominal value:
$$\Delta w = \left[ -q_{1-\frac{\theta}{2}} \sqrt{\tau_j \hat{\sigma}^2},\;\; q_{1-\frac{\theta}{2}} \sqrt{\tau_j \hat{\sigma}^2} \right].$$
Matrices $\hat{P}$ and $\Delta P$ can be obtained by rearranging the elements of $\hat{w}$ and $\Delta w$ column-wise (the inverse of the vec operation).
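As an illustration of this critic step, the sketch below (a hypothetical helper, not code from the paper) assumes the sampled differences $\phi_{\Delta i}$ of Equation (12) and the integrated costs $d(x_i(t))$ of Equation (14) have already been computed and stacked row-wise. It returns the least-squares estimate $\hat{P}$ together with the element-wise half-widths $\Delta P$ from Equation (15); the standard $(\Phi^{T}\Phi)^{-1}\Phi^{T}$ form is used, which is equivalent to Equation (10) with the samples stacked as rows.

```python
import numpy as np
from scipy.stats import norm

def estimate_value_function(Phi_delta, d, theta=0.05):
    """Critic update: batch least squares for w with element-wise confidence bounds.

    Phi_delta : (N, p) array, row i is phi(x_i(t)) - phi(x_i(t+T))
    d         : (N,)  array, integrated cost d(x_i(t)) over [t, t+T]
    theta     : 1 - theta is the confidence level
    """
    N, p = Phi_delta.shape
    G_inv = np.linalg.inv(Phi_delta.T @ Phi_delta)     # parameter covariance factor
    w_hat = G_inv @ Phi_delta.T @ d                    # least-squares estimate of w

    resid = d - Phi_delta @ w_hat
    sigma2 = resid @ resid / (N - p)                   # residual variance estimate
    q = norm.ppf(1.0 - theta / 2.0)                    # standard-normal quantile
    delta_w = q * np.sqrt(np.diag(G_inv) * sigma2)     # half-width of the confidence interval

    n = int(round(np.sqrt(p)))
    P_hat = w_hat.reshape(n, n, order="F")             # nominal kernel matrix
    dP = delta_w.reshape(n, n, order="F")              # element-wise uncertainty Delta P
    return P_hat, dP
```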

2.1.2. Policy Improvement by the Actor

Linear feedback controllers have been widely used as a stabilization tool for nonlinear systems where dynamic behavior is considered approximately linear around the operating condition [16,17,18]. Hence, in this paper, we use linear functions of the states with gain K i as the control policy at iteration i:
$$u(t) = \pi(x(t)) = -K_i x(t),$$
and it is assumed that the level of uncertainty remains constant during the control process. The task of the actor is to robustly improve the current policy such that the value function is guaranteed to decrease during the next policy implementation. If the following differential inequality is satisfied:
$$\dot{V}_i(x(t)) + \alpha V_i(x(t)) \leq 0$$
with some positive constant $\alpha$, then by using the comparison lemma (Lemma 3.4 in [19]), the value function $V_i(x(t))$ can be bounded by
$$V_i(x(t)) \leq V_i(x(t_0))\, e^{-\alpha (t - t_0)}.$$
Therefore, maximizing the rate $\alpha$ ensures the fastest guaranteed exponential decrease of $V_i(x(t))$.
The following part shows the main results of the paper, which describe how the policy gain can be improved during the learning process. Derivations of the results are provided in the stability analysis (Section 2.1.3).
Definition 1.
Assume $A$ is a square matrix with dimension $n \times n$ whose entries may take values in intervals, and $x$ is a vector with dimension $n \times 1$. The maximize operation on matrix $A$ and vector $x$ is defined as follows:
$$\mathrm{maximize}(A, x) = C,$$
where
$$C_{ij} = \begin{cases} \max(A_{ij}) & \text{if } x_i x_j \geq 0 \\ \min(A_{ij}) & \text{if } x_i x_j < 0 \end{cases} \qquad i, j = 1, \ldots, n.$$
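A direct NumPy implementation of this operation can be sketched as follows, assuming the uncertain matrix is given by its element-wise interval bounds (for $\Delta P$ centred at zero, these are simply $\pm \Delta P$). The function name is illustrative.

```python
import numpy as np

def maximize_op(A_lo, A_hi, x):
    """Definition 1: element-wise worst-case selection from an interval matrix.

    A_lo, A_hi : (n, n) arrays with the lower/upper bound of each entry A_ij
    x          : (n,) state vector that fixes the sign of x_i * x_j
    """
    pick_upper = np.outer(x, x) >= 0          # True where x_i * x_j >= 0
    return np.where(pick_upper, A_hi, A_lo)   # max(A_ij) or min(A_ij) accordingly
```

For instance, `maximize_op(-dP, dP, x)` produces the matrix $\Delta P_{i,\max}$ used in the policy update below.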
Assuming that the sign of the state variables does not change between consecutive policy updates, the improved policy $K_{i+1}$ can be obtained by maximizing $\alpha$ subject to
$$\begin{bmatrix} V & K_{i+1}^{T} B^{T} \\ B K_{i+1} & -\gamma I \end{bmatrix} \preceq 0$$
and
$$\begin{bmatrix} \zeta I & K_{i+1} \\ K_{i+1}^{T} & I \end{bmatrix} \succeq 0,$$
where
$$V = M + \gamma \Delta P_{i,\max} \Delta P_{i,\max}^{T} - \hat{P}_i B K_{i+1} - K_{i+1}^{T} B^{T} \hat{P}_i + \alpha \left( \hat{P}_i + \tfrac{1}{2} \left( \Delta P_{i,\max} \Delta P_{i,\max}^{T} + I \right) \right)$$
and
$$M = -Q - K_i^{T} R K_i + \hat{P}_i B K_i + K_i^{T} B^{T} \hat{P}_i + H_i$$
with $\Delta P_{i,\max} = \mathrm{maximize}(\Delta P_i, x)$ and $H_i = \mathrm{maximize}(\Delta P_i B K_i + K_i^{T} B^{T} \Delta P_i, x)$, utilizing the maximize operation defined in Definition 1. Inequality (22) provides the stability condition; its derivation is given in Section 2.1.3. Inequality (23) provides an upper bound on the updated gain $K_{i+1}$ through the user-defined parameter $\zeta$: the value of $\zeta$ limits the maximum $L_2$ gain of $K_{i+1}$, since inequality (23) is equivalent to $K_{i+1} K_{i+1}^{T} \preceq \zeta I$.
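A schematic implementation of this policy update is sketched below, assuming CVXPY with an SDP-capable solver is available. The helper name, the entry-wise interval bound used to build $H_i$, and the default values of $\gamma$ and $\zeta$ are illustrative choices rather than values from the paper; $Q$ and $R$ are the state and control weighting matrices appearing in $M$.

```python
import numpy as np
import cvxpy as cp

def update_policy(P_hat, dP, K_i, B, Q, R, x, gamma=1.0, zeta=100.0):
    """Actor update (sketch of Theorem 1): search for K_{i+1} and the largest
    decay rate alpha satisfying the LMIs (22) and (23). dP holds the element-wise
    half-widths of the interval matrix Delta P (centred at zero)."""
    n, m = B.shape
    pick_upper = np.outer(x, x) >= 0

    # maximize(Delta P, x): worst-case end of each interval entry
    dP_max = np.where(pick_upper, dP, -dP)

    # Entry-wise interval bound for Delta P @ B @ K_i is +/- dP @ |B K_i|;
    # H_i applies the maximize operation to the symmetrised product
    S = dP @ np.abs(B @ K_i)
    H = np.where(pick_upper, S + S.T, -(S + S.T))

    M = -Q - K_i.T @ R @ K_i + P_hat @ B @ K_i + K_i.T @ B.T @ P_hat + H

    K_next = cp.Variable((m, n))
    alpha = cp.Variable(nonneg=True)

    V = (M + gamma * dP_max @ dP_max.T
         - P_hat @ B @ K_next - K_next.T @ B.T @ P_hat
         + alpha * (P_hat + 0.5 * (dP_max @ dP_max.T + np.eye(n))))

    lmi_stability = cp.bmat([[V, K_next.T @ B.T],
                             [B @ K_next, -gamma * np.eye(n)]]) << 0
    lmi_gain_bound = cp.bmat([[zeta * np.eye(m), K_next],
                              [K_next.T, np.eye(n)]]) >> 0

    problem = cp.Problem(cp.Maximize(alpha), [lmi_stability, lmi_gain_bound])
    problem.solve()
    return K_next.value, alpha.value
```

Because both LMIs are affine in $K_{i+1}$ and $\alpha$ for fixed $\gamma$ and $\zeta$, the search can be posed directly as a semidefinite program; if the solver reports infeasibility, a practical choice is to keep the previous gain for the next interval.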

2.1.3. Stability Analysis

With the control policy as described in Equation (17), the equation for the closed-loop system can be derived as follows:
$$\dot{x}(t) = A x(t) - B K x(t) = (A - B K) x(t).$$
Lemma 1.
Assuming that the closed-loop system described by Equation (26) is stable, solving for P in Equation (8) is equivalent to finding the solution of the underlying Lyapunov equation [8]:
$$P (A - B K) + (A - B K)^{T} P = -Q.$$
Proof of Lemma 1. 
We start with Equation (27) and prove that its solution $P$ is also the solution of Equation (8). Consider $V(x(t)) = x^{T}(t) P x(t)$, where $P$ is the solution of Equation (27):
$$\dot{V}(x(t)) = \frac{d\left(x^{T}(t) P x(t)\right)}{dt} = \dot{x}^{T}(t) P x(t) + x^{T}(t) P \dot{x}(t) = x^{T}(t)\left[ (A - BK)^{T} P + P (A - BK) \right] x(t) = -x^{T}(t) Q x(t) \quad (\text{using Equation (27)}).$$
Since the closed-loop system is stable, the Lyapunov Equation (27) has a unique solution $P \succ 0$. From (28), this solution satisfies
$$\frac{d\left(x^{T}(t) P x(t)\right)}{dt} = -x^{T}(t) Q x(t),$$
which is equivalent to
$$x^{T}(t+T) P x(t+T) - x^{T}(t) P x(t) = -\int_{t}^{t+T} x^{T}(\tau) Q x(\tau)\, d\tau.$$
Therefore, P is also the solution of Equation (8). □
Lemma 2.
Given matrices E and F with appropriate dimensions, the following LMI can be obtained:
$$E F^{T} + F E^{T} \preceq E E^{T} + F F^{T}.$$
Proof of Lemma 2. 
From the properties of matrix norm, we have
$$(E - F)(E - F)^{T} \succeq 0,$$
which is equivalent to
$$E E^{T} + F F^{T} - E F^{T} - F E^{T} \succeq 0$$
or
$$E F^{T} + F E^{T} \preceq E E^{T} + F F^{T}. \qquad \square$$
Lemma 3.
Given $A$ as a square matrix with dimension $n \times n$ (with interval-valued entries, as in Definition 1) and $x$ as a vector with dimension $n \times 1$, the following inequality holds:
$$x^{T} A x \leq x^{T} C x,$$
where C = maximize ( A , x ) as in Definition 1.
Proof of Lemma 3. 
We have
$$x^{T} A x = \sum_{i,j=1}^{n} a_{ij} x_i x_j \leq \sum_{i,j=1}^{n} c_{ij} x_i x_j = x^{T} C x,$$
where $c_{ij} = \max(a_{ij})$ if $x_i x_j \geq 0$ and $c_{ij} = \min(a_{ij})$ if $x_i x_j < 0$, with $i, j = 1, \ldots, n$. □
Theorem 1.
Consider a dynamic system that can be represented by Equation (1) with the state matrix $A$ unknown. Assume that the sign of the state variables does not change between consecutive policy updates and that the estimated value function at iteration $i$ is $V_i(x(t)) = x^{T}(t) P_i x(t)$ with $P_i = \hat{P}_i + \Delta P_i$. If
  • the current control policy $u(t) = \pi_i(x(t)) = -K_i x(t)$ is stabilizing;
  • the LMI given in (22) is satisfied for some positive constant γ;
then the closed-loop system with the control policy $u(t) = -K_{i+1} x(t)$ is quadratically stable with convergence rate α.
Proof of Theorem 1. 
Since the current control policy is stabilizing, the estimated parameter matrix $P_i$ is positive definite. Hence, $V_i(x(t)) = x_t^{T} P_i x_t > 0$. Here, $V_i(x(t))$ is used as a Lyapunov function for the updated control policy $u(t) = \pi_{i+1}(x(t)) = -K_{i+1} x(t)$. For notational convenience, the state vector $x(t)$ and input vector $u(t)$ are denoted by $x_t$ and $u_t$, respectively. By using Equation (27) in Lemma 1 and the representation $P_i = \hat{P}_i + \Delta P_i$, we can calculate the left side of Equation (18) as follows:
$$\begin{aligned}
\dot{V}_i(x(t)) + \alpha V_i(x(t)) &= \dot{x}_t^{T} P_i x_t + x_t^{T} P_i \dot{x}_t + \alpha x_t^{T} P_i x_t \\
&= (A x_t + B u_t)^{T} P_i x_t + x_t^{T} P_i (A x_t + B u_t) + \alpha x_t^{T} P_i x_t \\
&= x_t^{T} \left[ P_i (A - B K_{i+1}) + (A - B K_{i+1})^{T} P_i + \alpha P_i \right] x_t \\
&= x_t^{T} \left[ P_i (A - B K_i) + (A - B K_i)^{T} P_i \right] x_t + x_t^{T} \left[ P_i B (K_i - K_{i+1}) + (K_i - K_{i+1})^{T} B^{T} P_i + \alpha P_i \right] x_t \\
&= -x_t^{T} \left[ Q + K_i^{T} R K_i \right] x_t + x_t^{T} \left[ (\hat{P}_i + \Delta P_i) B (K_i - K_{i+1}) + (K_i - K_{i+1})^{T} B^{T} (\hat{P}_i + \Delta P_i) + \alpha \hat{P}_i + \alpha \Delta P_i \right] x_t \\
&= x_t^{T} \big[ -Q - K_i^{T} R K_i + \hat{P}_i B K_i + K_i^{T} B^{T} \hat{P}_i + \alpha \hat{P}_i + \Delta P_i B K_i + K_i^{T} B^{T} \Delta P_i - \Delta P_i B K_{i+1} - K_{i+1}^{T} B^{T} \Delta P_i - \hat{P}_i B K_{i+1} - K_{i+1}^{T} B^{T} \hat{P}_i + \alpha \Delta P_i \big] x_t.
\end{aligned}$$
By using Lemma 3, we have the following inequality:
$$\Delta P_i B K_i + K_i^{T} B^{T} \Delta P_i \leq H_i,$$
and the following inequality can be obtained by Lemma 2:
$$-\Delta P_i B K_{i+1} - K_{i+1}^{T} B^{T} \Delta P_i \leq \gamma \Delta P_i \Delta P_i^{T} + \frac{1}{\gamma} (B K_{i+1})^{T} (B K_{i+1}) \leq \gamma \Delta P_{i,\max} \Delta P_{i,\max}^{T} + \frac{1}{\gamma} K_{i+1}^{T} B^{T} B K_{i+1}.$$
Additionally,
$$\alpha \Delta P_i \leq \alpha \tfrac{1}{2} \left( \Delta P_i \Delta P_i^{T} + I \right) \leq \alpha \tfrac{1}{2} \left( \Delta P_{i,\max} \Delta P_{i,\max}^{T} + I \right),$$
where $H_i = \mathrm{maximize}(\Delta P_i B K_i + K_i^{T} B^{T} \Delta P_i, x)$ and $\Delta P_{i,\max} = \mathrm{maximize}(\Delta P_i, x)$, utilizing the maximize operator defined in Definition 1.
Hence, $\dot{V}_i(x(t)) + \alpha V_i(x(t))$ can be bounded by
$$\dot{V}_i(x(t)) + \alpha V_i(x(t)) \leq x_t^{T} \Big[ -Q - K_i^{T} R K_i + \hat{P}_i B K_i + K_i^{T} B^{T} \hat{P}_i + H_i + \alpha \hat{P}_i + \alpha \tfrac{1}{2} \left( \Delta P_{i,\max} \Delta P_{i,\max}^{T} + I \right) - \hat{P}_i B K_{i+1} - K_{i+1}^{T} B^{T} \hat{P}_i + \gamma \Delta P_{i,\max} \Delta P_{i,\max}^{T} + \frac{1}{\gamma} K_{i+1}^{T} B^{T} B K_{i+1} \Big] x_t.$$
Using Lyapunov theory, the system is quadratically stable with convergence rate $\alpha$ if $\dot{V}_i(x(t)) \leq -\alpha V_i(x(t))$. This condition is satisfied if
$$x_t^{T} \Big[ -Q - K_i^{T} R K_i + \hat{P}_i B K_i + K_i^{T} B^{T} \hat{P}_i + H_i + \alpha \hat{P}_i + \alpha \tfrac{1}{2} \left( \Delta P_{i,\max} \Delta P_{i,\max}^{T} + I \right) - \hat{P}_i B K_{i+1} - K_{i+1}^{T} B^{T} \hat{P}_i + \gamma \Delta P_{i,\max} \Delta P_{i,\max}^{T} + \frac{1}{\gamma} K_{i+1}^{T} B^{T} B K_{i+1} \Big] x_t \leq 0.$$
The above condition can be written in the matrix form shown in Theorem 1 by applying the Schur complement to the term $\frac{1}{\gamma} K_{i+1}^{T} B^{T} B K_{i+1}$. □
By using Theorem 1, it can be seen that the closed-loop system with the proposed improved policy is asymptotically stable. Note that Theorem 1 also applies to unknown nonlinear systems, provided they can be approximated by the linear state-space Equation (1) and their nonlinearity lies within the uncertainty bound $\Delta P$ calculated from $\Delta w$ in Equation (16).

2.1.4. Robust Reinforcement Learning Algorithm

The robust RL algorithm for controlling partially-unknown dynamical systems consists of the following steps:

Initialization

(Step i = 0 )
  • Select an initial policy $u(t) = -K_0 x(t)$.

Estimation of the Value Function

(Step i = 1 , 2 , )
  • Apply the control action $u(t)$ based on the current policy $u(t) = -K_i x(t)$.
  • At time t + T , collect and compute the dataset ( X , Y ) , which are defined in Equations (11) and (13).
  • Update vector w by using the batch least-square method (Equation (10)).

Control Policy Update

  • Transform vector w into the kernel matrix P using the Kronecker transformation.
  • Update the policy by solving the LMI in Theorem 1.
Figure 1 shows a simplified diagram of the above algorithm. Note that the estimation of the value function is on-policy learning, since it updates $V^{\pi}(x(t))$ using the V-value of the next state and the current policy's action.
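To show how the three steps fit together, a schematic loop is sketched below. Here `collect_trajectory` is a hypothetical interface to the plant or simulator that applies the current policy $u(t) = -K x(t)$ over one learning interval and returns the sampled quantities needed by the critic, while `estimate_value_function` and `update_policy` refer to the illustrative helpers sketched in Sections 2.1.1 and 2.1.2.

```python
def robust_rl_loop(env, K0, B, Q, R, n_iterations=20, gamma=1.0, zeta=100.0):
    """Schematic robust RL loop: alternate critic estimation and robust actor update."""
    K = K0
    for _ in range(n_iterations):
        # Apply u(t) = -K x(t) and collect phi-differences, integrated costs, and the
        # latest state over one policy-update interval (hypothetical environment API)
        Phi_delta, d, x_latest = env.collect_trajectory(K)

        # Critic: estimate the value-function kernel P and its uncertainty Delta P
        P_hat, dP = estimate_value_function(Phi_delta, d)

        # Actor: robust policy improvement by solving the LMIs of Theorem 1
        K_new, alpha = update_policy(P_hat, dP, K, B, Q, R, x_latest, gamma, zeta)
        if K_new is not None:        # keep the previous gain if the LMIs are infeasible
            K = K_new
    return K
```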

2.1.5. Simulation Setup

A simulation study of the proposed robust RL controller was conducted on a glucose kinetics model, which can be described by [20,21,22,23]:
$$\frac{dD_1(t)}{dt} = A_G D(t) - \frac{D_1(t)}{\tau_D},$$
$$\frac{dD_2(t)}{dt} = \frac{D_1(t)}{\tau_D} - \frac{D_2(t)}{\tau_D},$$
$$\frac{dg(t)}{dt} = -p_1 g(t) - \chi(t) g(t) + \frac{D_2(t)}{\tau_D} + w(t)$$
and
$$\frac{d\chi(t)}{dt} = -p_2 \chi(t) + \frac{p_3}{V} \left( i(t) - i_b(t) \right).$$
In this model, parameter and variable descriptions can be found in Table 1 and Table 2, respectively. The values of the parameters are selected based on [20,21]. Variable w ( t ) in Equation (43) is the process noise. The measured blood glucose value is affected by a random noise v ( t ) :
$$\hat{g}(t) = g(t) + v(t).$$
The inputs of the model are the amount of carbohydrate intake $D$ and the insulin concentration $i$. The value of $i(t) - i_b(t)$ must be non-negative:
$$i(t) - i_b(t) \geq 0.$$
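For reference, the glucose kinetics model above can be simulated with a simple forward-Euler sketch. The parameter values below are illustrative placeholders only (the paper selects its values from [20,21]); `insulin_policy` and `meal_profile` are hypothetical callables supplying $i(t) - i_b(t)$ and $D(t)$.

```python
import numpy as np

def simulate_glucose(insulin_policy, meal_profile, t_end=600.0, dt=1.0,
                     g0=290.0, p1=0.02, p2=0.025, p3=1.3e-5, AG=0.8,
                     tauD=40.0, V=120.0, proc_sd=0.0, meas_sd=0.0, seed=0):
    """Forward-Euler simulation of the glucose kinetics model with process and
    measurement noise. All parameter values are placeholders, not the paper's."""
    rng = np.random.default_rng(seed)
    D1 = D2 = chi = 0.0
    g = g0
    times, glucose = [], []
    for k in range(int(t_end / dt)):
        t = k * dt
        g_meas = g + meas_sd * rng.standard_normal()          # noisy measurement of g
        u = max(insulin_policy(t, g_meas, chi), 0.0)          # enforce i(t) - i_b(t) >= 0
        D = meal_profile(t)                                   # carbohydrate intake rate
        w = proc_sd * rng.standard_normal()                   # process noise

        dD1 = AG * D - D1 / tauD                              # meal absorption, compartment 1
        dD2 = D1 / tauD - D2 / tauD                           # meal absorption, compartment 2
        dg = -p1 * g - chi * g + D2 / tauD + w                # plasma glucose dynamics
        dchi = -p2 * chi + (p3 / V) * u                       # interstitial insulin activity

        D1, D2, g, chi = D1 + dt * dD1, D2 + dt * dD2, g + dt * dg, chi + dt * dchi
        times.append(t)
        glucose.append(g)
    return np.array(times), np.array(glucose)
```

For example, `simulate_glucose(lambda t, g, chi: 0.0, lambda t: 0.0)` runs the model open-loop during a fasting period.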

3. Results and Discussion

In order to evaluate the performance of the robust RL controller, we implemented it on the glucose kinetics model described in the previous section under a daily scenario for patients with Type 1 diabetes. To make the scenario realistic, the model was simulated under a nominal noise-free case and three different levels of uncertainty (four cases in total). The uncertainties include process noise ($w(t)$) and measurement noise ($v(t)$), which are assumed to be Gaussian with the standard deviations for each case shown in Table 3.

3.1. Without Meal Intake

This part describes the simulation results during the fasting period (without meal intake). The purpose of the simulation is to compare the performance of the robust RL algorithm with that of the conventional optimal RL algorithm [24] in the nominal condition (uncertainty case 1). The initial blood glucose for both scenarios was set at 290 mg/dL and the target blood glucose was 90 mg/dL. The initial policy at the beginning of the simulation was chosen as follows:
$$u(t) = K_0 x(t) = 0.27\, g(t) + 266.00\, \chi(t).$$
Figure 2 shows the comparison of blood glucose levels between the robust RL and the optimal RL algorithm in the nominal condition. From the results, it can be seen that the robust RL algorithm successfully reduces the blood glucose level, while the optimal RL algorithm becomes unstable when the blood glucose approaches the desired value. The instability of the optimal RL algorithm in this case can be explained by the nonlinearity of the system (due to the coupling term $\chi(t) g(t)$ in Equation (43)), the saturation of the insulin concentration (Equation (46)), and the lack of perturbed data when the blood glucose approaches the steady-state value. The insulin concentration during the simulation can be found in Figure 3, in which the dotted blue line indicates the unstable insulin profile.
Figure 4 shows the blood glucose responses from the robust RL in different uncertain conditions without meal intake. The results show similar and stable responses in all the uncertain conditions with settling time to the desired blood glucose level of approximately 45 min. The insulin concentration and the update of controller gains can be found in Figure 5 and Figure 6.

3.2. With Meal Intake

In this part, the performance of the robust RL controller was tested under conditions for which the system is subjected to meal intakes with the carbohydrate profile as shown in Figure 7.
During the simulation period with meal intakes, the blood glucose responses of the robust RL control system throughout the day, under the four uncertainty cases, are shown in Figure 8. The insulin concentrations during the process can be found in Figure 9. The results show that the controller provides the most aggressive action under case 1 (no uncertainty) and the least aggressive action under case 4 (the highest level of measurement and process noise). This leads to the largest and smallest reductions of postprandial blood glucose in case 1 and case 4, respectively. Most importantly, the robust RL algorithm kept the system stable, and no hypoglycemia events occurred during the simulation in any of the four cases under the different levels of uncertainty.

4. Conclusions

The paper proposes a robust RL algorithm for dynamical systems with uncertainties. The uncertainties are approximated by the critic and represented in the value function, and LMI techniques are used to improve the controller gain. The algorithm was simulated on a blood glucose model for patients with Type 1 diabetes, with the objective of controlling and maintaining a healthy blood glucose level. The comparison between the robust RL algorithm and the optimal RL algorithm shows a significant improvement in the robustness of the proposed algorithm. Simulation results show that the algorithm successfully regulated the blood glucose and kept the system stable under different levels of uncertainty.

Author Contributions

P.D.N. conceptualized the ideas, developed the algorithms, performed training, validation, and numerical simulations, and led the writing process. M.T. contributed to the development of the algorithms, provided critical feedback, analyzed the results, and read and approved the final manuscript. F.G. acquired funding and resources, managed the project, and provided critical feedback leading to this publication. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Tromsø Research Foundation under project “A smart controller for T1D using RL and SS representation” with grant/award number: A3327. The article processing charge was funded by a grant from the publication fund of UiT The Arctic University of Norway.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data analyzed or generated during the study is available upon request.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LMI   Linear matrix inequalities
RL    Reinforcement learning

References

  1. Lee, H.; Tomizuka, M. Robust adaptive control using a universal approximator for SISO nonlinear systems. IEEE Trans. Fuzzy Syst. 2000, 8, 95–106.
  2. Goyal, V.; Deolia, V.K.; Sharma, T.N. Robust sliding mode control for nonlinear discrete-time delayed systems based on neural network. Intell. Control Autom. 2015, 6, 75–83.
  3. Chadli, M.; Guerra, T.M. LMI solution for robust static output feedback control of discrete Takagi-Sugeno fuzzy models. IEEE Trans. Fuzzy Syst. 2012, 20, 1160–1165.
  4. Ngo, P.D.; Shin, Y.C. Modelling of unstructured uncertainties and robust controlling of nonlinear dynamic systems based on type-2 fuzzy basis function networks. Eng. Appl. Artif. Intell. 2016, 53, 74–85.
  5. Bothe, M.K.; Dickens, L.; Reichel, K.; Tellmann, A.; Ellger, B.; Westphal, M.; Faisal, A.A. The use of reinforcement learning algorithms to meet the challenges of an artificial pancreas. Expert Rev. Med. Devices 2013, 10, 661–673.
  6. De Paula, M.; Ávila, L.O.; Martínez, E.C. Controlling blood glucose variability under uncertainty using reinforcement learning and Gaussian processes. Appl. Soft Comput. J. 2015, 35, 310–332.
  7. Ouyang, Y.; He, W.; Li, X. Reinforcement learning control of a single-link flexible robotic manipulator. IET Control Theory Appl. 2017, 11, 1426–1433.
  8. Vrabie, D.; Vamvoudakis, K.G.; Lewis, F.L. Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles, 1st ed.; Institution of Engineering and Technology: London, UK, 2012; Volume 81.
  9. Sutton, R.; Barto, A. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018; p. 129.
  10. Yang, Y.; Guo, Z.; Xiong, H.; Ding, D.W.; Yin, Y.; Wunsch, D.C. Data-Driven Robust Control of Discrete-Time Uncertain Linear Systems via Off-Policy Reinforcement Learning. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3735–3747.
  11. Karimi, A.; Kammer, C. A data-driven approach to robust control of multivariable systems by convex optimization. Automatica 2017, 85, 227–233.
  12. Takabe, S.; Imanishi, M.; Wadayama, T.; Hayakawa, R.; Hayashi, K. Trainable Projected Gradient Detector for Massive Overloaded MIMO Channels: Data-Driven Tuning Approach. IEEE Access 2019, 7, 93326–93338.
  13. Na, J.; Zhao, J.; Gao, G.; Li, Z. Output-Feedback Robust Control of Uncertain Systems via Online Data-Driven Learning. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 2650–2662.
  14. Makarem, S.; Delibas, B.; Koc, B. Data-Driven Tuning of PID Controlled Piezoelectric Ultrasonic Motor. Actuators 2021, 10, 148.
  15. Wang, Q.; Molenaar, P.; Harsh, S.; Freeman, K.; Xie, J.; Gold, C.; Rovine, M.; Ulbrecht, J. Personalized state-space modeling of glucose dynamics for type 1 diabetes using continuously monitored glucose, insulin dose, and meal intake: An extended Kalman filter approach. J. Diabetes Sci. Technol. 2014, 8, 331–345.
  16. Kothare, M.V.; Balakrishnan, V.; Morari, M. Robust constrained model predictive control using linear matrix inequalities. Automatica 1996, 32, 1361–1379.
  17. Fu, J.H.; Abed, E. Linear feedback stabilization of nonlinear systems. In Proceedings of the 30th IEEE Conference on Decision and Control, Brighton, UK, 11–13 December 1991; pp. 58–63.
  18. Eker, S.A.; Nikolaou, M. Linear control of nonlinear systems: Interplay between nonlinearity and feedback. AIChE J. 2002, 48, 1957–1980.
  19. Khalil, H. Nonlinear Systems; Prentice Hall: Hoboken, NJ, USA, 2002; p. 218.
  20. Bergman, R.N.; Ider, Y.Z.; Bowden, C.R.; Cobelli, C. Quantitative estimation of insulin sensitivity. Am. J. Physiol. Endocrinol. Metab. 1979, 236, E667.
  21. Hovorka, R.; Canonico, V.; Chassin, L.J.; Haueter, U.; Massi-Benedetti, M.; Orsini Federici, M.; Pieber, T.R.; Schaller, H.C.; Schaupp, L.; Vering, T.; et al. Nonlinear model predictive control of glucose concentration in subjects with type 1 diabetes. Physiol. Meas. 2004, 25, 905–920.
  22. Wilinska, M.E.; Chassin, L.J.; Schaller, H.C.; Schaupp, L.; Pieber, T.R.; Hovorka, R. Insulin kinetics in type-1 diabetes: Continuous and bolus delivery of rapid acting insulin. IEEE Trans. Biomed. Eng. 2005, 52, 3–12.
  23. Mösching, A. Reinforcement Learning Methods for Glucose Regulation in Type 1 Diabetes. Master's Thesis, Ecole Polytechnique Federale de Lausanne, Lausanne, Switzerland, 2016.
  24. Ngo, P.D.; Wei, S.; Holubova, A.; Muzik, J.; Godtliebsen, F. Reinforcement-learning optimal control for type-1 diabetes. In Proceedings of the 2018 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), Las Vegas, NV, USA, 4–7 March 2018; pp. 333–336.
Figure 1. Data-driven robust reinforcement learning diagram.
Figure 2. Comparison of blood glucose responses in nominal case without meal intake.
Figure 3. Comparison of insulin concentration in nominal case without meal intake.
Figure 4. Comparison of blood glucose responses in uncertain cases without meal intake.
Figure 5. Insulin concentration in uncertain cases without meal intake.
Figure 6. Update of controller gains during the learning process ($K_1$ and $K_2$ represent the first and second element of the controller gain vector $K$).
Figure 7. Carbohydrate intake per meal.
Figure 8. Blood glucose responses in simulation with meals.
Figure 9. Insulin concentration in simulation with meals.
Table 1. Glucose kinetics model parameters.

Parameter | Description | Unit
p_1 | Glucose effectiveness | min^-1
p_2 | Insulin sensitivity | min^-1
p_3 | Insulin rate of clearance | min^-1
A_G | Carbohydrate bioavailability | min^-1
τ_D | Glucose absorption constant | min
V | Plasma volume | mL
i_b(t) | Initial basal rate | μIU/(mL·min)
Table 2. Variables of the glucose kinetics model.

Variable | Description | Unit
D | Amount of carbohydrate intake | mmol/min
D_1 | Glucose in compartment 1 | mmol
D_2 | Glucose in compartment 2 | mmol
g(t) | Plasma glucose concentration | mmol/L
χ(t) | Interstitial insulin activity | min^-1
i(t) | Plasma insulin concentration | μIU/mL
Table 3. Standard deviations of process and measurement noises.

Uncertainty Case | Process Noise (w(t)) | Measurement Noise (v(t))
1 | 0 | 0
2 | 0 | 0.002
3 | 0.1 | 0.1
4 | 0.1 | 1

