Abstract

Inaccurate system parameters and unpredicted external disturbances affect the performance of non-linear controllers. In this paper, a new adaptive control algorithm under the reinforcement-learning framework is proposed to stabilize a quadrotor helicopter. Based on a command-filtered non-linear control algorithm, adaptive elements are added and learned by policy-search methods. To predict the inaccurate system parameters, a new kernel-based regression learning method is provided. In addition, Policy learning by Weighting Exploration with the Returns (PoWER) and Reward-Weighted Regression (RWR) are utilized to learn appropriate parameters for the adaptive elements in order to cancel the effect of external disturbances. Furthermore, numerical simulations under several conditions are performed, and the ability of adaptive trajectory-tracking control with reinforcement learning is demonstrated.

Keywords

Reinforcement Learning, Adaptive Control, Quadrotor

1. Introduction

In recent decades, intelligent aerial vehicles have experienced explosive growth. The ability to self-improve has become a natural requirement for robotic machines, because hard-coded controllers cannot adapt to new environments. Human beings learn to adjust their actions from the consequences of implementing policies [1]. Based on these experiences, humans develop adaptive behaviours and complete new tasks by trial and error. Reinforcement learning refers to a class of algorithms with which robots learn optimal policies [2].

Reinforcement learning techniques are widely utilized for smart robots [3–5]. Over the last few decades, policy-search methods have proven more efficient than value-search methods in domains with continuous states and high-dimensional problems [6], and several applications have been proposed [3, 7, 8]. In robot learning, some experiences can be modelled as motor primitives, which were introduced to construct complex behaviours [9]. By adjusting the meta-parameters, robots can generalize to new behaviours and adapt to new situations without having to relearn the overall shape of the motion [10]. Jan Peters employed an expectation-maximization policy-search algorithm called Reward-Weighted Regression (RWR) to solve an optimal control problem for robots [11]. Jens Kober [12] generalized the RWR algorithm with predictive variance for exploration and constructed a kernel-based version called Cost-regularized Kernel Regression (CrKR). CrKR and RWR draw meta-parameters from a Gaussian distribution $γ \sim N (γ | \bar{γ}, σ^{2})$ in order to explore during learning. With state-independent variance, this exploration causes a large variance at the beginning of learning and leads to expensive trials in applications such as flying vehicles. Policy learning by Weighting Exploration with the Returns (PoWER) applies a state-dependent variance to avoid sharp perturbations in actions. In this paper, we introduce a kernel-based PoWER for learning meta-parameters.

The quadrotor unmanned aircraft is a vertical take-off and landing aerial vehicle that has attracted a lot of attention. It is a typical under-actuated, non-linear, coupled system. Many control methods have been designed for the stabilization and trajectory tracking of the quadrotor. Bouabdallah [13, 14] applied the classical PID and LQ algorithms to attitude stabilization and achieved trajectory tracking by combining an inner/outer loop with backstepping and sliding-mode techniques. Madani and Benallegue [15] designed a controller to track the desired trajectory by using a full-state backstepping approach. Moreover, they [16] considered the uncertainty and unknown dynamics of the quadrotor and presented a robust trajectory-tracking controller. The neural network, as a powerful approximation approach, was introduced into the domain of adaptive and robust control. Nicol, Macnab and Ramirez-Serrano [17] utilized robust control with a neural network to solve the problem of wind disturbance. Raimúndez and Villaverde [18] augmented a conventional PD controller with a single hidden-layer neural network, tuned in real time, as an adaptive element to cancel the model inversion error.

In this paper, we present a uniform framework of adaptive control for the quadrotor based on policy-search algorithms. Four policy-search methods are utilized to adjust the model parameters and to compensate for external disturbances by learning the adaptive elements. The four methods are RWR, CrKR, PoWER and a kernel version of PoWER, the last of which is introduced in this paper.

This paper is organized as follows. Section 2 illustrates the details of the policy-learning problem. The reviews of RWR, CrKR and PoWER are presented in Section 3, along with the description of an improved new algorithm. The model of the quadrotor is illustrated in Section 4. In Section 5, the landscape of the adaptive trajectory controller is described. Evaluations of and experiments with four algorithms are presented in Section 6, and we offer some concluding remarks in Section 7.

2. Policy-learning Problem

In a real control problem, a policy or controller is constructed in order to acquire effective actions for some purposes. The policies can be determined by using appropriate parameters γ, which can be drawn from the stochastic distribution of $π (γ | s)$ . The policy-searching algorithm learns to acquire the optimal γ by maximizing the expected return of:

$J (π) = \int_{S} p (s) \int_{Γ} π (γ | s) R (s) d γ d s$ (1)

where $S$ denotes the space of states s, $Γ$ denotes the space of parameter γ and $R (s)$ represents the return on specified s. When episodic reinforcement learning is considered with the finite horizon T, the return is presented as:

$R (s) = \sum_{t = 0}^{T - 1} r (s, t) / T$ (2)

where $r (s, t)$ denotes the immediate return for each step.

For policy search, a Gaussian policy is commonly employed:

$π (γ | s) \sim N (γ | \bar{γ} (s), Σ_{γ})$ (3)

where the mean $\bar{γ} (s) = ϕ {(s)}^{T} ω$ is built from the basis functions $ϕ (s)$ and the weights ω. The exploration of the policy can be constructed either with the state-independent variance $Σ_{γ} = σ^{2} I$ , where σ is a constant, or with the state-dependent variance $Σ_{γ} = ϕ {(s)}^{T} Σ_{ω} ϕ (s)$ , where $Σ_{ω}$ is a constant covariance of the weights.
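The two exploration schemes of Eq. (3) can be illustrated with a minimal sketch for a scalar meta-parameter. The function name, the feature values and the weights below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

def sample_meta_parameter(phi, omega, rng, sigma=None, Sigma_omega=None):
    """Draw a scalar gamma from the Gaussian policy of Eq. (3).

    phi: basis activations phi(s), shape (n,); omega: weights, shape (n,).
    Pass Sigma_omega for state-dependent variance, or sigma for a constant one.
    """
    mean = phi @ omega                       # mean = phi(s)^T omega
    if Sigma_omega is not None:
        var = phi @ Sigma_omega @ phi        # phi(s)^T Sigma_omega phi(s)
    elif sigma is not None:
        var = sigma ** 2                     # state-independent variance
    else:
        raise ValueError("provide either sigma or Sigma_omega")
    return rng.normal(mean, np.sqrt(var)), mean, var

rng = np.random.default_rng(0)
phi = np.array([0.2, 0.5, 0.3])              # hypothetical basis activations
omega = np.array([1.0, 2.0, 3.0])            # hypothetical weights
gamma, mean, var = sample_meta_parameter(phi, omega, rng,
                                         Sigma_omega=0.1 * np.eye(3))
```

With the state-dependent form, the exploration variance shrinks wherever the basis activations are small, which is the property that the later sections rely on to avoid sharp perturbations.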

3. Policy Search Approaches

In this section, we first review three Expectation-Maximization (EM) algorithms provided by Jan Peters and Jens Kober [11, 19, 20]. Subsequently, cost-regularized kernel regression with weighting exploration is described.

3.1. Expectation-Maximization-based policy-search approaches

While the policy-gradient methods require the user to specify a learning rate for the parameter learning's progress, Expectation-Maximization policy search approaches update parameters as a weighted maximum likelihood estimate, which has a closed form solution for most of the policies used.

The Monte-Carlo Expectation-Maximization(MC-EM) method is considered as an efficient policy-search method. Firstly, an episode-based MC-EM algorithm is presented by Algorithm 1 [21].

Algorithm 1

Episode-Based MC-EM Policy Updates

Require: d: weighting function, d = f(R).

1: repeat

2: Collect the data set $D_{e}^{[i]} = \{s^{[i]}, γ^{[i]}, R^{[i]}\}$ ;

3: Compute the weighting $d^{[i]} = f (R^{[i]})$ ;

4: Compute the new weights ω by

$ω_{new} = \arg \max_{ω} \sum_{i = 1}^{N} d^{[i]} \log π_{ω} (γ^{[i]} | s^{[i]})$ (I)

where $γ^{[i]} = ϕ {(s^{[i]})}^{T} ω^{[i]}$ and N is the number of executed episodes;

5: Update the policy parameter by $γ_{new} = ϕ {(s)}^{T} ω_{new}$ ;

6: until convergence, $ω_{new} ≈ ω_{old}$
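The loop of Algorithm 1 can be sketched for the simplest possible case: a single scalar meta-parameter with a constant basis $ϕ (s) = 1$ , for which the arg max in Eq. (I) reduces to a weighted average of the sampled parameters. The exponential weighting $f (R) = \exp (β R)$ , the quadratic toy return and all numerical values are assumptions made for this illustration only.

```python
import numpy as np

def mc_em_search(episode_return, omega0, sigma, beta,
                 n_episodes, n_iters, rng, tol=1e-4):
    """Episode-based MC-EM for a scalar parameter with constant basis phi(s) = 1."""
    omega = float(omega0)
    for _ in range(n_iters):
        # Step 2: collect N episodes with perturbed meta-parameters.
        gammas = omega + sigma * rng.standard_normal(n_episodes)
        returns = np.array([episode_return(g) for g in gammas])
        # Step 3: weighting d = f(R); the exponential form is an assumption.
        d = np.exp(beta * (returns - returns.max()))
        # Step 4: weighted maximum likelihood; for a Gaussian with fixed
        # variance and phi = 1 this is the weighted mean of the samples.
        omega_new = np.sum(d * gammas) / np.sum(d)
        # Step 6: stop once the weights no longer change noticeably.
        converged = abs(omega_new - omega) < tol
        omega = omega_new
        if converged:
            break
    return omega

rng = np.random.default_rng(0)
# Hypothetical return with its optimum at gamma = 2.
omega_hat = mc_em_search(lambda g: -(g - 2.0) ** 2, omega0=0.0, sigma=0.5,
                         beta=5.0, n_episodes=50, n_iters=30, rng=rng)
```

No learning rate appears anywhere in the loop, which is the practical difference from policy-gradient methods noted at the start of this subsection.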

Reward-Weighted Regression (RWR) is a version of the episode-based MC-EM policy-learning algorithm that uses a linear mean policy with a state-independent variance, $π (γ | s) = N (γ | ϕ {(s)}^{T} ω, Σ_{γ})$ . The parameters ω can be updated by Eq. (I). The objective to be maximized, including a regularization term, is:

$J (ω) = \sum_{i = 1}^{N} d^{[i]} \log π_{ω} (γ^{[i]} | s^{[i]}) - \frac{λ}{2} ω^{T} ω$ (4)

where λ is a constant that weights the penalty term $\frac{λ}{2} ω^{T} ω$ . The optimal weights ω are obtained by differentiating Equation (4) with respect to ω and setting the result to zero:

$\frac{\partial J (ω)}{\partial ω} = \sum_{i = 1}^{N} d^{[i]} \frac{\partial}{\partial ω} \log π_{ω} (γ^{[i]} | s^{[i]}) - λ ω = 0$ (5)

and

$\frac{\partial \log π_{ω} (γ^{[i]} | s^{[i]})}{\partial ω} = \frac{2 ϕ (γ^{T} - ϕ^{T} ω)}{2 σ^{2}} = \frac{ϕ (γ^{T} - ϕ^{T} ω)}{σ^{2}}$ (6)

We insert Eq. (6) into Eq. (5), which can be rewritten as:

$\frac{\partial J (ω)}{\partial ω} = \frac{Φ D (Γ - Φ^{T} ω)}{σ^{2}} - λ ω = 0$ (7)

where $Φ = [ϕ (s^{[1]}), \dots, ϕ (s^{[N]})]$ , $Γ = [γ^{[1]}, \dots, γ^{[N]}]^{T}$ and the diagonal matrix D contains the weights $d^{[i]}$ placed on the diagonal. With the constants $σ^{2}$ and λ, we can define that $\bar{λ} = σ^{2} λ$ . Thus, the parameters ω can be updated by:

$ω_{n e w} = (Φ D Φ^{T} + \bar{λ} I)^{- 1} Φ D Γ$ (8)
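Equation (8) is a weighted ridge regression and can be sketched directly in NumPy. The function name and the small test problem are hypothetical; the matrix layout follows the definitions of Φ, D and Γ given above.

```python
import numpy as np

def rwr_update(Phi, d, Gamma, lam_bar):
    """Closed-form RWR update of Eq. (8).

    Phi: (n_features, N) matrix whose columns are phi(s^[i]).
    d: (N,) episode weights forming the diagonal of D.
    Gamma: (N,) or (N, n_params) sampled meta-parameters.
    """
    PhiD = Phi * d                                   # equals Phi @ diag(d)
    A = PhiD @ Phi.T + lam_bar * np.eye(Phi.shape[0])
    return np.linalg.solve(A, PhiD @ Gamma)          # avoids an explicit inverse

# Hypothetical data generated from known weights, to check the update.
Phi = np.array([[1.0, 1.0, 1.0],
                [0.0, 1.0, 2.0]])
omega_true = np.array([1.0, 2.0])
Gamma = Phi.T @ omega_true
d = np.array([1.0, 0.5, 2.0])
omega_new = rwr_update(Phi, d, Gamma, lam_bar=1e-10)
```

Because the data here are noise-free and the regularization is negligible, the update recovers the generating weights regardless of the episode weights.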

Instead of exploring with a state-independent variance as in RWR, PoWER augments the policy with a state-dependent variance, and thus $π (γ | s) = N (γ | ϕ {(s)}^{T} ω, ϕ {(s)}^{T} Σ_{ω} ϕ (s))$ . Assuming that $Σ_{ω}$ is known, the weighted maximum likelihood algorithm in Algorithm 1 can be generalized, and the update rule for the parameters ω can be summarized as:

$ω_{n e w} = ω_{o l d} + E {[\sum_{t = 0}^{T - 1} L_{t} (s) d_{t}]}^{- 1} E [\sum_{t = 0}^{T - 1} L_{t} (s) d_{t} ε_{t}]$ (9)

where $L_{t} (s) = ϕ_{t} (s) ϕ_{t} {(s)}^{T} {(ϕ_{t} {(s)}^{T} Σ_{ω} ϕ_{t} (s))}^{- 1}$ and $ε_{t} \sim N (0, Σ_{ω})$ .
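A minimal sketch of the update in Eq. (9) follows. It assumes that the expectations are approximated by sums over a batch of roll-outs (the common normalization factor cancels) and that the per-step weights $d_{t}$ are already available; the function name and array layout are hypothetical.

```python
import numpy as np

def power_update(omega_old, phis, eps, d, Sigma_omega):
    """PoWER-style update of Eq. (9), with expectations replaced by sample sums.

    phis: (M, T, n) basis vectors phi_t(s) for M roll-outs of T steps.
    eps:  (M, T, n) exploration noise applied to the weights.
    d:    (M, T)    weights d_t.
    """
    n = omega_old.shape[0]
    A = np.zeros((n, n))
    b = np.zeros(n)
    for phi_m, eps_m, d_m in zip(phis, eps, d):
        for phi_t, eps_t, d_t in zip(phi_m, eps_m, d_m):
            # L_t = phi phi^T / (phi^T Sigma_omega phi)
            L_t = np.outer(phi_t, phi_t) / (phi_t @ Sigma_omega @ phi_t)
            A += d_t * L_t
            b += d_t * (L_t @ eps_t)
    return omega_old + np.linalg.solve(A, b)

# Hypothetical check: if every step used the same perturbation, the update
# should move the weights by exactly that perturbation.
phis = np.eye(2)[None, :, :]                 # one roll-out, two steps
eps = np.tile(np.array([0.3, -0.1]), (1, 2, 1))
d = np.array([[1.0, 2.0]])
omega_new = power_update(np.zeros(2), phis, eps, d, np.eye(2))
```

Each $L_{t}$ has rank one, so the accumulated matrix is only invertible when the basis vectors visited during the roll-outs span the weight space.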

For most policy-learning algorithms, designing good basis functions is challenging. Therefore, a kernel-based version of reward-weighted regression, called Cost-regularized Kernel Regression (CrKR), is introduced. By inserting Eq. (8) into $\bar{γ} = ϕ^{T} ω$ and using the Woodbury formula, the expectation of γ can be rewritten as:

$\begin{array}{rl} \bar{γ} = & ϕ^{T} ω = ϕ^{T} {(Φ D Φ^{T} + \bar{λ} I)}^{- 1} Φ D Γ \\ = & ϕ^{T} Φ {(Φ^{T} Φ + \bar{λ} D^{- 1})}^{- 1} Γ \end{array}$ (10)

With the kernel function $k (x, x^{'})$ where $k (x, x^{'}) = ϕ {(x)}^{T} ϕ (x^{'})$ , we can obtain:

$\bar{γ} = k (s) (K + \bar{λ} C)^{- 1} Γ$ (11)

where $K = Φ^{T} Φ$ , $k (s) = ϕ {(s)}^{T} Φ$ and $C = D^{- 1}$ . In order to incorporate exploration, the CrKR refers to a stochastic policy whose variance is updated with a Gaussian process regression algorithm, and we can obtain the variance from:

$σ_{γ}^{2} = k (s, s) - k (s) (K + \bar{λ} C)^{- 1} k {(s)}^{T}$ (12)

CrKR utilizes kernel functionality and need not define parameterized policy and basis functions. With exploration in space $Γ$ , CrKR corresponds to the Gaussian process regression, where the costs on the diagonal are input-dependent noise priors.
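The predictive mean and variance of Eqs. (11) and (12) can be sketched as follows. The Gaussian kernel used as the default is an assumption for illustration (the text only requires $k (x, x^{'}) = ϕ {(x)}^{T} ϕ (x^{'})$ ), and the function names and sample data are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, width=1.0):
    """Gaussian kernel matrix between the rows of A (M, dim) and B (N, dim)."""
    diff = A[:, None, :] - B[None, :, :]
    return np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * width ** 2))

def crkr_predict(s, S, Gamma, costs, lam_bar, kernel=gaussian_kernel):
    """CrKR mean (Eq. 11) and variance (Eq. 12) at the query state s.

    S: (N, dim) training states; Gamma: (N,) or (N, p) meta-parameters;
    costs: (N,) diagonal of C = D^{-1}.
    """
    s = np.atleast_2d(s)
    K = kernel(S, S)
    k = kernel(s, S)                               # shape (1, N)
    A = K + lam_bar * np.diag(costs)
    mean = k @ np.linalg.solve(A, Gamma)
    var = kernel(s, s) - k @ np.linalg.solve(A, k.T)
    return mean[0], var.item()

# Hypothetical one-dimensional example.
S = np.array([[0.5], [1.0], [2.0]])
Gamma = np.array([0.2, 0.4, 0.9])
costs = np.array([1.0, 0.5, 2.0])
mean, var = crkr_predict(np.array([1.5]), S, Gamma, costs, lam_bar=0.5)
```

Samples with a high cost receive a large diagonal entry and are therefore treated as noisy observations, so they influence the prediction less than low-cost samples.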

3.2. Cost-regularized kernel regression with weighting exploration

As in PoWER, we adopt a state-dependent variance for the stochastic policy and apply the perturbation to the parameters ω. Such a policy can be written as:

$π (γ | s) = N (γ | ϕ {(s)}^{T} ω, ϕ {(s)}^{T} Σ_{ω} ϕ (s))$ (13)

Inserting the policy of Eq. (13) into the stationarity condition of Eq. (5) yields:

$ω_{n e w} = {(Φ \tilde{D} Φ^{T} + \bar{λ} I)}^{- 1} Φ \tilde{D} Γ$ (14)

where $\tilde{D}$ is a diagonal weighting matrix with the entries of:

${\tilde{d}}^{[i]} = {(ϕ {(s^{[i]})}^{T} Σ_{ω} ϕ (s^{[i]}))}^{- 1} d^{[i]}$ (15)

for each episode i.

Therefore, with the weights' exploration, the kernel-based version of meta-parameter regression can be transformed to:

$\bar{γ} = k (s) {(K + λ \tilde{C})}^{- 1} Γ$ (16)

where $\tilde{C} = {\tilde{D}}^{- 1}$ . According to the Gaussian process regression, the variance can be obtained by:

$σ_{γ}^{2} = k (s, s) - k (s) {(K + λ \tilde{C})}^{- 1} k {(s)}^{T}$ (17)

Cost-regularized kernel regression with weighting exploration (We-CrKR) obtains appropriate parameters by exploring in the weight space instead of the parameter space.
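The re-weighting of Eq. (15) and the predictions of Eqs. (16) and (17) can be sketched with explicit features, so that the kernel quantities are $K = Φ^{T} Φ$ and $k (s) = ϕ {(s)}^{T} Φ$ . The function name and the numbers are hypothetical, and a single regularization constant `lam` is assumed throughout.

```python
import numpy as np

def we_crkr_predict(phi_s, Phi, Gamma, d, Sigma_omega, lam):
    """We-CrKR mean (Eq. 16) and variance (Eq. 17) with explicit features.

    phi_s: (n,) features of the query state; Phi: (n, N) training features;
    Gamma: (N,) meta-parameters; d: (N,) episode weights.
    """
    # phi(s_i)^T Sigma_omega phi(s_i) for every training sample i.
    state_var = np.einsum('in,ij,jn->n', Phi, Sigma_omega, Phi)
    d_tilde = d / state_var                       # Eq. (15)
    C_tilde = np.diag(1.0 / d_tilde)              # C~ = D~^{-1}
    K = Phi.T @ Phi
    k = phi_s @ Phi
    A = K + lam * C_tilde
    mean = k @ np.linalg.solve(A, Gamma)
    var = phi_s @ phi_s - k @ np.linalg.solve(A, k)
    return mean, var

# Hypothetical data.
Phi = np.array([[1.0, 0.0, 1.0],
                [0.0, 1.0, 1.0]])
Gamma = np.array([1.0, 2.0, 2.5])
d = np.array([1.0, 2.0, 0.5])
Sigma_omega = np.diag([0.5, 2.0])
phi_s = np.array([0.5, -1.0])
mean, var = we_crkr_predict(phi_s, Phi, Gamma, d, Sigma_omega, lam=0.5)
```

Samples collected in states where the exploration variance was large are down-weighted, so strongly perturbed roll-outs contribute less to the regression.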

A simple simulated planar cannon-shooting task is considered to benchmark the two reinforcement learning algorithms, CrKR and We-CrKR. The 2D toy-cannon shooting game is described by Lawrence [22]. The simulation is set up in a plane with Stokes's drag and a horizontal wind model (the wind velocity equals 1 m/s). The toy cannon is located at (0, 0), while the target is placed on the x-axis at a desired distance in the range (0, 3). The initial velocity and the launch direction are considered as the meta-parameters. The cost function is defined as:

$c = {(x - x_{t})}^{2}$ (18)

where x_t denotes the target position on the x -axis and x indicates the impact position.

CrKR and We-CrKR are compared with each other, and the performances are averaged over 10 complete learning runs. Twenty-five Gaussian functions, placed on a regular grid over the input state (the desired position), are used as the basis functions for each parameter. Both algorithms converge after 500 episodes. However, because the meta-parameters derived from CrKR are still perturbed noticeably before the completion of these 500 episodes, the costs of CrKR are slightly higher and have a larger standard deviation than those of the We-CrKR algorithm. Figure 1 describes the performance of the two methods: lines show the median and error bars indicate the standard deviation.

Figure 1.

Costs of CrKR and We-CrKR ($λ = 0.5$) for the toy cannon problem

4. Quadrotor Mathematical Model

The model of a quadrotor is presented as follows. Firstly, two coordinate systems are introduced: the inertial frame ℐ and the body frame ℬ, which are shown in Figure 2. The $O_{g} x_{g} y_{g} z_{g}$ denotes the body frame ℬ and $O x y z$ denotes the inertial frame ℐ.

$ξ = (x, y, z)$ denotes the position vector of the centre of mass in the inertial frame ℐ. $Θ = (ϕ, θ, ψ)$ denotes the Euler angles: ψ is the yaw angle around the z-axis, θ is the pitch angle around the y-axis and ϕ is the roll angle around the x-axis. With the yaw-pitch-roll definition of the Euler angles, and with $c$ and $s$ abbreviating cos and sin, the rotation matrix $R$ from the body frame to the inertial frame is obtained by:

$R = [\begin{matrix} c ψ c θ & c ψ s θ s ϕ - c ϕ s ψ & c ψ c ϕ s θ + s ψ s ϕ \\ c θ s ψ & c ψ c ϕ + s ψ s θ s ϕ & c ϕ s ψ s θ - c ψ s ϕ \\ - s θ & c θ s ϕ & c θ c ϕ \end{matrix}]$ (19)
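The rotation matrix of Eq. (19) can be written out and checked numerically. In this sketch the entry in the second row and third column is taken as $c ϕ s ψ s θ - c ψ s ϕ$ , which is what orthonormality of a yaw-pitch-roll rotation requires; the function name is hypothetical.

```python
import numpy as np

def rotation_body_to_inertial(phi, theta, psi):
    """Yaw-pitch-roll rotation matrix from the body to the inertial frame."""
    c, s = np.cos, np.sin
    return np.array([
        [c(psi) * c(theta),
         c(psi) * s(theta) * s(phi) - c(phi) * s(psi),
         c(psi) * c(phi) * s(theta) + s(psi) * s(phi)],
        [c(theta) * s(psi),
         c(psi) * c(phi) + s(psi) * s(theta) * s(phi),
         c(phi) * s(psi) * s(theta) - c(psi) * s(phi)],
        [-s(theta),
         c(theta) * s(phi),
         c(theta) * c(phi)],
    ])

# Arbitrary test angles (roll, pitch, yaw) in radians.
R = rotation_body_to_inertial(0.3, -0.2, 1.1)
orthonormal = np.allclose(R @ R.T, np.eye(3))
determinant = np.linalg.det(R)
```

A proper rotation must satisfy $R R^{T} = I$ with determinant one, which gives a quick consistency check on every entry of the matrix.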

Using the Newton-Euler formula, the dynamics of the quadrotor in a fixed inertial frame can be expressed by:

$\begin{array}{l} \dot{ξ} = v \\ m \dot{v} = F \\ \dot{R} = R \overset{⌢}{Ω} \\ Ι \dot{Ω} = - Ω \times IΩ + τ \end{array}$ (20)

where $v$ is the linear velocity in the inertial frame, $Ω$ denotes the angular velocity of the quadrotor in the body frame and $I$ indicates the inertia matrix around the centre of mass. $\overset{⌢}{Ω}$ is the skew-symmetric matrix of the $Ω$ . $τ = τ_{B} + τ_{G}$ , where $τ_{B} = (τ_{ϕ}, τ_{θ}, τ_{ψ})$ denotes the moments produced by rotors in the body frame,

$τ_{B} = [\begin{matrix} k l (w_{1}^{2} - w_{3}^{2}) \\ k l (w_{2}^{2} - w_{4}^{2}) \\ b (w_{1}^{2} + w_{3}^{2} - w_{2}^{2} - w_{4}^{2}) \end{matrix}]$ (21)

where $k > 0$ is a constant for the thrust produced by the rotor, $b > 0$ is a constant for quasi-stationary manoeuvres in free flight and w is the rotor speed for every motor. $τ_{G}$ represents the gyroscopic torques in the body frame.

$F$ represents the sum of the forces applied to the rigid body in the inertial frame. The thrust $T_{r}$ , expressed in the body frame, and gravity $T_{g}$ , expressed in the inertial frame, can be written as:

$\begin{array}{l} T_{r} = {[\begin{matrix} 0 & 0 & - k (\sum_{i = 1}^{4} w_{i}^{2}) \end{matrix}]}^{T} \\ T_{g} = {[\begin{matrix} 0 & 0 & m g \end{matrix}]}^{T} \end{array}$ (22)

In the inertial frame, the force $F$ can be expressed by:

$F = R T_{r} + T_{g}$ (23)
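The rotor torques and the net force can be sketched as below. The yaw row is written with alternating rotor signs (rotors 1 and 3 against rotors 2 and 4), and the constants in the example are placeholders rather than the values of any particular vehicle; the function names are hypothetical.

```python
import numpy as np

def rotor_torques(w, k, l, b):
    """Body-frame moments produced by the four rotor speeds w (Eq. 21)."""
    w2 = np.square(np.asarray(w, dtype=float))
    return np.array([
        k * l * (w2[0] - w2[2]),                 # roll moment
        k * l * (w2[1] - w2[3]),                 # pitch moment
        b * (w2[0] + w2[2] - w2[1] - w2[3]),     # yaw moment
    ])

def net_force_inertial(R, w, m, k, g=9.81):
    """Net force F = R T_r + T_g in the inertial frame (Eqs. 22 and 23)."""
    T_r = np.array([0.0, 0.0, -k * np.sum(np.square(w))])
    T_g = np.array([0.0, 0.0, m * g])
    return R @ T_r + T_g

# Placeholder constants (assumptions for illustration only).
m, k, l, b = 0.5, 3.0e-6, 0.2, 1.0e-7
w_hover = np.sqrt(m * 9.81 / (4.0 * k))          # speed at which thrust equals weight
w = np.full(4, w_hover)
F_hover = net_force_inertial(np.eye(3), w, m, k)
tau_hover = rotor_torques(w, k, l, b)
```

The positive sign of $m g$ together with the negative thrust component suggests an inertial z-axis that points downwards, so at level hover the two terms cancel.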

Figure 2.

Coordinate System of the Quadrotor

5. Adaptive Trajectory-tracking Controller

The trajectory-tracking controller in reference [23] is utilized as the basic policy. The basic controller adopts the inner/outer-loop structure. A PD closed-loop equation for the position error of the quadrotor establishes the relationship between the attitude and the linear acceleration. Subsequently, a command-filtered backstepping technique is used to track the commanded attitude signal produced by the outer-loop position controller. The details are specified by Algorithm 2.

There are two command-filter parameters, $\bar{T} = \mathrm{diag} \{t_{1}, t_{2}, t_{3}\} > 0$ and $Λ = \mathrm{diag} \{λ_{1}, λ_{2}, λ_{3}\} > 0$ . Moreover, the control parameters $K_{d}$ , $K_{p}$ , $K_{i 1}$ and $K_{i 2}$ are required to be positive definite in theory.

In non-linear control, the outputs bear a close relationship to the system parameters of the quadrotor, such as the mass and the moment of inertia. However, accurate system parameters are hardly ever available and, even more importantly, some of them change during flight. Another problem is external disturbances, such as gusts of wind.

In order to solve the first problem, the system parameters are considered as the meta-parameters for the reinforcement learning algorithm. According to the flight performance and temporal reward, the parameters can be updated to create a better reward and flight performance. The logic process of the adaptive controller is described by Figure 3. The inputs of the adaptive policy-search block include the states $s$ , and the block updates the parameters of the adaptive elements via policy-search algorithms.

Figure 3.

Adaptive trajectory-tracking control block diagram

Supposing that the accurate mass and moments of inertia are unknown, the meta-parameters γ consist of the mass and the moments of inertia used in the non-linear controller:

$γ = {[m, I_{x}, I_{y}, I_{z}]}^{T}$ (24)

Algorithm 2

Trajectory tracking control with command-filtered compensation

Require: $ξ_{c}$ : the commanded trajectory; ${\dot{ξ}}_{c}$ : the commanded velocity; $ψ_{c}$ : the commanded yaw attitude;

1: for all steps do

2: Calculate the virtual control vector $U = (u_{1}, u_{2}, u_{3})$ by

$U = \ddot{ξ} = {\ddot{ξ}}_{c} + K_{d} ({\dot{ξ}}_{c} - \dot{ξ}) + K_{p} (ξ_{c} - ξ)$ (II)

3: Get the commanded pitch $θ_{c}$ and roll $ϕ_{c}$ attitudes by

$\begin{array}{l} θ_{c} = \arctan (\frac{u_{1} \cos ψ_{c} + u_{2} \sin ψ_{c}}{u_{3} + g}) \\ ϕ_{c} = \arcsin (\frac{u_{1} \sin ψ_{c} - u_{2} \cos ψ_{c}}{\sqrt{u_{1}^{2} + u_{2}^{2} + {(u_{3} + g)}^{2}}}) \end{array}$ (III)

4: Calculate the commanded angular velocity $Ω_{c}$ and the filter error ε by

$Ω_{c} = - \bar{T} \int (Ω_{c} - Ω_{d})$ (IV)

$ε = - K_{i 1} \int ε + W \int (Ω_{c} - Ω_{d})$ (V)

$W = [\begin{matrix} 1 & \sin ϕ \tan θ & \cos ϕ \tan θ \\ 0 & \cos ϕ & - \sin ϕ \\ 0 & \sin ϕ \sec θ & \cos ϕ \sec θ \end{matrix}]$ (VI)

5: Extract the commanded attitude derivative ${\dot{Θ}}_{c}$ by

$\begin{array}{l} {\dot{X}}_{1} = X_{2} \\ {\dot{X}}_{2} = - 2 Λ X_{1} - Λ^{2} (X_{1} - Θ_{c}) \end{array}$ (VII)

${\dot{Θ}}_{c} = X_{2}$ (VIII)

6: The control inputs $T_{r}$ and $τ_{B}$ can be provided by

$\begin{array}{l} T_{r} = m [u_{1} (\sin θ \cos ψ \cos ϕ + \sin ψ \sin ϕ) \\ + u_{2} (\sin θ \sin ψ \cos ϕ - \cos ψ \sin ϕ) \\ + (u_{3} + g) \cos θ \cos ϕ] \end{array}$ (IX)

$\begin{array}{l} τ_{B} = (Ω \times I Ω) + G_{a} - I \bar{T} Ω_{c} + I \bar{T} W^{- 1} X_{2} \\ - I \bar{T} W^{- 1} K_{i 1} (Θ - Θ_{c}) \\ - I W^{T} (Θ - Θ_{c} - ε) - I K_{i 2} (Ω - Ω_{c}) \end{array}$ (X)

7: Execute the control input and sample the states of the quadrotor.

8: end for
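Steps 2 and 3 of Algorithm 2, that is, Eqs. (II) and (III), can be sketched as a single outer-loop function. The function and argument names are hypothetical, and the gains in the example are simple placeholder matrices.

```python
import numpy as np

def outer_loop(pos, vel, pos_c, vel_c, acc_c, psi_c, K_d, K_p, g=9.81):
    """Virtual control U (Eq. II) and commanded pitch/roll angles (Eq. III)."""
    U = acc_c + K_d @ (vel_c - vel) + K_p @ (pos_c - pos)
    u1, u2, u3 = U
    theta_c = np.arctan((u1 * np.cos(psi_c) + u2 * np.sin(psi_c)) / (u3 + g))
    phi_c = np.arcsin((u1 * np.sin(psi_c) - u2 * np.cos(psi_c))
                      / np.sqrt(u1 ** 2 + u2 ** 2 + (u3 + g) ** 2))
    return U, theta_c, phi_c

zero = np.zeros(3)
K_d = 2.0 * np.eye(3)      # placeholder derivative gain
K_p = np.eye(3)            # placeholder proportional gain

# Perfect tracking: no virtual acceleration and level commanded attitude.
U0, theta0, phi0 = outer_loop(zero, zero, zero, zero, zero, 0.0, K_d, K_p)

# A 1 m position error along x with zero yaw commands a pure pitch angle.
U1, theta1, phi1 = outer_loop(zero, zero, np.array([1.0, 0.0, 0.0]),
                              zero, zero, 0.0, K_d, K_p)
```

The denominators stay well defined as long as $u_{3} + g$ remains positive, which holds whenever the commanded vertical acceleration does not exceed gravity in magnitude.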

We choose the states $s$ which are composed of $ξ - ξ_{c}$ , $\dot{ξ} - {\dot{ξ}}_{c}$ , $Θ - Θ_{c}$ and Ω. For CrKR and We-CrKR, the kernel function is defined by:

$k (s, s') = ϕ {(s)}^{T} ϕ (s')$ (25)

where $ϕ (s)$ is given by the normalized Gaussian basis functions:

$ϕ_{i} (s) = ψ_{i} / \sum_{k = 1}^{N} ψ_{k}$ (26)

$ψ_{i} = \exp (- h_{i} {‖ s - c_{i} ‖}^{2})$ (27)

where c_i and h_i denote the centres and widths in phase space, respectively. The centres are spaced on a grid of states $s$ . Subsequently, we define the episodic reward by:
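The normalized features of Eqs. (26) and (27) can be sketched as follows. The grid of 25 centres on a one-dimensional state and the common width are hypothetical choices for illustration.

```python
import numpy as np

def normalized_rbf(s, centres, widths):
    """Normalized Gaussian basis functions phi_i(s) of Eqs. (26) and (27).

    s: (dim,) state; centres: (N, dim) centres c_i; widths: (N,) widths h_i.
    """
    psi = np.exp(-widths * np.sum((centres - s) ** 2, axis=1))   # Eq. (27)
    return psi / psi.sum()                                       # Eq. (26)

# Hypothetical regular grid of centres on a one-dimensional state.
centres = np.linspace(0.0, 3.0, 25)[:, None]
widths = np.full(25, 10.0)
features = normalized_rbf(np.array([1.5]), centres, widths)
```

Because of the normalization the features always sum to one, so the policy mean $ϕ {(s)}^{T} ω$ is a convex combination of the weights.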

$\begin{array}{l} r (s, t) = & α_{1} \exp (- | ξ (t) - ξ_{c} (t) |) \\ & + α_{2} \exp (- | Θ (t) - Θ_{c} (t) |) \\ & + α_{3} \exp (- | {\dot{Θ}}_{c} (t) |) \end{array}$ (28)

$R = \sum_{t = 0}^{T - 1} r (s_{t}, t) / T$ (29)

where $α_{1}$ , $α_{2}$ and $α_{3}$ denote the weights for the different parts of the temporal reward $r (s, t)$ and R denotes the episodic reward. In order to predict the accurate system parameters, CrKR and We-CrKR are appropriate because they adjust the parameters directly without requiring specified basis functions.
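The reward of Eqs. (28) and (29) can be sketched as below. The text does not state whether $| \cdot |$ is a vector norm or is applied element-wise; this sketch assumes the Euclidean norm, and the function names and weights are hypothetical.

```python
import numpy as np

def temporal_reward(pos, pos_c, att, att_c, att_c_dot, alpha):
    """Temporal reward r(s, t) of Eq. (28), assuming Euclidean norms."""
    return (alpha[0] * np.exp(-np.linalg.norm(pos - pos_c))
            + alpha[1] * np.exp(-np.linalg.norm(att - att_c))
            + alpha[2] * np.exp(-np.linalg.norm(att_c_dot)))

def episodic_reward(rewards):
    """Episodic reward R of Eq. (29): the mean of the temporal rewards."""
    return np.sum(rewards) / len(rewards)

alpha = np.array([1.0, 0.5, 0.2])            # hypothetical weights
zero = np.zeros(3)
# With perfect tracking and a constant commanded attitude every exponential
# equals one, so the temporal reward is the sum of the weights.
r_best = temporal_reward(zero, zero, zero, zero, zero, alpha)
R = episodic_reward([r_best, r_best, r_best])
```

The third term rewards small commanded attitude rates, which discourages aggressive attitude commands even when the position error is large.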

Considering the effects of external disturbances, we add adaptive elements to the control parameters. The logic process is the same as the one presented in Figure 3, but the adaptive elements now act on different parts of the controller. We define the new controller parameters with adaptive elements as ${\bar{K}}_{m} = K_{m} + γ_{m}$ , with $m = p, d, i 1, i 2$ . RWR and PoWER are utilized to learn the parameters $γ_{m}$ for external disturbance cancellation. Unlike the prediction of system parameters, the external disturbance is only observed through its effect on the evolution of the system state; therefore, the policy $π (γ | s)$ needs to be learned, and the parameters γ of the adaptive elements are drawn from this policy in order to compensate for the effects of external disturbances.

6. Evaluations and Experiments

In this section, we consider the system parameters' adaptation and adaptive control for external disturbances within the framework of reinforcement learning. A quadrotor model is chosen to evaluate the performance of the algorithms, and its basic parameters are presented in Table 1 [24].

Table 1.

Model parameters of the quadrotor unmanned mini-helicopter

Parameter	Value	Unit
m	0.468	kg
g	9.81	m/s²
l	0.225	m
I_r	3.357×10⁻⁵	kg·m²
I_x	4.856×10⁻³	kg·m²
I_y	4.856×10⁻³	kg·m²
I_z	8.801×10⁻³	kg·m²
b	2.98×10⁻⁶	N·s²/rad²
k	1.14×10⁻⁷	N·m·s²/rad²

The basic control parameters are the same as those given in reference [23] and can be described as:

$\begin{array}{rl} K_{d} = & \mathrm{diag} \{2, 2, 2\} \\ K_{p} = & \mathrm{diag} \{1, 1, 1\} \\ K_{i 1} = & \mathrm{diag} \{0.1, 0.1, 0.1\} \\ K_{i 2} = & \mathrm{diag} \{0.1, 0.1, 0.1\} \\ \bar{T} = & \mathrm{diag} \{15, 15, 15\} \\ Λ = & \mathrm{diag} \{10, 10, 10\} \end{array}$

and the sampling time is $Δ t = 0.01 \, \mathrm{s}$ . The initial states are $ξ_{0} = {(0, 0, 0)}^{T}$ and $Θ_{0} = {(0, 0, 0)}^{T}$ . The weights $α$ in Equation (28) are defined by $α = [1, e^{- 0.2}, 0.2]$ in the following simulations.

In a real flight of a quadrotor helicopter, there are always some system parameters needing correction. Moreover, external disturbances are other unpredictable elements. We need to utilize the adaptive controller to overcome the effects of these factors, and two scenarios are considered: a quadrotor with inaccurate system parameters, and trajectory tracking for a quadrotor with external disturbances.

In Scenario 1, a quadrotor with inaccurate mass m and moments of inertia $I$ is considered, and the meta-parameter learning frameworks CrKR and We-CrKR are used to predict the correct system parameters. The flying vehicle starts from hover and moves to a desired position. The planned trajectory is given by:

$ξ_{c} = (υ_{x} t, υ_{y} t, υ_{h} t)$ (30)

$υ = (1,1,1)$ (31)

where υ denotes the desired velocity. The yaw angle command $ψ_{c}$ is fixed at zero during a flight. In order to track the desired trajectory, we define the temporal reward function by Equation (28). In a flying episode, the accumulated return is presented by Equation (29). The real mass and inertial moment of the vehicle are presented by:

$m = (0.468 + 0.1) \, \mathrm{kg}$ (32)

$I = [\begin{matrix} I_{x} + 5 \times 10^{- 4} & 0 & 0 \\ 0 & I_{y} + 5 \times 10^{- 4} & 0 \\ 0 & 0 & I_{z} + 5 \times 10^{- 4} \end{matrix}]$ (33)

while the initial system parameters are presented in Table 1. Therefore, two kernel-based regression algorithms are utilized to adjust the system parameters in the controller for better performance during flight.
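The target that the learning algorithms are expected to approach can be worked out from Table 1 and Eqs. (32) and (33). The short sketch below only performs this arithmetic; the variable names are hypothetical.

```python
import numpy as np

# Nominal controller parameters from Table 1, ordered as in Eq. (24):
# [m, I_x, I_y, I_z].
gamma_nominal = np.array([0.468, 4.856e-3, 4.856e-3, 8.801e-3])

# Offsets of the real vehicle given by Eqs. (32) and (33).
gamma_offset = np.array([0.1, 5e-4, 5e-4, 5e-4])

gamma_real = gamma_nominal + gamma_offset
# Relative error of the nominal values with respect to the real ones.
relative_error = (gamma_nominal - gamma_real) / gamma_real
```

The real mass is therefore 0.568 kg, so the nominal controller underestimates the mass by roughly 18 per cent at the start of learning.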

Figure 4 describes the performance of the two algorithms. The We-CrKR algorithm explores in the weight space and converges more quickly than CrKR. Moreover, We-CrKR achieves a higher reward with smaller variance after 500 episodes. The trajectory provided by the original controller is presented by the green curve in Figure 5, and there is a relatively large gap from the expected trajectory. CrKR and We-CrKR can both eventually provide good flight performance with the meta-parameters described in Table 2. In addition, the results of We-CrKR are closer to the real system parameters than those of CrKR.

Table 2.

Improved meta-parameters by CrKR and We-CrKR

	m	I_x	I_y	I_z
CrKR	0.5462	2.1287×10⁻⁴	2.5901×10⁻⁴	1.5189×10⁻⁴
We-CrKR	0.5573	6.7773×10⁻⁴	6.1669×10⁻⁴	2.5475×10⁻⁴

Figure 4.

Rewards of flight with CrKR and We-CrKR for inaccurate system parameters

Figure 5.

Trajectories of flight with CrKR, We-CrKR and the original controller for inaccurate system parameters

For online learning problems, real flight data cannot provide sufficient exploration in the parameter space for direct model-free policy learning. Model-based policy search is a solution: with supervised learning techniques, a simulation model of the vehicle can be learned from the flight data, and the policy is then updated on this simulation model with CrKR or We-CrKR in order to obtain improved meta-parameters. The details of model-based policy search are described in reference [21].

In Scenario 2, adaptive elements $a$ are introduced to compensate for the effects of external disturbances. External disturbances are typically caused by wind, such as a constant wind, gusts and a buffeting wind [14, 15, 17]. The wind is treated as part of an unobserved environment; therefore, we formulate its effects by:

$\bar{\dot{ϕ}} = \dot{ϕ} + δ_{ϕ}$ (34)

$\bar{\dot{θ}} = \dot{θ} + δ_{θ}$ (35)

$\bar{\dot{z}} = \dot{z} + δ_{z}$ (36)

where $δ_{i} (i = ϕ, θ, z)$ denotes the effects of wind disturbance. The adaptive elements $a$ are added to the control parameters $K_{i} (i = d, p, i 1, i 2)$ , and Equations (II) and (X) can be rewritten as:

$U = \ddot{ξ} = {\ddot{ξ}}_{c} + (K_{d} + a_{d}) ({\dot{ξ}}_{c} - \dot{ξ}) + (K_{p} + a_{p}) (ξ_{c} - ξ)$ (37)

$\begin{array}{l} τ_{B} = & (Ω \times I Ω) + G_{a} - I \bar{T} Ω_{c} + I \bar{T} W^{- 1} X_{2} \\ & - I \bar{T} W^{- 1} (K_{i 1} + a_{i 1}) (Θ - Θ_{c}) \\ & - I W^{T} (Θ - Θ_{c} - ε) - (I K_{i 2} + a_{i 2}) (Ω - Ω_{c}) \end{array}$ (38)

Adaptive elements $a_{i} (i = d, p, i 1, i 2)$ are approximated by radial basis functions, which can be expressed by Equations (26) and (27). The tracking errors $e$ and the angular velocity $Ω$ are chosen to construct the state-space variables $s$ , and $a_{i}$ can be expressed by:

$a_{d} = ϕ_{d} {(\dot{ξ} - {\dot{ξ}}_{c})}^{T} ω_{d}$ (39)

$a_{p} = ϕ_{p} {(ξ - ξ_{c})}^{T} ω_{p}$ (40)

$a_{i 1} = ϕ_{i 1} {(Θ - Θ_{c})}^{T} ω_{i 1}$ (41)

$a_{i 2} = ϕ_{i 2} {(Ω)}^{T} ω_{i 2}$ (42)

For constant wind, we define $δ_{ϕ} = 1 \, \mathrm{rad/s}$ , $δ_{θ} = 1 \, \mathrm{rad/s}$ and $δ_{z} = 1 \, \mathrm{m/s}$ for the simulation. The two policy-search algorithms RWR and PoWER are adopted to improve the flight performance by adjusting the adaptive elements. An episode-based case is considered: the total flight time equals $20 \, \mathrm{s}$ , the initial states are static and the planned trajectory is given by Equations (30) and (31).

In Figure 6, the results of PoWER and RWR are compared with each other. The red curve denotes the rewards for all roll-outs provided by PoWER, while the blue one presents the performance of RWR. The results are averaged over 10 episodes; the lines show the median values and the error bars indicate the standard deviation. PoWER clearly performs better than RWR: its final average reward reaches about 5.8, whereas RWR only reaches 5.45.

Figure 6.

Rewards of flight with PoWER and RWR for constant wind

Figure 7 describes the position trajectories for all three controllers. The red curve denotes the trajectory produced by the adaptive controller improved by the reinforcement-learning framework of PoWER. This adaptive controller cancels the effects of constant wind on the quadrotor, and only a tiny tracking error remains. With trained parameters supplied by RWR, the adaptive controller decreases the tracking error compared with that of the original controller, but the error is larger than that of PoWER's controller.

Figure 7.

Trajectories of flight with PoWER, RWR and the original controller for constant wind

Wind gusts are commonly treated as turbulent flows. The direction and strength of turbulence vary irregularly, and we use system noise to represent the effect of turbulence. Thus, we define $δ_{i}$ by $δ_{ϕ} \sim N (0,1)$ , $δ_{θ} \sim N (0,1)$ and $δ_{z} \sim N (0,1)$ .

Figure 8 describes the rewards of the two algorithms. Because of the turbulence, both algorithms have large variances. After approximately 3,000 episodes, PoWER has converged and yields a high performance with a reward of 5.9. RWR keeps improving throughout all episodes, but finally yields a performance with a reward of 5.8. Without the adaptive elements, the original controller cannot remove the effects of turbulence and a tracking error remains. Moreover, a smaller tracking error seems to be achieved by using PoWER (see Figure 9).

Figure 8.

Rewards of flight with PoWER and RWR for wind gusts

Figure 9.

Trajectories of flight with PoWER, RWR and the original controller for wind gusts

In the last case, a buffeting wind that changes periodically is considered, and the effects of the wind $δ_{i}$ are defined by:

$\begin{array}{rl} δ_{ϕ, θ} = & 0.1 (10 + 5 \sin (2 π t)) \, \mathrm{rad/s} \\ δ_{z} = & 0.01 (10 + 5 \sin (2 π t)) \, \mathrm{m/s} \end{array}$ (43)

which are also considered in Reference [17].
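The periodic disturbance of Eq. (43) is simple to evaluate, as the following sketch shows. The function name and the dictionary keys are hypothetical.

```python
import math

def buffeting_wind(t):
    """Wind disturbances of Eq. (43) at time t (in seconds).

    Returns the additive rate disturbances for roll and pitch (rad/s)
    and for the vertical velocity (m/s).
    """
    gust = 10.0 + 5.0 * math.sin(2.0 * math.pi * t)
    return {"phi": 0.1 * gust, "theta": 0.1 * gust, "z": 0.01 * gust}

delta_start = buffeting_wind(0.0)     # mean level of the disturbance
delta_peak = buffeting_wind(0.25)     # quarter period, where the sine equals one
```

The disturbance oscillates with a period of one second around a mean of 1 rad/s for the angular rates and 0.1 m/s for the vertical velocity, with an amplitude of half the mean.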

With the same initial conditions as those of the previous simulations, PoWER converges after approximately 800 episodes, while RWR needs approximately 2,000 episodes to converge. In addition, PoWER achieves a higher reward than RWR (see Figure 10(a)). The position trajectories for PoWER and RWR are shown in Figure 10(b). Figures 10(c)–(h) show the Euler angles, thrust and torques for PoWER and RWR; the actions and angles are reachable by a real quadrotor.

Figure 10.

Trajectories of states with PoWER and RWR for buffeting wind: (a) Rewards for PoWER and RWR; (b) Positions for PoWER and RWR; (c) Euler Angles for PoWER; (d) Euler Angles for RWR; (e) Thrust for PoWER; (f) Thrust for RWR; (g) Torques for PoWER; (h) Torques for RWR

7. Conclusion

In this paper, we combined a command-filtered non-linear controller with reinforcement-learning techniques in order to construct an adaptive trajectory-tracking control algorithm that compensates for inaccurate system parameters and external disturbances for a quadrotor helicopter. Within a reinforcement-learning framework, we separated the problem into two parts: meta-parameter prediction and adaptive policy learning. Based on CrKR and PoWER, a new kernel-based regression called We-CrKR was provided, with less variance in the initial phase. CrKR and We-CrKR were utilized for learning the meta-parameters of the quadrotor system, and better performance and less variance were obtained by using We-CrKR. Moreover, PoWER and RWR were introduced for learning the adaptive elements. Numerical simulations for the two problems were performed, and the results demonstrated the efficiency of the adaptive trajectory-tracking algorithm with reinforcement-learning techniques.

8. Acknowledgements

This work is supported by the China Postdoctoral Science Foundation Grant (Grant No.2014M560877) and the National Natural Science Foundation of China (No.61503010).

References

1. Holroyd C B, Coles M G. The neural basis of human error processing: reinforcement learning, dopamine, and the error-related negativity. Psychological Review. 2002; 109(4): 679–709. DOI: 10.1037//0033-295X.109.4.679.

2. Sutton R S, Barto A G, Williams R J. Reinforcement learning is direct adaptive optimal control. Control Systems, IEEE. 1992; 12(2): 19–22. DOI: 10.1109/37.126844.

3. Kim H J, Jordan M I, Sastry S, Ng A Y. Autonomous helicopter flight via reinforcement learning. In: Advances in Neural Information Processing Systems. 2003. DOI: 10.1007/978-1-4899-7502-7-16-1.

4. Tedrake R, Zhang T W, Seung H S. Stochastic policy gradient reinforcement learning on a simple 3D biped. In: Intelligent Robots and Systems, 2004. (IROS 2004). Proceedings. 2004 IEEE/RSJ International Conference on; 28 Sept.–2 Oct. 2004; Sendai, Japan; IEEE 2004; vol. 3, pp. 2849–2854. DOI: 10.1109/IROS.2004.1389841.

5. Busoniu L, Babuska R, De Schutter B, Ernst D. Reinforcement learning and dynamic programming using function approximators, vol. 39. CRC Press; 2010.

6. Sutton R S, Barto A G. Reinforcement learning: An introduction. Cambridge: MIT Press; 1998.

7. Bagnell J A, Schneider J G. Autonomous helicopter control using reinforcement learning policy search methods. In: Robotics and Automation, 2001. Proceedings 2001 ICRA. IEEE International Conference; 21–26 May, 2001; Seoul, Korea; IEEE 2001. pp. 1615–1620. DOI: 10.1109/ROBOT.2001.932842.

8. Jimenez A R. Policy search approaches to reinforcement learning for quadruped locomotion [thesis]. Leslie P Kaelbling and Paul A DeBitetto: Massachusetts Institute of Technology; 2006.

9. Ijspeert A J, Nakanishi J, Schaal S. Learning attractor landscapes for learning motor primitives. In: Advances in Neural Information Processing Systems. 2003; 1523–1530. DOI: 10.1016/j.neunet.2011.02.004.

10. Kober J, Peters J. Reinforcement learning in robotics: A survey. International Journal of Robotics Research. 2012; 32(11): 1238–1274. DOI: 10.1177/0278364913495721.

11. Peters J, Schaal S. Learning to control in operational space. The International Journal of Robotics Research. 2008; 27(2): 197–212. DOI: 10.1177/0278364907087548.

12. Kober J, Wilhelm A, Oztop E, Peters J. Reinforcement learning to adjust parametrized motor primitives to new situations. Autonomous Robots. 2012; 33(4): 361–379. DOI: 10.1007/s10514-012-9290-3.

13.

Bouabdallah

Noth

Siegwart

. PID vs LQ control techniques applied to an indoor micro quadrotor. In: Intelligent Robots and Systems, 2004. Proceedings. 2004 IEEE/RSJ International Conference(IROS 2004); 28 Sep.–2 Oct., 2004; Sendal, Japan; IEEE, 2001. pp. 2451–2456: DOI: 10.1109/IROS. 2004. 1389776.

14.

Bouabdallah

Siegwart

. Backstepping and sliding-mode techniques applied to an indoor micro quadrotor. In: Robotics and Automation, 2005. (ICRA 2005). Proceedings of the 2005 IEEE International Conference; April, 2005; Barcelona, Spain; IEEE, 2005. pp. 2247–2252: DOI:10.1109/ROBOT.2005.1570447.

15.

Madani

Benallegue

. Backstepping sliding mode control applied to a miniature quadrotor flying robot. In: IEEE Industrial Electronics, IECON 2006–32nd Annual Conference on 6–10 Nov., 2006; Paris, France; IEEE 2006; pp. 700–705: DOI:10.1109/IECON.2006.347236.

16.

Madani

Benallegue

. Control of a quadrotor mini-helicopter via full state backstepping technique. In: Decision and Control, 2006 45th IEEE Conference on 13–15 Dec., 2006; San Diego, USA; IEEE 2006; pp. 1515–1520: DOI:10.1109/CDC.2006.377548.

17.

Nicol

Macnab

Ramirez-Serrano

. Robust neural network control of a quadrotor helicopter. In: Electrical and Computer Engineering, 2008. CCECE 2008. Canadian Conference on 4–7 May, 2008; Niagara Falls, ON; IEEE 2008; pp. 1233–1237: DOI: 10.1109/CCECE.2008.4564736.

18.

Raimúndez

Villaverde

A F

. Adaptive tracking control for a quadrotor. In: Sixth EUROMECH Nonlinear Dynamics Conference (ENOC 2008); 30 June–4 July, 2008; Saint Petersburg, Russia.

19.

Kober

Peters

. Policy search for motor primitives in robotics. Advances in Neural Information Processing Systems. 2011; 84(1–2): 171–203: DOI: 10.1007/s10994-010-5223-6.

20.

Kober

Oztop

Peters

. Reinforcement learning to adjust robot movements to new situations. Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI 11); 16–22 July, 2011; Barcelona, Spain; IJCAI 2011. pp. 2650–2655: DOI: 10.1.1.165.8302.

21.

Deisenroth

M P

Neumann

Peters

. A survey on policy search for robotics. Foundations and Trends in Robotics. 2013; 2(1–2): 1–142:DOI: 10.1561/2300000021.

22.

Lawrence

Cowan

Russell

. Efficient gradient estimation for motor control learning. In: Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence; 7–10 Aug., 2003; Acapulco, Mexico; pp. 354–361: DOI:10.1.1.68.5139.

23.

Zuo

. Trajectory tracking control design with command-filtered compensation for a quadrotor. Control Theory & Applications, IET. 2010;4(11): 2343–2355: DOI:10.1049/iet-cta.2009.0336.

24.

Tayebi

McGilvray

. Attitude stabilization of a four-rotor aerial robot. In: Decision and Control, 2004. CDC. 43rd IEEE Conference on 14–17 Dec., 2004; Atlanta, USA; IEEE 2004; vol. 2, pp. 1216–1221: DOI:10.1109/CDC.2004.1430207.