Control of a Quadrotor with Reinforcement Learning

November 25, 2019, by Stanley Isaacs


In this video, we present the control of a quadrotor with reinforcement learning. We use reinforcement learning techniques to train our control policy.

Two neural networks are used during training, namely a value network and a policy network. The value network is a function representing how good a particular state is; we use this network to guide policy training. The policy network is what actually controls the quadrotor: given the full state, it directly outputs the thrust forces to be generated by each rotor.
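As a rough illustration, the two networks can be sketched as small multilayer perceptrons: one maps the state to a scalar value, the other maps the state to the four rotor thrusts. The state layout (rotation matrix, position, linear and angular velocity), the hidden-layer sizes, and the tanh activations below are assumptions made for this sketch, not necessarily the architecture used in the video.

```python
# Illustrative sketch only; state layout and layer sizes are assumptions.
import torch
import torch.nn as nn

STATE_DIM = 18   # assumed: rotation matrix (9) + position (3) + lin. vel. (3) + ang. vel. (3)
NUM_ROTORS = 4

class ValueNetwork(nn.Module):
    """Estimates how good a given state is (a scalar value)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, 1),
        )

    def forward(self, state):
        return self.net(state)

class PolicyNetwork(nn.Module):
    """Maps the full state directly to the four rotor thrusts."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, NUM_ROTORS),
        )

    def forward(self, state):
        return self.net(state)
```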
The two networks are updated at every learning iteration. Each learning iteration consists of an exploration, a value update, and a policy update.
During exploration, we mostly follow the current policy in order to evaluate the value of the visited states. However, if we only follow the current policy, we are unlikely to find a better one. Therefore, from time to time, we add noise to the action in order to observe the outcomes of actions that differ from those of the current policy.
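A minimal sketch of this exploration loop is shown below, assuming an environment with reset() and step() methods that return the next state and a cost; the noise scale and noise probability are also assumptions for illustration.

```python
# Sketch of the exploration phase; the env interface, noise scale,
# and noise probability are assumptions.
import numpy as np
import torch

def rollout(env, policy, horizon=500, noise_std=0.1, noise_prob=0.05):
    """Follow the current policy, occasionally perturbing the executed action with noise."""
    states, actions, costs, noisy_steps = [], [], [], []
    state = env.reset()
    for t in range(horizon):
        with torch.no_grad():
            action = policy(torch.as_tensor(state, dtype=torch.float32)).numpy()
        if np.random.rand() < noise_prob:
            # Deviate from the policy once to observe a different outcome.
            action = action + noise_std * np.random.randn(action.shape[0])
            noisy_steps.append(t)
        next_state, cost = env.step(action)
        states.append(state)
        actions.append(action)
        costs.append(cost)
        state = next_state
    return states, actions, costs, noisy_steps
```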
After obtaining data from exploration, we update the value function. The value function lets us interpolate the data so that we can estimate a value for any currently unvisited state.
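One simple way to realize such an update, sketched below, is to regress the value network onto the discounted cost-to-go observed along a collected trajectory; the discount factor, optimizer, and number of epochs are assumptions for illustration.

```python
# Sketch of a value update: regress the value network onto observed cost-to-go.
# The discount factor, learning rate, and number of epochs are assumptions.
import numpy as np
import torch

def update_value(value_net, states, costs, gamma=0.99, epochs=20, lr=1e-3):
    """Fit the value network to the discounted cost-to-go of visited states."""
    returns, running = [], 0.0
    for c in reversed(costs):                 # cost-to-go, computed backwards
        running = c + gamma * running
        returns.append(running)
    returns.reverse()

    states_t = torch.as_tensor(np.asarray(states), dtype=torch.float32)
    targets_t = torch.as_tensor(returns, dtype=torch.float32).unsqueeze(-1)

    optimizer = torch.optim.Adam(value_net.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(value_net(states_t), targets_t)
        loss.backward()
        optimizer.step()
    return loss.item()
```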
Every time we add noise, we accumulate a pair of trajectories: one that follows the current policy, and another where we add noise at the first time step. The two paths are evaluated according to two sources of information: the cost observed along the trajectory and the value of the state where it ends up. The policy is then updated towards the better of the two in each pair.
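The comparison and update can be sketched as follows: each path is scored by the discounted costs observed along a short rollout plus the learned value of the state it ends in, and the policy is nudged towards whichever first action scored better. The rollout_from helper, the discount factor, and the simple regression-style nudge towards the better action are assumptions for illustration, not the exact update used in the video.

```python
# Schematic sketch of the pairwise policy update. `rollout_from(state, first_action)`
# is an assumed helper that simulates a short trajectory starting from `state`,
# takes `first_action` first, follows the policy afterwards, and returns
# (list_of_costs, final_state).
import torch

def trajectory_score(costs, final_state, value_net, gamma=0.99):
    """Score = discounted costs observed along the path + value of the end state."""
    score = sum((gamma ** t) * c for t, c in enumerate(costs))
    score += (gamma ** len(costs)) * value_net(
        torch.as_tensor(final_state, dtype=torch.float32)).item()
    return score

def update_policy_on_pair(policy, optimizer, value_net, state,
                          policy_action, noisy_action, rollout_from):
    # Evaluate both branches: the on-policy one and the noisy one.
    costs_p, end_p = rollout_from(state, policy_action)
    costs_n, end_n = rollout_from(state, noisy_action)
    score_p = trajectory_score(costs_p, end_p, value_net)
    score_n = trajectory_score(costs_n, end_n, value_net)

    # Lower cost is better; regress the policy towards the better first action.
    target = noisy_action if score_n < score_p else policy_action
    target_t = torch.as_tensor(target, dtype=torch.float32)
    state_t = torch.as_tensor(state, dtype=torch.float32)

    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(policy(state_t), target_t)
    loss.backward()
    optimizer.step()
```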
Training is performed in simulation, where we can train the networks in a fast and safe way. In simulation, we place the quadrotor in a random configuration with a random velocity. The goal is to approach the red point as quickly as possible. We can see that, over more iterations, the performance visibly improves. Performance did not change significantly after 200 iterations, but we simply ran more iterations because training takes a relatively short time.
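For a task like this, the cost being minimized could be as simple as a weighted distance to the target plus small penalties on velocity and control effort; the particular terms and weights below are assumptions for illustration, not the exact cost used in the video.

```python
# Illustrative per-step cost for the "reach the red point" task;
# the terms and weights are assumptions.
import numpy as np

def step_cost(position, target, lin_vel, ang_vel, thrusts,
              w_pos=4.0, w_vel=0.05, w_angvel=0.05, w_thrust=0.01):
    cost = w_pos * np.linalg.norm(position - target)   # reach the target quickly
    cost += w_vel * np.linalg.norm(lin_vel)             # discourage excessive speed
    cost += w_angvel * np.linalg.norm(ang_vel)          # discourage spinning
    cost += w_thrust * np.linalg.norm(thrusts)          # penalize control effort
    return cost
```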
Without any parameter tuning after training in simulation, we deployed the trained model on a real quadrotor. Here, we simply demonstrate that we can give a position command to the quadrotor. The policy can also handle disturbances that it was not trained on.

The most notable feature of the policy is its stability. It can stabilize even from very challenging configurations. Here we demonstrate this by throwing the quadrotor with many different attitudes and velocities. It also works when it starts nearly upside down.