In this post I will develop the following main components for the Robotis Mini:

  1. A parametrized walking trajectory generator
  2. An interface to receive walking parameters and reset the Gazebo simulation (see the sketch after this list)
  3. A program that explores the space of walking parameters of the robot by setting them and observing the “quality” of the simulated execution
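As a rough sketch of what the second component looks like on the ROS side: the walking parameters can be read from the parameter server (the parameter names below are placeholders, not the ones I actually expose) and the simulation can be put back to its initial state through the standard gazebo_ros reset service:

```python
#!/usr/bin/env python
import rospy
from std_srvs.srv import Empty

rospy.init_node("walk_interface_sketch")

# Illustrative walking parameters pushed by the search program through
# the ROS parameter server (names and defaults are placeholders).
step_amplitude = rospy.get_param("~step_amplitude", 0.35)
step_frequency = rospy.get_param("~step_frequency", 1.0)

# Standard gazebo_ros service that resets the simulation, so every set
# of parameters is evaluated from the same start pose.
rospy.wait_for_service("/gazebo/reset_simulation")
reset_simulation = rospy.ServiceProxy("/gazebo/reset_simulation", Empty)
reset_simulation()
```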

The parametrization of the walking trajectory generator, and parts of the code, are based on the existing code for the Robotis Mini's big brother, the Darwin robot (you can learn more about it in this link). The idea is to generate the walking trajectory as a set of sinusoids, so the parameter space of the trajectory is the set of parameters that define these sinusoids. Based on that code, I defined a walking trajectory generator and exposed its parameters so they can be assigned through ROS calls.
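To make the "set of sinusoids" idea concrete, here is a minimal sketch (not the actual generator; the joint names, amplitudes, phases and offsets are invented for the example) in which every joint follows its own phase-shifted sinusoid:

```python
import math

# Hypothetical walking parameters: each joint follows
#   angle(t) = offset + amplitude * sin(2*pi*freq*t + phase)
WALK_PARAMS = {
    "l_hip_pitch":   {"amplitude": 0.35, "phase": 0.0,           "offset": 0.0},
    "r_hip_pitch":   {"amplitude": 0.35, "phase": math.pi,       "offset": 0.0},
    "l_ankle_pitch": {"amplitude": 0.20, "phase": 0.5,           "offset": 0.1},
    "r_ankle_pitch": {"amplitude": 0.20, "phase": 0.5 + math.pi, "offset": 0.1},
}
STEP_FREQUENCY = 1.0  # steps per second

def joint_targets(t, params=WALK_PARAMS, freq=STEP_FREQUENCY):
    """Return the target angle (rad) for every joint at time t (s)."""
    return {
        name: p["offset"]
        + p["amplitude"] * math.sin(2 * math.pi * freq * t + p["phase"])
        for name, p in params.items()
    }
```

The full vector of amplitudes, phases, offsets and the step frequency is the parameter space that the search program described next explores.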

Then, I created a program that sends walking parameters and observes the height of, and the distance traveled by, the robot's head during execution. These values are the reward associated with the walking parameters. As a first step, I simply grid-searched for an acceptable set of parameters that makes the robot move without falling. I found a good set, which I made the default at the beginning of the execution.
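The grid search itself is conceptually very simple. Here is a sketch, assuming a hypothetical `run_rollout(params)` helper that pushes the parameters, resets Gazebo, lets the robot walk for a fixed time and returns the head height and the distance traveled (neither the grid values nor the fall threshold below are the real ones):

```python
import itertools

def reward(head_height, distance):
    """Scalar quality of a rollout: travel far while keeping the head up."""
    if head_height < 0.15:  # hypothetical threshold: the robot fell over
        return 0.0
    return distance

def grid_search(run_rollout):
    """run_rollout(params) -> (head_height, distance) runs one simulated walk."""
    best_params, best_reward = None, float("-inf")
    # Illustrative grid over two of the sinusoid parameters.
    for amplitude, frequency in itertools.product([0.2, 0.3, 0.4],
                                                  [0.5, 1.0, 1.5]):
        params = {"step_amplitude": amplitude, "step_frequency": frequency}
        head_height, distance = run_rollout(params)
        r = reward(head_height, distance)
        if r > best_reward:
            best_params, best_reward = params, r
    return best_params, best_reward
```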

Finally, to close the loop, I tested this first set of parameters, found in simulation, on the real robot. The idea is to be able to switch seamlessly between simulation and the real robot. The execution of this first set of parameters, while not very graceful, really matches the simulation! Remember that this set of parameters is just the best-scoring one from a grid search; I have not yet implemented any real learning method to improve the search.

In the next post I will report on the steps to implement a real machine learning approach to learn the optimal set of walking parameters. The technique that I will be using is black box optimization (instead of reinforcement learning) because:

  1. It is easy to implement
  2. It requires only a single scalar value as the reward for an entire rollout (the execution of a trajectory)
  3. It explores in the parameter space of the actions, not in the outcome space of the policy
  4. It explores new parameters by adding noise to the parameters that stays constant (with respect to time) along the entire trajectory (see the sketch below)
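To illustrate what points 2 and 4 mean in practice, here is a minimal sketch of the exploration step (the parameter vector and noise level are placeholders): the parameter vector is perturbed once, the whole rollout is executed with the perturbed values, and a single scalar reward is collected at the end.

```python
import numpy as np

def explore(theta, sigma=0.05, rng=None):
    """Perturb the walking-parameter vector once; the same perturbation
    stays fixed (constant w.r.t. time) for the entire rollout."""
    rng = rng or np.random.default_rng()
    return theta + rng.normal(0.0, sigma, size=theta.shape)

# Usage: theta holds the current best sinusoid parameters (placeholder values).
theta = np.array([0.35, 1.0, 0.1, 0.0])
theta_trial = explore(theta)
# Execute one full rollout with theta_trial, read back a single scalar
# reward, and update theta from the collected (theta_trial, reward) pairs.
```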