🦿 Training Locomotion Policies with RL#
Genesis supports parallel simulation, making it ideal for training reinforcement learning (RL) locomotion policies efficiently. In this tutorial, we will walk you through a complete training example for obtaining a basic locomotion policy that enables a Unitree Go2 robot to walk.
This simple, minimal example demonstrates a basic RL training pipeline in Genesis; by following it, you will quickly obtain a quadruped locomotion policy that is deployable to a real robot.
Note: This is NOT a comprehensive locomotion-policy training pipeline. It uses simplified reward terms to get you started easily and does not exploit Genesis's speed on large batch sizes, so it serves only basic demonstration purposes.
Acknowledgement: This tutorial is inspired by and builds on several core concepts from Legged Gym.
Environment Overview#
We start by creating a gym-style environment (go2-env).
Initialize#
The `__init__` function sets up the simulation environment with the following steps (a code sketch follows the list):
Control Frequency. The simulation runs at 50 Hz, matching the real robot's control frequency. To further bridge the sim2real gap, we also manually simulate the action latency (~20 ms, one dt) observed on the real robot.
Scene Creation. A simulation scene is created, including the robot and a static plane.
PD Controller Setup. Motors are first identified based on their names. Stiffness and damping are then set for each motor.
Reward Registration. Reward functions, defined in the configuration, are registered to guide the policy. These functions will be explained in the "Reward" section.
Buffer Initialization. Buffers are initialized to store environment states, observations, and rewards.
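To make these steps concrete, here is a minimal sketch of the scene and PD-controller setup using the public Genesis API (`gs.Scene`, `add_entity`, `set_dofs_kp`/`set_dofs_kv`). The URDF path, joint names, gain values, and environment count are illustrative placeholders rather than the exact values used in the example environment:

```python
import genesis as gs

gs.init()

# 50 Hz control: one policy step per simulation dt of 0.02 s.
scene = gs.Scene(
    sim_options=gs.options.SimOptions(dt=0.02),
    show_viewer=False,
)

# Static ground plane plus the Go2 robot (the URDF path here is a placeholder).
scene.add_entity(gs.morphs.Plane())
robot = scene.add_entity(
    gs.morphs.URDF(file="urdf/go2/urdf/go2.urdf", pos=(0.0, 0.0, 0.42))
)

scene.build(n_envs=4096)  # parallel environments simulated together

# PD controller: locate each motor DOF by joint name, then set gains.
joint_names = [
    "FR_hip_joint", "FR_thigh_joint", "FR_calf_joint",
    "FL_hip_joint", "FL_thigh_joint", "FL_calf_joint",
    "RR_hip_joint", "RR_thigh_joint", "RR_calf_joint",
    "RL_hip_joint", "RL_thigh_joint", "RL_calf_joint",
]
motor_dofs = [robot.get_joint(name).dof_idx_local for name in joint_names]
robot.set_dofs_kp([20.0] * 12, motor_dofs)  # stiffness (illustrative value)
robot.set_dofs_kv([0.5] * 12, motor_dofs)   # damping (illustrative value)
```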
Reset#
The `reset_idx` function resets the initial pose and state buffers of the specified environments. This ensures robots start from predefined configurations, which is crucial for consistent training.
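A hedged sketch of what such a reset might look like inside the environment class is shown below; the attribute names (`self.default_dof_pos`, `self.base_init_pos`, and so on) and the exact setter signatures are assumptions rather than the example's actual code:

```python
def reset_idx(self, envs_idx):
    # Restore the default joint configuration and base pose for the selected
    # environments (entity setters assumed from the Genesis rigid-entity API).
    n = len(envs_idx)
    self.robot.set_dofs_position(
        self.default_dof_pos.expand(n, -1), self.motor_dofs, envs_idx=envs_idx
    )
    self.robot.set_pos(self.base_init_pos.expand(n, -1), envs_idx=envs_idx)
    self.robot.set_quat(self.base_init_quat.expand(n, -1), envs_idx=envs_idx)

    # Clear per-episode buffers so stale history does not leak into new episodes.
    self.dof_vel[envs_idx] = 0.0
    self.last_actions[envs_idx] = 0.0
    self.episode_length_buf[envs_idx] = 0
```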
Step#
The `step` function takes an action, executes it, and returns new observations and rewards. Here is how it works (a simplified sketch follows the list):
Action Execution. The input action will be clipped, rescaled, and added on top of default motor positions. The transformed action, representing target joint positions, will then be sent to the robot controller for one-step execution.
State Updates. Robot states, such as joint positions and velocities, are retrieved and stored in buffers.
Termination Checks. Environments are terminated if (1) the episode length exceeds the maximum allowed, or (2) the robot's body orientation deviates significantly from upright. Terminated environments are reset automatically.
Reward Computation. The registered reward functions are evaluated and summed into the total reward.
Observation Computation. The observation used for training includes base angular velocity, projected gravity, commands, DOF positions, DOF velocities, and previous actions.
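Putting these pieces together, a simplified `step` sketch might look as follows. Attribute names such as `self.obs_scales`, `self.termination_pitch`, and `self.reward_functions` are illustrative, and the termination and observation details are condensed from the list above rather than copied from the example:

```python
import torch

def step(self, actions):
    # Clip the policy output; optionally execute the previous action to mimic
    # ~one dt of real-robot action latency, then rescale and offset by the
    # default joint positions to form the PD targets.
    self.actions[:] = torch.clip(actions, -self.clip_actions, self.clip_actions)
    exec_actions = self.last_actions if self.simulate_action_latency else self.actions
    target_dof_pos = exec_actions * self.action_scale + self.default_dof_pos

    # One 50 Hz control step: send PD targets, then advance the simulation.
    self.robot.control_dofs_position(target_dof_pos, self.motor_dofs)
    self.scene.step()

    # Refresh state buffers from the simulator (base pose / velocity buffers
    # such as base_ang_vel, projected_gravity, and base_euler are refreshed
    # similarly; omitted here for brevity).
    self.episode_length_buf += 1
    self.dof_pos[:] = self.robot.get_dofs_position(self.motor_dofs)
    self.dof_vel[:] = self.robot.get_dofs_velocity(self.motor_dofs)

    # Terminate on timeout or excessive body tilt, then reset those envs.
    self.reset_buf = (self.episode_length_buf > self.max_episode_length) | (
        torch.abs(self.base_euler[:, 1]) > self.termination_pitch
    )
    self.reset_idx(self.reset_buf.nonzero(as_tuple=False).flatten())

    # Sum the registered reward terms, each weighted by its configured scale.
    rew = torch.zeros((self.num_envs,), device=self.device)
    for name, fn in self.reward_functions.items():
        rew += fn() * self.reward_scales[name]

    # Book-keeping for the next step, then assemble the observation vector,
    # which includes the previous (just-executed) action.
    self.last_actions[:] = self.actions
    obs = torch.cat(
        [
            self.base_ang_vel * self.obs_scales["ang_vel"],
            self.projected_gravity,
            self.commands * self.commands_scale,
            (self.dof_pos - self.default_dof_pos) * self.obs_scales["dof_pos"],
            self.dof_vel * self.obs_scales["dof_vel"],
            self.last_actions,
        ],
        dim=-1,
    )
    return obs, rew, self.reset_buf, {}
```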
Reward#
Reward functions are critical for policy guidance. In this example, we use the following terms (illustrative implementations follow the list):
tracking_lin_vel: Tracking of linear velocity commands (xy axes)
tracking_ang_vel: Tracking of angular velocity commands (yaw)
lin_vel_z: Penalize z axis base linear velocity
action_rate: Penalize changes in actions
base_height: Penalize base height away from target
similar_to_default: Encourage the robot pose to be similar to the default pose
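The snippets below illustrate how such terms are typically written, following the Legged Gym convention of an exponential kernel for tracking terms and squared or absolute penalties elsewhere. They are methods of the environment class; the attribute names and `tracking_sigma` are assumptions rather than the example's exact configuration:

```python
import torch

def _reward_tracking_lin_vel(self):
    # Exponential kernel on the xy linear-velocity tracking error.
    lin_vel_error = torch.sum(
        torch.square(self.commands[:, :2] - self.base_lin_vel[:, :2]), dim=1
    )
    return torch.exp(-lin_vel_error / self.tracking_sigma)

def _reward_tracking_ang_vel(self):
    # Exponential kernel on the yaw-rate tracking error.
    ang_vel_error = torch.square(self.commands[:, 2] - self.base_ang_vel[:, 2])
    return torch.exp(-ang_vel_error / self.tracking_sigma)

def _reward_lin_vel_z(self):
    # Penalize vertical base velocity (scaled by a negative weight in the config).
    return torch.square(self.base_lin_vel[:, 2])

def _reward_action_rate(self):
    # Penalize large changes between consecutive actions.
    return torch.sum(torch.square(self.last_actions - self.actions), dim=1)

def _reward_base_height(self):
    # Penalize deviation of the base height from the target height.
    return torch.square(self.base_pos[:, 2] - self.base_height_target)

def _reward_similar_to_default(self):
    # Encourage joint positions close to the default standing pose.
    return torch.sum(torch.abs(self.dof_pos - self.default_dof_pos), dim=1)
```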
Training#
At this stage, we have defined the environments. Now, we use the PPO implementation from rsl-rl to train the policy. Follow these installation steps:
# Install rsl_rl.
git clone https://github.com/leggedrobotics/rsl_rl
cd rsl_rl && git checkout v1.0.2 && pip install -e .
# Install tensorboard.
pip install tensorboard
After installation, start training by running:
python examples/locomotion/go2_train.py
To monitor the training process, launch TensorBoard:
tensorboard --logdir logs
You should see a training curve similar to this:
Evaluation#
Finally, let's roll out the trained policy. Run the following command:
python examples/locomotion/go2_eval.py
You should see a GUI similar to this:
If you happen to have a real Unitree Go2 robot at hand, you can try deploying the policy. Have fun!