FastTD3: Simple, Fast, and Capable Reinforcement Learning for Humanoid Control

UC Berkeley

FastTD3 is a simple, fast, and capable off-policy RL algorithm for humanoid control.


Abstract

Reinforcement learning (RL) has driven significant progress in robotics, but its complexity and long training times remain major bottlenecks. In this report, we introduce FastTD3, a simple, fast, and capable RL algorithm that significantly speeds up training for humanoid robots in popular benchmarks such as HumanoidBench, IsaacLab, and MuJoCo Playground.

Our recipe is remarkably simple: by training a standard off-policy TD3 agent with parallel simulation, large-batch updates, and carefully tuned hyperparameters, FastTD3 solves a range of HumanoidBench tasks in under 3 hours on a single GPU. We also provide a lightweight and easy-to-use implementation of FastTD3 to accelerate RL research in robotics.
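To make the recipe concrete, below is a minimal sketch of the kind of configuration such a training setup relies on. The field names and values are illustrative assumptions, not the exact settings from the report; please see the technical report and codebase for the tuned hyperparameters.

```python
# Illustrative FastTD3-style training configuration (values are assumptions).
from dataclasses import dataclass

@dataclass
class FastTD3Config:
    num_envs: int = 1024            # parallel simulation environments
    batch_size: int = 32768         # large-batch critic/actor updates
    replay_buffer_size: int = 1_000_000
    actor_lr: float = 3e-4
    critic_lr: float = 3e-4
    gamma: float = 0.99             # discount factor
    tau: float = 0.005              # target-network soft-update coefficient
    policy_delay: int = 2           # delayed policy updates, as in TD3
    exploration_noise: float = 0.1  # std of action noise during data collection
    target_policy_noise: float = 0.2
    noise_clip: float = 0.5

config = FastTD3Config()
```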


FastTD3: Simple, Fast, and Capable RL for Humanoid Control

FastTD3 is a high-performance variant of the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm, optimized for complex robotics tasks. These optimizations build on the observations of Parallel Q-Learning (PQL), which found that parallel simulation, large batch sizes, and a distributional critic are important for scaling up off-policy RL algorithms.
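The sketch below shows the standard TD3 update that sits at the core of this recipe, applied to a large batch: twin critics trained against a clipped-double-Q target with target policy smoothing, plus delayed actor and soft target updates. Network sizes, tensor shapes, and hyperparameter values are assumptions for illustration; the actual implementation (parallel environments, replay buffer, distributional critic) lives in the official codebase.

```python
# Minimal PyTorch sketch of a TD3-style large-batch update (illustrative, not the official code).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim, batch_size = 64, 19, 32768  # illustrative sizes

actor = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, act_dim), nn.Tanh())

def make_critic():
    # Each critic maps (obs, action) to a scalar Q-value.
    return nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(), nn.Linear(256, 1))

critic1, critic2 = make_critic(), make_critic()
actor_target = copy.deepcopy(actor)
critic1_target, critic2_target = copy.deepcopy(critic1), copy.deepcopy(critic2)

actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(list(critic1.parameters()) + list(critic2.parameters()), lr=3e-4)

gamma, tau, policy_noise, noise_clip = 0.99, 0.005, 0.2, 0.5

def td3_update(obs, act, rew, next_obs, done, step, policy_delay=2):
    # Target action with clipped noise (target policy smoothing).
    with torch.no_grad():
        noise = (torch.randn_like(act) * policy_noise).clamp(-noise_clip, noise_clip)
        next_act = (actor_target(next_obs) + noise).clamp(-1.0, 1.0)
        q1_t = critic1_target(torch.cat([next_obs, next_act], dim=-1))
        q2_t = critic2_target(torch.cat([next_obs, next_act], dim=-1))
        target_q = rew + gamma * (1.0 - done) * torch.min(q1_t, q2_t)  # clipped double-Q target

    # Critic update on a large batch sampled from the replay buffer.
    q1 = critic1(torch.cat([obs, act], dim=-1))
    q2 = critic2(torch.cat([obs, act], dim=-1))
    critic_loss = F.mse_loss(q1, target_q) + F.mse_loss(q2, target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Delayed actor update and soft (Polyak) target updates.
    if step % policy_delay == 0:
        actor_loss = -critic1(torch.cat([obs, actor(obs)], dim=-1)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()
        for net, target in [(actor, actor_target), (critic1, critic1_target), (critic2, critic2_target)]:
            for p, p_t in zip(net.parameters(), target.parameters()):
                p_t.data.lerp_(p.data, tau)

# Example: one update step on a dummy large batch.
obs = torch.randn(batch_size, obs_dim)
act = torch.rand(batch_size, act_dim) * 2 - 1
rew = torch.randn(batch_size, 1)
next_obs = torch.randn(batch_size, obs_dim)
done = torch.zeros(batch_size, 1)
td3_update(obs, act, rew, next_obs, done, step=0)
```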

Please check our technical report for the various design choices we made and their effects!


Results: MuJoCo Playground

FastTD3 solves a range of tasks in MuJoCo Playground, achieving wall-clock training times similar to or faster than PPO.


Results: HumanoidBench

FastTD3 solves various tasks in HumanoidBench in under 3 hours on a single GPU, where prior algorithms struggle even after tens of hours of training. However, the resulting behaviors are quite unnatural due to limitations of the reward design.


Sim-to-real Transfer with FastTD3

We also show that FastTD3 can be used for sim-to-real transfer. We used the remarkably fast and convenient MuJoCo Playground to train a Booster T1 policy in simulation and transferred it to a real Booster T1 robot using Booster Gym.
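For deployment, the trained actor is typically queried deterministically (no exploration noise) at a fixed control frequency. The sketch below illustrates this pattern; the `get_observation` and `send_action` callbacks and the control rate are hypothetical placeholders, and the actual sim-to-real pipeline uses Booster Gym.

```python
# Hedged sketch of running a trained actor on hardware at a fixed control rate.
import time
import torch

@torch.no_grad()
def control_loop(actor, get_observation, send_action, control_hz=50.0, device="cpu"):
    """Query the actor at `control_hz` and stream actions to the robot."""
    period = 1.0 / control_hz
    actor.eval()
    while True:
        start = time.monotonic()
        obs = torch.as_tensor(get_observation(), dtype=torch.float32, device=device)
        action = actor(obs.unsqueeze(0)).squeeze(0)  # deterministic action, no exploration noise
        send_action(action.cpu().numpy())
        # Sleep the remainder of the control period to keep a steady rate.
        time.sleep(max(0.0, period - (time.monotonic() - start)))
```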

Simulation Rollout

Real Robot Rollout with FastTD3


Different RL algorithms may need different reward functions

In MuJoCo Playground, we made an interesting observation: FastTD3 trains policies with a notably different gait than PPO, even when we use the same reward function for both algorithms. We hypothesize that this is because existing reward functions are typically tuned for PPO, and therefore different algorithms may need different reward functions. We thus tuned the reward function to induce a desirable gait for FastTD3 -- with stronger penalty terms and a pose constraint.

PPO with PPO reward

FastTD3 with PPO reward

FastTD3 with FastTD3 reward

As shown in the video, with this tuned reward we were able to train a natural-looking gait with FastTD3, even though the resulting episode returns are the same for both policies. This observation suggests that the typical metric the research community uses -- episode return -- may not fully reflect the usefulness of each RL algorithm.
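To illustrate the kind of reward tuning described above: on top of a velocity-tracking task term, heavier penalty terms and a pose constraint discourage jerky, unnatural gaits. All term names and weights below are illustrative assumptions, not the actual MuJoCo Playground reward.

```python
# Hedged sketch of reward shaping with stronger penalties and a pose constraint.
import numpy as np

def shaped_reward(lin_vel, target_vel, joint_pos, default_pose, action, prev_action,
                  w_track=1.0, w_action_rate=0.5, w_pose=0.5, w_energy=0.01):
    # Task term: track the commanded base velocity.
    tracking = np.exp(-np.sum((lin_vel - target_vel) ** 2) / 0.25)
    # Penalty terms, weighted more heavily than in a PPO-tuned reward.
    action_rate = np.sum((action - prev_action) ** 2)   # discourage jerky actions
    pose_dev = np.sum((joint_pos - default_pose) ** 2)  # pose constraint toward a nominal posture
    energy = np.sum(action ** 2)                        # discourage large torques
    return w_track * tracking - w_action_rate * action_rate - w_pose * pose_dev - w_energy * energy
```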


Get Started with FastTD3 Codebase

We provide an easy-to-use implementation of FastTD3 that supports various user-friendly features to help you get started with RL research in robotics. Please check out the GitHub repository for more details.


BibTeX

@article{seo2025fasttd3,
  author    = {Seo, Younggyo and Sferrazza, Carmelo and Geng, Haoran and Nauman, Michal and Yin, Zhao-Heng and Abbeel, Pieter},
  title     = {FastTD3: Simple, Fast, and Capable Reinforcement Learning for Humanoid Control},
  journal   = {arXiv preprint arXiv:2505.22642},
  year      = {2025},
}