Coarse-to-fine Q-Network with Action Sequence for Data-Efficient Robot Learning

UC Berkeley

Abstract

Predicting a sequence of actions has been crucial to the success of recent behavior cloning algorithms in robotics. Can similar ideas improve reinforcement learning (RL)? We answer affirmatively by observing that incorporating action sequences when predicting ground-truth return-to-go leads to lower validation loss. Motivated by this, we introduce Coarse-to-fine Q-Network with Action Sequence (CQN-AS), a novel value-based RL algorithm that learns a critic network that outputs Q-values over a sequence of actions, i.e., it explicitly trains the value function to learn the consequences of executing action sequences. Our experiments show that CQN-AS outperforms several baselines on a variety of sparse-reward humanoid control and tabletop manipulation tasks from BiGym and RLBench.


Value Learning with Action Sequences

Recent behavior cloning algorithms in robotics, such as Action Chunking Transformer (ACT) and Diffusion Policy, predict a sequence of actions. Can we use a similar idea to improve RL? Our hypothesis is that action sequences can help value learning: because an action sequence can correspond to a behavioral primitive such as going straight, it is easier for the model to learn long-term outcomes from action sequences than to evaluate the effect of individual single-step actions.

  • (a) To investigate this hypothesis, we train a return-to-go prediction model with different action sequence lengths (a minimal sketch of such a probe follows this list). We find that using an action sequence of length 50 results in lower validation loss than using a single-step action.
  • (b) Inspired by this, we train actor-critic algorithms with action sequences. However, we find that SAC and TD3 with action sequences suffer from severe value overestimation on the stand task from HumanoidBench (Sferrazza et al., 2024).
  • (c) This leads to unstable training and failure to solve the task. These findings motivate us to design CQN-AS, a novel RL algorithm that incorporates action sequences for value learning while avoiding the value overestimation problem.
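The following is a minimal PyTorch sketch of the kind of return-to-go probe described in (a): a small MLP that predicts return-to-go from an observation and a flattened action sequence, trained with a mean-squared-error objective. The class name, network sizes, and input dimensions are illustrative assumptions, not the exact model used in our experiments.

```python
# Illustrative return-to-go probe; names, shapes, and sizes are assumptions.
import torch
import torch.nn as nn


class ReturnToGoPredictor(nn.Module):
    """Predict return-to-go from an observation and an action sequence."""

    def __init__(self, obs_dim: int, action_dim: int, seq_len: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + action_dim * seq_len, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # scalar return-to-go
        )

    def forward(self, obs: torch.Tensor, action_seq: torch.Tensor) -> torch.Tensor:
        # obs: (B, obs_dim), action_seq: (B, seq_len, action_dim)
        flat_actions = action_seq.flatten(start_dim=1)
        return self.net(torch.cat([obs, flat_actions], dim=-1))


# Compare sequence lengths (e.g., 1 vs. 50) by tracking validation MSE.
model = ReturnToGoPredictor(obs_dim=64, action_dim=8, seq_len=50)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)


def training_step(obs, action_seq, target_rtg):
    """One gradient step on the return-to-go regression loss."""
    loss = nn.functional.mse_loss(model(obs, action_seq), target_rtg)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Training the same probe with seq_len=1 and seq_len=50 and comparing their validation losses mirrors the comparison summarized in (a).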

Method Overview: Coarse-to-fine Q-Network with Action Sequence

We build our algorithm upon Coarse-to-fine Q-Network (CQN), a recent critic-only RL algorithm that solves continuous control tasks with discrete actions. (a) In the CQN framework, we train RL agents to zoom into the continuous action space by iterating two procedures: (i) discretizing the continuous action space into B bins and (ii) selecting the bin with the highest Q-value, which is further discretized at the next level. We then use the last level's action for controlling robots. CQN-AS extends this idea to action sequences by computing actions for all sequence steps k ∈ [1, ..., K] in parallel. (b) We train a critic network to output Q-values over a sequence of actions. Our architecture first obtains features for each sequence step, aggregates the features from multiple sequence steps with a recurrent network, and then projects these outputs into Q-values.
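For concreteness, below is a minimal PyTorch sketch of a critic that outputs Q-values over an action sequence and of the coarse-to-fine action selection it enables. The names (ActionSequenceCritic, coarse_to_fine_action), the single network shared across levels, and the GRU aggregator are simplifying assumptions for illustration; the exact architecture in the paper may differ.

```python
# Illustrative sketch of a coarse-to-fine critic over action sequences.
# Names, level/bin counts, and the shared-network simplification are assumptions.
import torch
import torch.nn as nn


class ActionSequenceCritic(nn.Module):
    """Q-values over B bins for each of K sequence steps and D action dimensions."""

    def __init__(self, obs_dim: int, action_dim: int, seq_len: int,
                 num_bins: int = 5, hidden: int = 256):
        super().__init__()
        self.seq_len, self.action_dim, self.num_bins = seq_len, action_dim, num_bins
        # Per-step features, conditioned on the observation and the current
        # (coarse) action guess for that sequence step.
        self.step_encoder = nn.Sequential(
            nn.Linear(obs_dim + action_dim, hidden), nn.ReLU())
        # Aggregate features across sequence steps with a recurrent network.
        self.aggregator = nn.GRU(hidden, hidden, batch_first=True)
        # Project to Q-values: one value per (action dim, bin) at every step.
        self.q_head = nn.Linear(hidden, action_dim * num_bins)

    def forward(self, obs: torch.Tensor, action_seq: torch.Tensor) -> torch.Tensor:
        # obs: (B, obs_dim), action_seq: (B, K, action_dim)
        obs_tiled = obs.unsqueeze(1).expand(-1, self.seq_len, -1)
        feats = self.step_encoder(torch.cat([obs_tiled, action_seq], dim=-1))
        feats, _ = self.aggregator(feats)   # (B, K, hidden)
        q = self.q_head(feats)              # (B, K, action_dim * num_bins)
        return q.view(-1, self.seq_len, self.action_dim, self.num_bins)


@torch.no_grad()
def coarse_to_fine_action(critic: ActionSequenceCritic, obs: torch.Tensor,
                          num_levels: int = 3) -> torch.Tensor:
    """Zoom into [-1, 1] by repeatedly picking the bin with the highest Q-value."""
    batch = obs.shape[0]
    low = -torch.ones(batch, critic.seq_len, critic.action_dim)
    high = torch.ones(batch, critic.seq_len, critic.action_dim)
    for _ in range(num_levels):
        centers = (low + high) / 2          # previous level's action guess
        q = critic(obs, centers)            # (B, K, D, num_bins)
        best_bin = q.argmax(dim=-1)         # (B, K, D)
        width = (high - low) / critic.num_bins
        low = low + best_bin * width        # shrink the interval to the chosen bin
        high = low + width
    return (low + high) / 2                 # final action sequence
```

Because the critic scores all K steps in parallel, a single forward pass per level yields the full action sequence, which is then executed (or re-planned) on the robot.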

Experiments

Experimental Results: Overview

We study CQN-AS on 45 robotic tasks from BiGym and RLBench in a demo-driven RL setup. CQN-AS consistently outperforms various RL and BC baselines, such as CQN, DrQ-v2+ (a highly optimized variant of DrQ-v2), and Action Chunking Transformer (ACT). In particular, CQN-AS significantly outperforms other demo-driven RL baselines on humanoid control tasks, where value learning is more challenging.