Predicting a sequence of actions has been crucial to the success of recent behavior cloning algorithms in robotics. Can similar ideas improve reinforcement learning (RL)? We answer affirmatively by observing that incorporating action sequences when predicting ground-truth return-to-go leads to lower validation loss. Motivated by this, we introduce Coarse-to-fine Q-Network with Action Sequence (CQN-AS), a novel value-based RL algorithm that learns a critic network outputting Q-values over a sequence of actions, i.e., it explicitly trains the value function to learn the consequences of executing action sequences. Our experiments show that CQN-AS outperforms several baselines on a variety of sparse-reward humanoid control and tabletop manipulation tasks from BiGym and RLBench.
Recent behavior cloning algorithms in robotics, such as Action Chunking Transformer (ACT) and Diffusion Policy, predict a sequence of actions. Can we use a similar idea to improve RL? Our hypothesis is that action sequences can help value learning: because an action sequence can correspond to a behavioral primitive such as going straight, it is easier for the model to learn its long-term outcome than to evaluate the effect of each individual single-step action.
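To make the motivating analysis concrete, the sketch below builds two small probes that regress ground-truth return-to-go from a state paired with either a single action or a K-step action sequence; the observation above is that the sequence-conditioned probe reaches lower validation loss. This is a hedged, minimal sketch: the class name `ReturnToGoProbe`, the dimensions, and the MLP architecture are illustrative assumptions, not the paper's exact setup.

```python
# Hypothetical probe for the motivating analysis: predict return-to-go from a state
# plus either one action or a K-step action sequence, then compare validation losses.
import torch
import torch.nn as nn

K, state_dim, action_dim = 8, 64, 7  # assumed sequence length and dimensions

class ReturnToGoProbe(nn.Module):
    def __init__(self, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + num_actions * action_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),  # scalar return-to-go prediction
        )

    def forward(self, state, actions):
        # state: (B, state_dim), actions: (B, num_actions, action_dim)
        x = torch.cat([state, actions.flatten(1)], dim=-1)
        return self.net(x).squeeze(-1)

single_step = ReturnToGoProbe(num_actions=1)  # conditions on a_t only
sequence = ReturnToGoProbe(num_actions=K)     # conditions on a_t, ..., a_{t+K-1}
# Both probes are fit with MSE loss to ground-truth return-to-go on held-out data.
```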
We build our algorithm upon Coarse-to-fine Q-Network (CQN), a recent critic-only RL algorithm that solves continuous control tasks with discrete actions. (a) In the CQN framework, we train RL agents to zoom into the continuous action space by iterating two procedures: (i) discretizing the continuous action space into B bins and (ii) finding the bin with the highest Q-value to further discretize at the next level. We then use the last level's actions to control the robot. CQN-AS extends this idea to action sequences by computing actions for all sequence steps k ∈ [1, ..., K] in parallel. (b) We train a critic network to output Q-values over a sequence of actions. We design our architecture to first obtain features for each sequence step, then aggregate the features across sequence steps with a recurrent network, and finally project these outputs into Q-values, as sketched below.
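The following is a minimal, simplified sketch of a CQN-AS-style critic and coarse-to-fine action selection, assuming a PyTorch setup. The names (`SequenceCritic`, `select_action_sequence`), the dimensions, and the exact conditioning scheme are illustrative assumptions; the actual architecture in the paper may differ.

```python
# Sketch of a critic that outputs Q-values over a K-step action sequence, plus
# coarse-to-fine (zoom-in) action selection over L levels with B bins per level.
import torch
import torch.nn as nn

K = 8      # action-sequence length (assumed)
A = 7      # action dimensions (assumed)
B = 5      # bins per action dimension at each level
L = 3      # number of coarse-to-fine levels
OBS = 128  # observation feature size (assumed)

class SequenceCritic(nn.Module):
    """Outputs Q-values over B bins for each of K sequence steps and A action dims."""
    def __init__(self):
        super().__init__()
        # Per-step features: condition on the observation, a step index, and the
        # current (coarse) action estimate for that step.
        self.step_feat = nn.Sequential(nn.Linear(OBS + A + K, 256), nn.ReLU())
        # Recurrent aggregation of features across the K sequence steps.
        self.rnn = nn.GRU(256, 256, batch_first=True)
        # Project aggregated features into per-dimension, per-bin Q-values.
        self.q_head = nn.Linear(256, A * B)

    def forward(self, obs, prev_actions):
        # obs: (batch, OBS), prev_actions: (batch, K, A) in [-1, 1]
        batch = obs.shape[0]
        step_id = torch.eye(K, device=obs.device).expand(batch, K, K)
        obs_k = obs.unsqueeze(1).expand(batch, K, OBS)
        h = self.step_feat(torch.cat([obs_k, prev_actions, step_id], dim=-1))
        h, _ = self.rnn(h)                          # aggregate over sequence steps
        return self.q_head(h).view(batch, K, A, B)  # Q-values per step/dim/bin

@torch.no_grad()
def select_action_sequence(critic, obs):
    """Coarse-to-fine action selection, computed in parallel for all K steps."""
    batch = obs.shape[0]
    low = -torch.ones(batch, K, A, device=obs.device)
    high = torch.ones(batch, K, A, device=obs.device)
    actions = torch.zeros(batch, K, A, device=obs.device)
    for _ in range(L):
        q = critic(obs, actions)          # (batch, K, A, B)
        best_bin = q.argmax(dim=-1)       # greedy bin per step and dimension
        width = (high - low) / B
        low = low + best_bin * width      # zoom into the chosen bin
        high = low + width
        actions = (low + high) / 2        # bin centers become the refined actions
    return actions                        # (batch, K, A) action sequence

critic = SequenceCritic()
obs = torch.randn(2, OBS)
action_seq = select_action_sequence(critic, obs)  # shape: (2, K, A)
```

Note that the coarse-to-fine loop iterates over levels, not sequence steps, so the actions for all K steps are refined together in each critic forward pass.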
We study CQN-AS on 45 robotic tasks from BiGym and RLBench in a demo-driven RL setup. CQN-AS consistently outperforms various RL and BC baselines such as CQN, DrQ-v2+ (a highly optimized variant of DrQ-v2), and Action Chunking Transformer (ACT). In particular, CQN-AS significantly outperforms other demo-driven RL baselines on humanoid control tasks, where value learning is more challenging.