Adapting Generalist Robot Policies with
Semantic Reinforcement Learning

Jagdeep Singh Bhatia, Andrew Wagenmaker, William Chen, Sergey Levine

University of California, Berkeley


SARL teaser: RL over semantic actions (language prompts).

TL;DR: We propose optimizing the prompt inputs of generalist robot policies with reinforcement learning. This enables efficient real-robot adaptation on complex, long-horizon tasks where existing methods for improving robot behavior in deployment struggle.

The Challenge

How can we adapt generalist policies at deployment time to go beyond their pre-training?

Vision-language-action (VLA) models learn broad skill repertoires from pretraining, giving us powerful priors over plausible robot behaviors. The standard strategy is to just prompt these models with new, desired task goals. However, as tasks scale in complexity and horizon, a successful policy must implicitly decompose goals into atomic, executable behaviors, and ground each behavior in a skill the robot can actually pull off in its current context. Often, directly prompting a VLA with a challenging goal fails.

Why prompting fails

Long-horizon prompts are out of support

Ask a VLA to "move the hammer to the plate, then grasp the mushroom" and its action distribution collapses into the wrong modes: it moves the mushroom to the plate and grabs the spoon instead.

Why a VLM isn't enough

VLMs decompose, but don't ground

A VLM can break the goal into sensible sub-instructions, but it has no idea which commands actually work on this robot. Saying "grasp the hammer" makes the VLA grab a spoon; only "move down and grasp" succeeds. Plausible ≠ grounded.

Why we need RL

RL grounds language in behavior

The only way to know which prompts actually make progress is to try them on the robot and learn from the result. SARL runs RL over the prompt space, learning a grounded mapping between semantic actions and physical behaviors, achieving both decomposition and grounding.

Semantic Action RL (SARL)

SARL runs RL over a VLA's language prompts, lifting learning from robot actions to the semantic level.

Rather than viewing a VLA as a policy to be statically prompted, we view it as a semantically controllable action prior that can be dynamically guided throughout deployment. This motivates a simple but powerful transformation of the RL problem: instead of learning over the robot action space \(\mathcal{A}_{\mathrm{robot}}\) (joint positions, end-effector deltas, etc.), we learn over a semantic action space \(\mathcal{A}_{\mathrm{sem}}\) — the space of language commands — and deploy the VLA as a transformation between the two. We call the resulting induced problem a semantic MDP.
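To make the semantic MDP more concrete, here is a minimal sketch of a wrapper whose step function consumes a prompt rather than a robot action. Everything in it is a hypothetical stand-in rather than the authors' code: the `SemanticMDP` class, the `steps_per_prompt` parameter (how long one prompt is held), and the toy 1-D world that replaces the real VLA and robot.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical types for illustration; not taken from the SARL codebase.
State = Tuple[int, ...]
RobotAction = int


@dataclass
class SemanticMDP:
    """One step takes a language prompt; the prompt-conditioned policy
    translates it into low-level robot actions."""
    env_step: Callable[[State, RobotAction], Tuple[State, float, bool]]
    vla: Callable[[State, str], RobotAction]
    steps_per_prompt: int = 5  # assumed: number of robot steps per prompt

    def step(self, state: State, prompt: str) -> Tuple[State, float, bool]:
        total_reward, done = 0.0, False
        for _ in range(self.steps_per_prompt):
            action = self.vla(state, prompt)  # semantic action -> robot action
            state, reward, done = self.env_step(state, action)
            total_reward += reward
            if done:
                break
        return state, total_reward, done


def toy_vla(state: State, prompt: str) -> RobotAction:
    """Toy stand-in for a VLA: only two prompts map to useful motion."""
    return {"move right": 1, "move left": -1}.get(prompt, 0)


def toy_env_step(state: State, action: RobotAction) -> Tuple[State, float, bool]:
    """Toy 1-D world: the task is solved when the position reaches 3."""
    pos = state[0] + action
    done = pos == 3
    return (pos,), (1.0 if done else 0.0), done


mdp = SemanticMDP(toy_env_step, toy_vla, steps_per_prompt=5)
next_state, reward, done = mdp.step((0,), "move right")
idle_state, idle_reward, idle_done = mdp.step((0,), "wave")
```

In this toy setup an unrecognized prompt such as "wave" leaves the state unchanged, which loosely mirrors the point that a plausible-sounding command is only useful if the underlying policy can act on it.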

Concretely, at each step SARL picks a semantic action (a prompt) \(\ell\), the VLA turns it into robot actions, and the environment transitions and returns a reward. SARL learns a semantic Q-function \(Q_{\mathrm{sem}}(s, \ell)\) via temporal-difference backups that estimates how effective each prompt is at making task progress, and acts by sampling from the softmax over these Q-values. Finally, naively searching over all possible prompts is intractable, so SARL uses a VLM to propose a small set of candidate semantic actions from the current observation and task.
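Written out, the sampling rule and the temporal-difference update described above could take roughly the following form. The temperature \(\tau\), discount \(\gamma\), and step size \(\alpha\) are assumptions added for illustration, \(\mathcal{A}_{\mathrm{sem}}(s)\) denotes the candidate set the VLM proposes in state \(s\), and defining the next-state value \(V_{\mathrm{sem}}\) as the policy-weighted average of Q-values is one plausible choice rather than something this page specifies.

```latex
\[
\pi_{\mathrm{sem}}(\ell \mid s)
  = \frac{\exp\!\big(Q_{\mathrm{sem}}(s, \ell)/\tau\big)}
         {\sum_{\ell' \in \mathcal{A}_{\mathrm{sem}}(s)}
          \exp\!\big(Q_{\mathrm{sem}}(s, \ell')/\tau\big)}
\]
\[
Q_{\mathrm{sem}}(s, \ell) \leftarrow Q_{\mathrm{sem}}(s, \ell)
  + \alpha \Big[\, r + \gamma\, V_{\mathrm{sem}}(s') - Q_{\mathrm{sem}}(s, \ell) \Big],
\qquad
V_{\mathrm{sem}}(s') = \sum_{\ell' \in \mathcal{A}_{\mathrm{sem}}(s')}
  \pi_{\mathrm{sem}}(\ell' \mid s')\, Q_{\mathrm{sem}}(s', \ell')
\]
```

Setting \(\tau = 1\) and \(\gamma = 1\) gives a plain softmax over Q-values and an undiscounted target.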

Algorithm 1 · Semantic Action RL (SARL)

Initialize the semantic Q-function \(Q_{\mathrm{sem}}(s, \ell)\) and an empty replay buffer \(\mathcal{B}\).
For each step \(t = 1, 2, 3, \ldots\) :
1. Query the VLM for a small set of candidate prompts \(\mathcal{A}_{\mathrm{sem}}\) from the current observation and task goal.
2. Sample a prompt \(\ell\) from \(\mathrm{softmax}(Q_{\mathrm{sem}}(s, \cdot))\) — prompts with higher Q-values are more likely.
3. The VLA executes \(\ell\) as robot actions; observe reward \(r\) and next state \(s'\).
4. Add the transition \((s, \ell, r, s')\) to the buffer \(\mathcal{B}\).
5. Update \(Q_{\mathrm{sem}}\) toward \(r + V_{\mathrm{sem}}(s')\) with TD backups over \(\mathcal{B}\).
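The five steps above can be sketched as a small tabular loop. This is an illustrative toy under stated assumptions, not the authors' implementation: the Q-function is a dictionary instead of a learned network, `propose_prompts` is a fixed stand-in for the VLM, `toy_vla_env_step` folds the VLA and the robot into a 1-D world, and the constants `GAMMA`, `ALPHA`, and `TEMPERATURE` as well as the soft definition of \(V_{\mathrm{sem}}\) are choices made for this example.

```python
import math
import random
from collections import defaultdict

GAMMA = 0.95        # assumed discount factor
ALPHA = 0.5         # assumed step size for the tabular update
TEMPERATURE = 0.2   # assumed softmax temperature


def propose_prompts(state):
    """Stand-in for the VLM: always returns the same small candidate set."""
    return ["move right", "move left", "wait"]


def toy_vla_env_step(state, prompt):
    """Stand-in for 'the VLA executes the prompt': 1-D world, goal at 3."""
    delta = {"move right": 1, "move left": -1}.get(prompt, 0)
    next_state = max(0, min(3, state + delta))
    done = next_state == 3
    return next_state, (1.0 if done else 0.0), done


def softmax_probs(q_values, temperature):
    top = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def soft_value(q, state, done):
    """V_sem(s'): expected Q under the softmax policy (one possible choice)."""
    if done:
        return 0.0
    qs = [q[(state, p)] for p in propose_prompts(state)]
    probs = softmax_probs(qs, TEMPERATURE)
    return sum(p * v for p, v in zip(probs, qs))


def train(num_episodes=200, max_steps=10, batch_size=16, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)  # Q_sem(s, l), initialised to zero
    buffer = []             # replay buffer B
    for _ in range(num_episodes):
        state = 0
        for _ in range(max_steps):
            prompts = propose_prompts(state)                          # step 1
            qs = [q[(state, p)] for p in prompts]
            probs = softmax_probs(qs, TEMPERATURE)
            prompt = rng.choices(prompts, weights=probs, k=1)[0]      # step 2
            next_state, reward, done = toy_vla_env_step(state, prompt)  # step 3
            buffer.append((state, prompt, reward, next_state, done))  # step 4
            batch = rng.sample(buffer, min(batch_size, len(buffer)))
            for s, l, r, s2, d in batch:                              # step 5
                target = r + GAMMA * soft_value(q, s2, d)
                q[(s, l)] += ALPHA * (target - q[(s, l)])
            state = next_state
            if done:
                break
    return q


q_table = train()
```

After training on this toy problem, the entry for "move right" next to the goal should dominate the entry for "move left", which is the tabular analogue of learning which prompts actually make task progress.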

Tasks & Rollouts

Task suite. We evaluate on four complex, long-horizon real-world WidowX tasks, plus ten tasks on the simulated LIBERO-10 benchmark. These tasks are multi-step and require composing skills seen during pretraining (Bridge V2 for the real robot, LIBERO-90 for simulation). The base policies achieve near-0% success on most of them, making them a clean test of deployment-time adaptation.

Task 1 — Move the hammer to the plate. Then, grasp the mushroom.

Rollouts shown for: SARL (Ours), ICL VLM, DSRL, Residual RL, Zero-shot Base.

Task 2 — Move the banana to the pot on the right and sushi to the bowl.

Rollouts shown for: SARL (Ours), ICL VLM, DSRL, Residual RL, Zero-shot Base.

Task 3 — Move the left stuffed toy to the largest bowl.

Rollouts shown for: SARL (Ours), ICL VLM, DSRL, Residual RL, Zero-shot Base.

Task 4 — Move the pot lid to the towel.

Rollouts shown for: SARL (Ours), ICL VLM, DSRL, Residual RL, Zero-shot Base.

Results

SARL improves the base VLA's success rate from near 0% under the task prompt up to ~80% after only 60–100 online episodes, in both the real world and simulation. It beats action-space RL methods (DSRL, Residual RL) that are fundamentally limited by the base policy's behavior under a single fixed prompt, and it beats an in-context-learning VLM baseline that proposes plausible commands but struggles to ground them in physical behavior.

Real-world WidowX. Across all four long-horizon tasks, SARL yields the largest improvement in the generalist policy's behavior during deployment. Each data point represents 10 evaluations.

Simulated LIBERO-10 results across ten tasks.

Simulated LIBERO-10. On long-horizon LIBERO-10 tasks, SARL outperforms DSRL, Residual RL, and the ICL VLM. It successfully adapts the policy on five tasks and matches performance on another that is already near-solved. Each point represents 64 evaluations with standard error over 3 seeds.
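As a reading aid for the error bars, the sketch below shows one common way to turn per-seed success rates into a mean and a standard error. It assumes the standard error is computed across the 3 per-seed means, which this page does not spell out, and the success counts are made-up numbers for illustration only, not results from the paper.

```python
import math
import statistics


def mean_and_standard_error(per_seed_rates):
    """Mean and standard error of the mean across seeds (sample std / sqrt(n))."""
    mean = statistics.fmean(per_seed_rates)
    sem = statistics.stdev(per_seed_rates) / math.sqrt(len(per_seed_rates))
    return mean, sem


# Made-up example: successes out of 64 evaluation episodes for each of 3 seeds.
EPISODES_PER_SEED = 64
successes = [48, 52, 50]
rates = [s / EPISODES_PER_SEED for s in successes]
mean_rate, sem_rate = mean_and_standard_error(rates)
```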

SARL vs. Action-Space RL: Editing prompts unlocks new behavior

SARL vs. action-space RL. On complex, long-horizon tasks the base policy fails when zero-shot prompted — its action distribution collapses into entirely incorrect modes. Action-space methods can only filter or nudge around that distribution, so they only learn when the base policy is already close to succeeding. By editing the prompt, SARL reaches regions of the VLA's behavioral prior that are inaccessible to action-only steering, sequencing skills covered under pretraining to solve the task.

SARL vs. VLM: Grounding is as important as decomposition

SARL vs. VLM prompting. An in-context-learning VLM picks commands that are semantically meaningful but lack grounding, leading to failures: it references an out-of-distribution “hammer” by name (making the VLA grab a spoon), or picks the wrong verb (drop), prematurely releasing the sushi. Through many episodes of experience, SARL learns the grounded behavior each command induces and selects the prompt that actually works.

Training Timelapses (SARL)

Watch SARL learn each task online, sped up 30x.

Task 1 — Move the hammer to the plate, then grasp the mushroom

Task 2 — Move the banana to the pot and sushi to the bowl

Task 3 — Move the left stuffed toy to the largest bowl

Task 4 — Move the pot lid to the towel

BibTeX

@misc{bhatia2026sarl,
  author    = {Bhatia, Jagdeep Singh and Wagenmaker, Andrew and Chen, William and Levine, Sergey},
  title     = {Adapting Generalist Robot Policies with Semantic Reinforcement Learning},
  year      = {2026},
}
