dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model

Yaxuan Li¹ Zhongyi Zhou¹ Yefei Chen¹ Yaokai Xue¹ Yichen Zhu²

¹Current Robotics ²University of Toronto

Abstract

Evaluating generalist robot manipulation policies in the real world is costly and difficult to scale. Emerging world models (e.g., WorldEval, Ctrl-World) offer a promising alternative, but the reliability of evaluation inside them remains a critical bottleneck. Specifically, their visual predictions can undermine policy assessment by "self-correcting" failures into false positives or by producing artifacts under out-of-distribution controls. Even with failure-enriched data, current architectures struggle to capture action-causal dynamics, because they typically treat actions as passive conditions rather than causal drivers. To address this, we propose dWorldEval, an action-centric discrete-diffusion world model that maps visual observations, language instructions, and action chunks into a unified token space and denoises them with a single self-attention backbone in which actions function as first-class tokens. To realize reliable policy-world interaction, dWorldEval introduces a sparse keyframe memory that anchors global scene state while preserving fine-grained multi-view interaction cues, and leverages Progress-as-text to jointly generate future observations and success indicators. Extensive experiments on LIBERO, RoboTwin, and real-robot tasks demonstrate that dWorldEval significantly outperforms video diffusion baselines in action controllability, stabilizes long-horizon multi-view rollouts, and enables accurate policy ranking via automatic success estimation.
Overview

Architecture of dWorldEval
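
The abstract describes placing observations, language, and action chunks in one token space, with actions as first-class tokens. The following is a minimal sketch of how such a sequence could be assembled and noised for a masked (absorbing-state) discrete-diffusion objective. It is not the authors' implementation: the vocabulary sizes, the offset layout, the action binning, and the function names `discretize_action`, `build_sequence`, and `mask_targets` are all assumptions made for illustration.

```python
import random

# Hypothetical vocabulary sizes; the source does not specify any of these.
TEXT_VOCAB = 1000    # language-instruction tokens
IMAGE_VOCAB = 512    # visual codebook entries from some image tokenizer
ACTION_BINS = 256    # bins for discretized action values

# Each modality gets a disjoint id range inside one shared vocabulary.
TEXT_OFFSET = 0
IMAGE_OFFSET = TEXT_OFFSET + TEXT_VOCAB
ACTION_OFFSET = IMAGE_OFFSET + IMAGE_VOCAB
MASK_ID = ACTION_OFFSET + ACTION_BINS  # extra id used as the absorbing mask state


def discretize_action(value, low=-1.0, high=1.0, bins=ACTION_BINS):
    """Map a continuous action value in [low, high] to an integer bin."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * bins)
    return min(index, bins - 1)


def build_sequence(text_ids, obs_ids, action_chunk, future_obs_ids):
    """Concatenate all modalities into one token list and flag the target span."""
    context = (
        [TEXT_OFFSET + t for t in text_ids]
        + [IMAGE_OFFSET + v for v in obs_ids]
        + [ACTION_OFFSET + discretize_action(a)
           for step in action_chunk for a in step]
    )
    target = [IMAGE_OFFSET + v for v in future_obs_ids]
    is_target = [False] * len(context) + [True] * len(target)
    return context + target, is_target


def mask_targets(tokens, is_target, mask_ratio, rng):
    """Noising step: replace a random subset of target tokens with MASK_ID."""
    noised = list(tokens)
    for i, flag in enumerate(is_target):
        if flag and rng.random() < mask_ratio:
            noised[i] = MASK_ID
    return noised


if __name__ == "__main__":
    tokens, is_target = build_sequence(
        text_ids=[1, 2],
        obs_ids=[3, 4],
        action_chunk=[[0.0, 1.0]],   # one step with two action dimensions
        future_obs_ids=[5, 6, 7],
    )
    print(mask_targets(tokens, is_target, mask_ratio=0.5, rng=random.Random(0)))
```

In this layout the action tokens sit in the same sequence as the text and image tokens, so a single self-attention model trained to recover the masked positions can attend to them directly. A real system would also need the keyframe memory and progress tokens mentioned in the abstract, which this sketch leaves out.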

Real Execution vs. World Model Rollouts

1. LIBERO Tasks

Each task is shown as a real execution and a generated rollout, from 3rd-person and wrist views.

Pick up the black bowl on the cookie box and place it on the plate. (2 clips)

Pick up bowl from center place on plate.

Pick up bowl next to plate place on plate.

2. RoboTwin Tasks

Each task is shown as a real execution and a generated rollout, from top, left, and right views.

Use one arm to pick up the can and move it to beside the pot. (2 clips)

If there is one bread on the table, use one arm to grab the bread and put it into the skillet.

If there is one bread on the table, use one arm to grab the bread and put it in the basket.

Use two arms to simultaneously grab two breads and put them in the basket.

Place the container onto the plate.

Use an arm to place the empty cup on the coaster.

Stack the bowls together using two arms.

3. Real-World Tasks

Each task is shown as a real execution and a generated rollout, from top, left, and right views.

Pick up the hammer, then strike the red block.

Pass the red block to the right arm to place it on the blue mat.

Place the empty blue cup to the cup mat.

Pick up one bottle with one arm, and pick up another bottle with the other arm.

Clean the table. (8 clips)

Experimental Results

Real vs. dWorldEval success rates.

Action controllability: dWorldEval vs. baselines.

Spatiotemporal consistency: dWorldEval vs. baselines.
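
As a rough illustration of how the real vs. dWorldEval success-rate comparison listed above could be scored, the sketch below computes a Pearson correlation and a pairwise rank agreement between two lists of per-policy success rates. The source does not state which metrics the authors use, so both metrics and the helper names `pearson` and `pairwise_rank_agreement` are assumptions. The numbers in the demo are placeholders, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def pairwise_rank_agreement(real, predicted):
    """Fraction of policy pairs that both evaluations order the same way.

    Pairs tied in the real evaluation are skipped, since they define no order.
    """
    agree = 0
    total = 0
    for i in range(len(real)):
        for j in range(i + 1, len(real)):
            if real[i] == real[j]:
                continue
            total += 1
            if (real[i] - real[j]) * (predicted[i] - predicted[j]) > 0:
                agree += 1
    return agree / total if total else float("nan")


if __name__ == "__main__":
    # Placeholder success rates for three hypothetical policies.
    real_rates = [0.2, 0.5, 0.9]
    world_model_rates = [0.25, 0.45, 0.8]
    print("pearson:", pearson(real_rates, world_model_rates))
    print("rank agreement:", pairwise_rank_agreement(real_rates, world_model_rates))
```

A high correlation says the world model tracks absolute success rates, while the rank agreement says whether it would pick the same winner between any two policies. The second is the property that matters most when the goal is policy ranking.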