Robot foundation models keep hiding behind fine-tuning numbers. Wall-OSS-0.5 is trying a different approach

Most robot foundation model demos are hard to interpret because the impressive number usually comes after task-specific fine-tuning. Wall-OSS-0.5, a new open-source VLA release from X Square Robot, is interesting because the report tries to measure what the pretrained checkpoint can do before that extra adaptation step.

The setup is a 4B vision-language-action model built around a 3B VLM backbone plus action-generation components. According to the report, the pretrained checkpoint was evaluated on a 17-task real-robot suite without task-specific fine-tuning. Four tasks scored above 80 on the report's task-progress metric: block sorting, fruit sorting, ring stacking, and a held-out deformable task, rope tightening.

The part that seems more important than the raw score is the framing. In language models, nobody would accept only a fine-tuned downstream score as evidence that pretraining worked. With robots, that has been much harder because the evaluation is physical, slow, embodiment-dependent, and expensive. A real-robot zero-shot suite is a useful step toward asking the same question directly: does pretraining itself produce executable behavior, or is it mostly a better initialization?

The method is also trying to solve a specific training problem. Continuous action losses are useful for execution, but the paper argues they do not send a strong enough learning signal into the VLM backbone by themselves. Their recipe combines action-token cross entropy, multimodal cross entropy, and flow matching in one stage, using the discrete action-token path as a gradient bridge into the backbone while flow matching handles continuous actions at deployment time.
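To make the three-term recipe concrete, here is a minimal sketch of what a joint objective like that could look like. It is not the wall-x implementation: the function names, the equal loss weights, and the straight-line (rectified-flow style) noise-to-action path are all my assumptions, and it uses plain Python lists rather than a real autograd framework, so it only shows how the terms are formed and summed, not how gradients reach the backbone.

```python
import math
import random


def cross_entropy(logits, target_index):
    """Cross entropy for one token, via a numerically stable log-softmax."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[target_index]


def flow_matching_loss(predict_velocity, action, rng):
    """Flow-matching loss for one continuous action vector.

    Assumption: a straight path from Gaussian noise to the action.
    A point on that path is sampled and the predicted velocity is
    regressed onto (action - noise) with a mean squared error.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()  # uniform in [0, 1)
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    target = [a - n for n, a in zip(noise, action)]
    pred = predict_velocity(x_t, t)
    return sum((p - g) ** 2 for p, g in zip(pred, target)) / len(action)


def joint_loss(action_logits, action_token, text_logits, text_token,
               predict_velocity, action, rng,
               w_action=1.0, w_text=1.0, w_flow=1.0):
    """Weighted sum of the three terms. The weights are placeholders,
    not values taken from the report."""
    return (w_action * cross_entropy(action_logits, action_token)
            + w_text * cross_entropy(text_logits, text_token)
            + w_flow * flow_matching_loss(predict_velocity, action, rng))
```

In a real training setup, `predict_velocity` would be the action head conditioned on backbone features, and both cross-entropy terms would be computed from backbone logits. That is where the "gradient bridge" idea would come in: the discrete action-token term backpropagates through the same token-prediction path as the multimodal term.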

For reference, the code is at https://github.com/X-Square-Robot/wall-x, the paper is at https://x2robot.com/api/files/file/wall_oss_05.pdf, the project page is https://x2robot.com/oss#resources, and the Hugging Face org is https://huggingface.co/x-square-robot.

The caveat is obvious but important. Zero-shot still does not solve the hardest manipulation tasks. The report says towel folding, table setting, and charger insertion remain very low before fine-tuning, which is probably the right boundary to pay attention to. Still, seeing a robot model release lead with pre-fine-tuning real-hardware numbers feels like a healthier direction for embodied AI than another clean one-minute demo.

The open question is whether this is the right way to evaluate robot foundation models, or whether real-robot zero-shot suites are still too embodiment-specific to become a useful standard.

submitted by /u/breadislifeee