# onevl
**Repository Path**: cxh110/onevl
## Basic Information
- **Project Name**: onevl
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MulanPSL-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 1
- **Created**: 2026-05-14
- **Last Updated**: 2026-05-18
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
#

OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanations
[](https://arxiv.org/abs/2604.18486/)
[](https://xiaomi-embodied-intelligence.github.io/OneVL/)
[](https://huggingface.co/collections/xiaomi-research/onevl-models/)
[](LICENSE)
---
## Overview
**OneVL** is a Vision-Language-Action (VLA) framework for autonomous driving that achieves **state-of-the-art trajectory prediction accuracy** with **inference latency matching answer-only AR models**. It overcomes the fundamental limitations of prior latent Chain-of-Thought (CoT) methods by introducing dual-modal auxiliary decoders that supervise compact latent tokens to encode both linguistic reasoning and future scene dynamics.
### Three CoT Paradigms
> **(a) Explicit CoT** generates a full reasoning chain before the answer — interpretable but slow. **(b) Implicit CoT** compresses reasoning into opaque latent vectors — fast but not interpretable. **(c) OneVL (ours)** uses visual latent tokens `v` and language latent tokens `l`; during training, dual auxiliary decoders decode these into future frames and CoT text respectively. At inference, decoders are discarded and latents are **prefilled** into the prompt — matching the speed of (b) while recovering the interpretability of (a) in both vision and language.
### Architecture
> During training, hidden states at visual latent positions are routed to the **Visual Aux. Decoder** (predicts future-frame visual tokens at t+0.5s and t+1.0s) and at language latent positions to the **Language Aux. Decoder** (reconstructs CoT text). Both decoders are discarded at inference; all latent tokens are **prefilled** into the prompt, matching answer-only AR prediction latency.
OneVL augments **Qwen3-VL-4B-Instruct** with:
- **Latent Token Interface** — 4 visual latent tokens + 2 language latent tokens placed in the assistant response before the answer, using existing vocabulary tokens (no new special tokens).
- **Visual Auxiliary Decoder** — Predicts future-frame visual tokens at t+0.5s and t+1.0s from visual latent hidden states (Emu3.5 IBQ, 131k codebook), acting as a **world model** supervision signal.
- **Language Auxiliary Decoder** — Reconstructs explicit CoT reasoning text from language latent hidden states, conditioned on ViT visual features.
- **Prefill Inference** — Both decoders are discarded at inference; latent tokens are processed in one parallel pass with only the trajectory generated autoregressively.
### Key Innovations
- **Dual-Modal Auxiliary Decoders**: A *language auxiliary decoder* reconstructs human-readable CoT reasoning from language latent tokens; a *visual auxiliary decoder* predicts future scene frames from visual latent tokens, acting as a **world model** that grounds the latents in physical scene dynamics.
- **Prefill Inference**: All latent tokens are prefilled into the prompt context in a single parallel pass — **1.5× faster than explicit CoT on NAVSIM, 2.3× faster on ROADWork** — with latency essentially identical to answer-only AR prediction.
- **Compression Drives Generalization**: OneVL is the **only latent CoT method that outperforms explicit autoregressive CoT** across all four benchmarks.
---
## Open-Source Status
| Component | Status |
|-----------|--------|
| 📄 Technical Report | ✅ [Tech report](https://arxiv.org/abs/2604.18486) |
| ⚖️ Model Weights | ✅ [Weights](https://huggingface.co/collections/xiaomi-research/onevl-models) |
| 🔍 Inference Code | ✅ [Code](https://github.com/xiaomi-research/onevl)|
| 🏋️ Training Code | ✅ [Code](https://github.com/GeorgeLuImmortal/OneVL_training/tree/main) |
---
## Results
### Accuracy–Efficiency Pareto (NAVSIM & ROADWork)
> OneVL lands in the **green-shaded optimal corner** (lowest latency, best metric) on both benchmarks. All prior latent CoT methods (COCONUT, CODI, SIM-CoT) underperform even the AR Answer baseline on driving tasks — a critical failure that OneVL overcomes.
### NAVSIM — Full Comparison
| Method | Model Size | PDM-score ↑ | Latency (s) ↓ | Interpretability |
|--------|:----------:|:-----------:|:-------------:|:----------------:|
| AdaThinkDrive | 8B | 86.20 | — | Language |
| LaST-VLA | 8B | 87.30 | — | — |
| AR Answer | 4B | 87.47 | 4.49 | — |
| AR CoT+Answer | 4B | 88.29 | 6.58 | Language |
| COCONUT | 4B | 84.84 | 5.93 | — |
| CODI | 4B | 83.92 | 8.62 | — |
| SIM-CoT | 4B | 84.21 | 10.86 | Language |
| **OneVL** | **4B** | **88.84** | **4.46** | **Vision + Language** |
### ROADWork — Full Comparison
| Method | ADE (px) ↓ | FDE (px) ↓ | Latency (s) ↓ | Interpretability |
|--------|:----------:|:----------:|:-------------:|:----------------:|
| YNet | 22.68 | 80.78 | — | — |
| AR Answer | 15.98 | 40.29 | 4.74 | — |
| AR CoT+Answer | 13.18 | 29.98 | 10.74 | Language |
| COCONUT | 15.44 | 38.60 | 6.06 | — |
| CODI | 16.45 | 44.28 | 6.73 | — |
| SIM-CoT | 16.49 | 44.32 | 6.19 | Language |
| **OneVL** | **12.49** | **28.80** | **4.71** | **Vision + Language** |
### Impromptu — Full Comparison
| Method | ADE (m) ↓ | FDE (m) ↓ | Latency (s) ↓ | Interpretability |
|--------|:---------:|:---------:|:-------------:|:----------------:|
| Impromptu VLA | 1.60 | 4.28 | 6.10 | — |
| AR Answer | 1.46 | 4.03 | 4.24 | — |
| AR CoT+Answer | 1.42 | 3.96 | 6.84 | Language |
| COCONUT | 1.49 | 4.07 | 5.27 | — |
| CODI | 1.86 | 5.18 | 5.24 | — |
| SIM-CoT | 2.43 | 6.10 | 5.09 | Language |
| **OneVL** | **1.34** | **3.70** | **4.02** | **Vision + Language** |
### APR1 — Full Comparison
| Method | ADE (m) ↓ | FDE (m) ↓ | Latency (s) ↓ | Interpretability |
|--------|:---------:|:---------:|:-------------:|:----------------:|
| Cosmos-Reason | 2.86 | **7.42** | — | Language |
| AR Answer | 3.27 | 9.59 | 3.06 | — |
| AR CoT+Answer | 2.99 | 8.54 | 3.51 | Language |
| COCONUT | 3.29 | 9.48 | 3.76 | — |
| CODI | 3.22 | 9.25 | 3.85 | — |
| SIM-CoT | 3.40 | 9.85 | 3.78 | Language |
| **OneVL** | **2.62** | 7.53 | **3.26** | **Vision + Language** |
### Text CoT Quality (NAVSIM)
| Method | Meta Action Acc. ↑ | STS Score ↑ | LLM Judge ↑ | Avg. ↑ | Latency (s) ↓ |
|--------|:-----------------:|:-----------:|:-----------:|:------:|:------:|
| AR CoT+Answer | 73.20 | 79.75 | 81.86 | **78.27** | 6.58 |
| SIM-CoT | 67.20 | 76.25 | 78.73 | 74.06 | 10.86 |
| **OneVL** (lang. aux.) | 71.00 | 78.26 | 79.13 | 76.13 | **4.46** |
OneVL's language auxiliary decoder recovers 97% of explicit CoT quality while running at answer-only speed.
### Ablation Study (NAVSIM PDM-score)
| Model Variant | Lang. Aux. Dec. | Vis. Aux. Dec. | Staged Train | PDM-score ↑ |
|---------------|:---------------:|:--------------:|:------------:|:-----------:|
| OneVL w/o vis. dec. | ✓ | — | ✓ | 87.97 |
| OneVL w/o lang. dec. | — | ✓ | ✓ | 88.53 |
| OneVL w/o staged train | ✓ | ✓ | — | 67.13 |
| **OneVL (full)** | **✓** | **✓** | **✓** | **88.84** |
Both auxiliary decoders contribute measurably; staged training is essential (without it, performance collapses to 67.13).
---
## Qualitative Examples
### NAVSIM
> Each plot overlays ground-truth (green) and predicted (red) trajectories on the front camera view, along with predicted future frames at t+0.5s and t+1.0s decoded from the visual auxiliary decoder, and the language CoT from the language auxiliary decoder.
### ROADWork (Construction Zone Navigation)
---
## Environment Setup
**Requirements:** Python 3.10+, CUDA GPU (≥16 GB VRAM recommended for inference with aux decoders).
```bash
# 1. Create and activate virtual environment
uv venv venv/onevl --python 3.12
source venv/onevl/bin/activate
# 2. Install dependencies
pip install -r requirements.txt
```
Core packages (`requirements.txt`):
```
torch==2.10.0
torchvision==0.25.0
transformers==4.57.0
safetensors==0.7.0
Pillow>=10.0.0
omegaconf>=2.3.0
einops>=0.7.0
numpy>=1.24.0
```
> **Note:** `transformers ≥ 4.57.0` is required for `Qwen3VLForConditionalGeneration` support.
---
## Inference
### Quick Start (Single GPU)
```bash
source venv/onevl/bin/activate
# Trajectory prediction only (fastest, prefill inference)
python infer_onevl.py \
--model_path /path/to/OneVL-checkpoint \
--test_set_path test_data/navsim_test.json \
--image_base_path ""
--output_path output/navsim/results.json \
--device cuda:0 \
--num_latent 2 --num_latent_vis 4 \
--max_new_tokens 1024 --answer_prefix "[" --prefix_k 0
# With language explanation (text CoT from language aux decoder)
python infer_onevl.py \
--model_path /path/to/OneVL-checkpoint \
--test_set_path test_data/navsim_test.json \
--image_base_path ""
--output_path output/navsim/results_explain.json \
--device cuda:0 \
--num_latent 2 --num_latent_vis 4 \
--max_new_tokens 1024 --answer_prefix "[" --prefix_k 0 \
--decoder_explain --aux_visual_condition \
--c_thought 2 --max_explain_tokens 1024
# With both language + visual explanation (text CoT + future frame tokens)
python infer_onevl.py \
--model_path /path/to/OneVL-checkpoint \
--test_set_path test_data/navsim_test.json \
--image_base_path "" \
--output_path output/navsim/results_explain.json \
--device cuda:0 \
--num_latent 2 --num_latent_vis 4 \
--max_new_tokens 1024 --answer_prefix "[" --prefix_k 0 \
--decoder_explain --aux_visual_condition \
--c_thought 2 --max_explain_tokens 1024 \
--visual_decoder_explain --visual_aux_visual_condition \
--c_thought_visual 4 --max_visual_tokens 2560
```
### Multi-GPU Inference (recommended for full test sets)
```bash
export MODEL_PATH=/path/to/OneVL-checkpoint
export TEST_SET_PATH=test_data/navsim_test.json
export OUTPUT_PATH=output/navsim/navsim_results.json
bash run_infer.sh
```
The launcher auto-detects available GPUs, shards the test set, runs inference in parallel across all GPUs, and merges results.
### Per-Benchmark Scripts
```bash
bash scripts/infer_navsim.sh # NAVSIM
bash scripts/infer_ar1.sh # APR1 (trajectory only)
bash scripts/infer_roadwork.sh # ROADWork
bash scripts/infer_impromptu.sh # Impromptu
```
### For visual cot/text cot explain
```bash
bash scripts/infer_ar1_explain.sh # APR1 (language + visual explanations, use APR1 as example)
```
### Evaluation
AR1, Impromptu, and ROADWork can be evaluated directly with the bundled evaluation script:
```bash
# AR1
python eval_results.py ar1 \
--results_json output/ar1/ar1_results.json \
--test_jsonl test_data/ar1_test.jsonl
# Impromptu
python eval_results.py impromptu \
--results_json output/impromptu/impromptu_results.json \
--test_jsonl test_data/impromptu_test.jsonl
# ROADWork
python eval_results.py roadwork \
--json_path output/roadwork/roadwork_results.json
```
NAVSIM uses the official NAVSIM evaluation pipeline. First convert OneVL inference results to the NAVSIM test format, then evaluate the converted file with the [NAVSIM](https://github.com/autonomousvision/navsim) codebase:
```bash
python output/navsim/convert_to_eval.py \
--input_path output/navsim/navsim_results.json \
--ref_path output/navsim/navsim_results_eval.json \
--output_path output/navsim/navsim_results_for_eval.json
```
---
## Visualizing Future-Frame Predictions
After running inference with `--visual_decoder_explain`, the output JSON contains `visual_decoder_explain` fields encoding predicted future-frame visual tokens. Use the visualization script to decode them back to images:
```bash
source venv/onevl/bin/activate
python scripts/visualize_predict_image_tokens.py \
--predict_json output/ar1_explain/ar1_results_explain.json \
--out_dir output/ar1_explain_visualize \
--model_root /path/to/emu35_model_root \
-n 20 \
--device cuda:0
```
**Output layout per sample:**
```
output/ar1_explain_visualize/
└── sample_0000/
├── input_00.jpg # original camera frame(s)
├── input_01.jpg
├── ...
├── decoded_from_tokens_00.png # predicted future frame at t+0.5s
├── decoded_from_tokens_01.png # predicted future frame at t+1.0s
└── meta.json # CoT text + metadata
```
The script uses the self-contained `vq_decoder/` module (bundled Emu3.5 IBQ VQ-VAE) — no external Emu3.5 repo dependency required.
`--model_root` must contain `Emu3.5-VisionTokenizer/config.yaml` and `Emu3.5-VisionTokenizer/model.ckpt`. Download from [BAAI/Emu3.5-VisionTokenizer](https://huggingface.co/BAAI/Emu3.5-VisionTokenizer).
---
## Test Data Format
### JSON array (NAVSIM, ROADWork)
```json
[
{
"messages": [{"role": "user", "content": "Based on the current image, predict ..."}],
"images": ["path/to/frame.jpg"],
"GT": "[[1.0, 0.0], [2.5, 0.1], ...]"
}
]
```
### JSONL (APR1, Impromptu)
One JSON object per line, same schema as above.
---
**Environment variables** accepted by all scripts:
| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_PATH` | *(required)* | Path to the OneVL checkpoint |
| `TEST_SET_PATH` | *(required)* | Test JSON / JSONL file |
| `OUTPUT_PATH` | `/infer_results/onevl_merged.json` | Where to write merged results |
| `IMAGE_BASE_PATH` | `""` | Prepended to relative image paths |
| `NUM_LATENT` | `2` | Number of language latent tokens |
| `NUM_LATENT_VIS` | `4` | Number of visual latent tokens |
| `MAX_NEW_TOKENS` | `1024` | Max answer tokens to generate |
| `ANSWER_PREFIX` | `""` | Prefix after `` (e.g. `[` for NAVSIM, `[[` for APR1) |
| `PREFIX_K` | `0` | Prefill first K GT waypoints after `` (default: 0), only used on ROADWork |
| `DECODER_EXPLAIN` | `false` | Enable language auxiliary decoder |
| `AUX_VISUAL_CONDITION` | `true` | *(if DECODER_EXPLAIN=true)* Condition language aux decoder on ViT features (`--aux_visual_condition`) |
| `C_THOUGHT` | `2` | *(if DECODER_EXPLAIN=true)* Number of latent tokens read by language aux decoder |
| `MAX_EXPLAIN_TOKENS` | `1024` | *(if DECODER_EXPLAIN=true)* Max tokens generated by language aux decoder |
| `VISUAL_DECODER_EXPLAIN` | `false` | Enable visual auxiliary decoder |
| `VISUAL_AUX_VISUAL_CONDITION` | `true` | *(if VISUAL_DECODER_EXPLAIN=true)* Condition visual aux decoder on ViT features (`--visual_aux_visual_condition`) |
| `C_THOUGHT_VISUAL` | `4` | *(if VISUAL_DECODER_EXPLAIN=true)* Number of latent tokens read by visual aux decoder |
| `MAX_VISUAL_TOKENS` | `2560` | *(if VISUAL_DECODER_EXPLAIN=true)* Max visual tokens generated by visual aux decoder |
---
## Citation
If you find this work useful, please cite:
```bibtex
@article{lu2026onevl,
title={OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation},
author={Lu, Jinghui and Guan, Jiayi and Huang, Zhijian and Li, Jinlong and Li, Guang and Kong, Lingdong and Li, Yingyan and Wang, Han and Xu, Shaoqing and Luo, Yuechen and others},
journal={arXiv preprint arXiv:2604.18486},
year={2026},
url={https://arxiv.org/abs/2604.18486}
}
```
---
## License
This project is released under the [Apache 2.0 License](LICENSE).
Model weights are built on [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) and the visual tokenizer is from [Emu3.5-VisionTokenizer](https://huggingface.co/BAAI/Emu3.5-VisionTokenizer); please refer to their respective licenses as well.
---
## Acknowledgements
- [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL) — backbone VLM
- [Emu3.5](https://github.com/baaivision/Emu3) — IBQ visual tokenizer
- [AdaThinkDrive](https://github.com/luo-yc17/AdaThinkDrive/tree/main) — NAVSIM CoT annotations
- [NAVSIM](https://github.com/autonomousvision/navsim), [ROADWork](https://github.com/vita-epfl/roadwork), [Impromptu](https://github.com/Xiaomi-CHI/Impromptu) — evaluation benchmarks