ROVA: Robust Video Alignment
Abstract
In real-world deployment, vision-language models often encounter disturbances such as adverse weather, occlusion, and camera motion. Under such conditions, their understanding and reasoning degrade substantially, with up to a 35% drop in accuracy and a 28% drop in reasoning quality in our evaluation. We propose ROVA, a training framework that improves robustness through a robustness-aware consistency reward computed under spatio-temporal corruptions. ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability. We also release PVRBench, a new benchmark covering 12 realistic perturbation types across 27 scene categories. ROVA improves relative accuracy by at least 24% and reasoning quality by over 9% compared to baselines, while also improving performance on clean data.
Motivation: VLMs Fail Under Realistic Perturbations
Framework Components
① Structured Corruption
We inject temporally coherent, spatially grounded perturbations (weather, lighting, camera motion, occlusion) via learnable masks.
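As a minimal sketch of what "temporally coherent" means here, the snippet below darkens each frame with a spatial mask that drifts smoothly across time via an exponential moving average, rather than resampling noise independently per frame. The function name, the EMA scheme, and all parameters are illustrative assumptions, not the paper's actual learnable-mask implementation.

```python
import numpy as np

def temporally_coherent_brightness(video, strength=0.5, smooth=0.9, seed=0):
    """Darken frames with a spatial mask that drifts smoothly over time.

    `video` is a (T, H, W, C) float array in [0, 1]. The mask is a random
    field low-pass filtered across time (EMA with factor `smooth`), so
    consecutive frames see correlated, not i.i.d., perturbations.
    Illustrative stand-in for the paper's learnable masks.
    """
    rng = np.random.default_rng(seed)
    t, h, w, _ = video.shape
    mask = rng.random((h, w))
    out = np.empty_like(video)
    for i in range(t):
        # EMA keeps consecutive masks correlated -> temporal coherence
        mask = smooth * mask + (1.0 - smooth) * rng.random((h, w))
        out[i] = video[i] * (1.0 - strength * mask[..., None])
    return np.clip(out, 0.0, 1.0)
```

The same EMA trick applies to other corruption families (e.g., drifting occluder positions for occlusion, or a slowly varying translation offset for camera shake).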
② Self-Reflective Difficulty
The model evaluates its own predictions on corrupted samples, discarding easy ones, storing hard ones in a memory buffer, and training on informative samples.
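The routing logic above can be sketched as a small filter plus a bounded replay buffer. The confidence threshold, buffer size, and function names are assumptions for illustration; the actual self-reflection step in ROVA is performed by the model itself, not a scalar score.

```python
import random
from collections import deque

def difficulty_aware_step(model_score, sample, buffer, easy_thresh=0.9):
    """Route one corrupted sample by self-assessed difficulty.

    `model_score` stands in for the model's confidence in its own
    prediction on the corrupted sample. High-confidence (easy) samples
    are discarded; the rest enter a bounded hard-sample buffer.
    Returns True if the sample was kept.
    """
    if model_score >= easy_thresh:
        return False          # easy: skip, contributes little signal
    buffer.append(sample)     # hard: keep for training
    return True

def sample_batch(buffer, k):
    """Draw an informative training batch from the hard-sample buffer."""
    k = min(k, len(buffer))
    return random.sample(list(buffer), k)
```

Using a `deque(maxlen=...)` as the buffer gives first-in-first-out eviction, so the pool tracks what is hard for the model's current capability rather than accumulating stale samples.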
③ Dual-Branch Alignment
Clean and perturbed branches are aligned via GRPO using format, accuracy, and reasoning-similarity rewards.
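A hedged sketch of how the three reward terms might combine into the scalar that GRPO optimizes on a perturbed-branch rollout. The weights, the `<answer>` tag format, and the token-overlap similarity are illustrative placeholders; the actual method presumably uses its own format check and an embedding-based reasoning similarity.

```python
def combined_reward(answer, reference, reasoning, clean_reasoning,
                    w_fmt=0.2, w_acc=0.5, w_sim=0.3):
    """Scalar reward for one perturbed-branch rollout (illustrative weights).

    Combines the three terms named above: a format check, answer accuracy
    against the reference, and a reasoning-similarity term that ties the
    perturbed branch to the clean branch's trace. Jaccard token overlap
    is a simple stand-in for the real similarity measure.
    """
    text = answer.strip()
    fmt = 1.0 if text.startswith("<answer>") and text.endswith("</answer>") else 0.0
    acc = 1.0 if reference.lower() in answer.lower() else 0.0
    a, b = set(reasoning.lower().split()), set(clean_reasoning.lower().split())
    sim = len(a & b) / max(len(a | b), 1)   # Jaccard overlap of tokens
    return w_fmt * fmt + w_acc * acc + w_sim * sim
```

Because the similarity term rewards the perturbed branch for reproducing the clean branch's reasoning trace, the policy is pushed toward perturbation-invariant reasoning rather than merely perturbation-invariant answers.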
PVRBench: Perturbed Video Reasoning Benchmark
Main Results on PVRBench
| Model | Size | Lighting | Occlusion | Shake | Weather | Avg. Pert. | Clean | Reasoning ↑ |
|---|---|---|---|---|---|---|---|---|
| GPT-4o | - | 0.47 | 0.50 | 0.52 | 0.51 | 0.50 | 0.59 | 3.82 |
| Gemini-3-Pro | - | 0.57 | 0.52 | 0.54 | 0.55 | 0.55 | 0.61 | 3.91 |
| Video-R1 72B | 72B | 0.51 | 0.45 | 0.49 | 0.49 | 0.49 | 0.58 | 3.68 |
| Embodied-R 7B | 7B | 0.45 | 0.38 | 0.42 | 0.43 | 0.42 | 0.54 | 3.45 |
| ROVA (ours) 7B | 7B | 0.52 | 0.46 | 0.49 | 0.51 | 0.50 | 0.55 | 3.58 |
| Qwen2.5-VL 7B | 7B | 0.35 | 0.28 | 0.34 | 0.34 | 0.33 | 0.51 | 3.41 |
| + ROVA 7B | 7B | 0.48 | 0.43 | 0.47 | 0.49 | 0.47 | 0.53 | 3.52 |
| ROVA (ours) 72B | 72B | 0.57 | 0.53 | 0.56 | 0.56 | 0.56 | 0.59 | 3.72 |
Efficiency & Data Economy
| Method | Data | GPUs | GPU-h | Avg. Acc. |
|---|---|---|---|---|
| Video-R1 | 425K | 8×A100 | 339.2 | 0.49 |
| Naïve Dual | 32.5K | 4×A100 | 142.8 | 0.48 |
| ROVA | 32.5K | 4×A100 | 134.4 | 0.53 |
Ablation and Analysis
Generalization to Other Benchmarks
Citation
@article{he2026rova,
  title={Are Video Reasoning Models Ready to Go Outside?},
  author={He, Yangfan and Boo, Changgyu and Yoon, Jaehong},
  journal={arXiv preprint arXiv:2601.18577},
  year={2026}
}