Are Video Reasoning Models Ready to Go Outside?

1 Nanyang Technological University, Singapore 2 Korea University, South Korea

ROVA: Robust Video Alignment

Figure 2. Overview of the ROVA framework: (1) structured spatio-temporal corruption, (2) self-reflective difficulty-aware training, and (3) dual-branch alignment with GRPO.

Abstract

In real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion. Under such conditions, their understanding and reasoning degrade substantially — up to 35% accuracy drop and 28% reasoning quality drop in our evaluation. We propose ROVA, a training framework that improves robustness via a robustness-aware consistency reward under spatio-temporal corruptions. ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability. We also introduce PVRBench, a new benchmark with 12 realistic perturbation types across 27 scene categories. ROVA boosts relative accuracy by at least 24% and reasoning by over 9% compared to baselines, while also improving performance on clean data.

Motivation: VLMs Fail Under Realistic Perturbations

Figure 1. Failure cases of Qwen2.5-VL under occlusion (left) and fog (right): the model incorrectly predicts “Turn Left” or “Turn Right” instead of the ground-truth “Go Ahead”.

Framework Components

(1) Structured Corruption

We inject temporally coherent, spatially grounded perturbations (weather, lighting, camera motion, occlusion) via learnable masks.
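A minimal sketch of what a temporally coherent, spatially grounded corruption could look like. This is a hypothetical illustration, not the paper's implementation: the mask center follows a smoothed random walk so the perturbation moves consistently across frames rather than being resampled i.i.d. per frame, and a fog-like blend stands in for the learnable perturbation.

```python
import numpy as np

def corrupt_video(frames, severity=0.5, seed=0):
    """Apply a temporally coherent, spatially localized 'fog' perturbation.

    Hypothetical sketch of structured corruption: the mask center drifts
    smoothly over time, so occlusion is consistent across frames.
    frames: uint8 array of shape (T, H, W, C).
    """
    rng = np.random.default_rng(seed)
    T, H, W, C = frames.shape
    center = np.array([H / 2, W / 2], dtype=float)
    velocity = rng.normal(0, 2, size=2)
    out = frames.astype(float).copy()
    ys, xs = np.mgrid[0:H, 0:W]
    sigma = 0.25 * min(H, W)
    for t in range(T):
        # Smooth random-walk trajectory for the mask center.
        velocity = 0.9 * velocity + rng.normal(0, 0.5, size=2)
        center = np.clip(center + velocity, [0, 0], [H - 1, W - 1])
        # Gaussian spatial mask around the drifting center.
        dist2 = (ys - center[0]) ** 2 + (xs - center[1]) ** 2
        mask = np.exp(-dist2 / (2 * sigma ** 2))[..., None]
        # Blend toward a uniform gray "fog" where the mask is strong.
        out[t] = (1 - severity * mask) * out[t] + severity * mask * 200.0
    return out.astype(frames.dtype)
```

In the actual framework the masks are learnable rather than hand-crafted, but the key property illustrated here is the same: corruption that is consistent in both space and time.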

(2) Self-Reflective Difficulty

The model evaluates its own predictions on corrupted samples, discarding easy ones, storing hard ones in a memory buffer, and training on informative samples.
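The buffering logic above can be sketched as follows. The class name, threshold, and interface are hypothetical, not from the paper: the model scores its own prediction on each corrupted sample, confidently-correct ("easy") samples are discarded, and hard samples enter a bounded replay buffer from which training batches are drawn hardest-first.

```python
from collections import deque

class DifficultyBuffer:
    """Hypothetical sketch of difficulty-aware sample selection."""

    def __init__(self, easy_threshold=0.9, capacity=1000):
        self.easy_threshold = easy_threshold
        self.buffer = deque(maxlen=capacity)  # oldest entries evicted first

    def observe(self, sample, self_score):
        # self_score: the model's own correctness estimate in [0, 1].
        # Easy samples (high self-score) are discarded; hard ones are kept.
        if self_score < self.easy_threshold:
            self.buffer.append((sample, self_score))

    def next_batch(self, k):
        # Prioritize the hardest samples (lowest self-score).
        ranked = sorted(self.buffer, key=lambda item: item[1])
        return [sample for sample, _ in ranked[:k]]
```

Because the buffer is bounded and re-ranked each draw, the training distribution tracks the model's evolving capability: samples it has mastered stop entering, and the hardest remaining ones surface first.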

(3) Dual-Branch Alignment

Clean and perturbed branches are aligned via GRPO using format, accuracy, and reasoning-similarity rewards.
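A minimal sketch of the reward side of this step. The weights and the weighted-sum form are assumptions for illustration (the paper only names the three reward terms); the group-relative normalization is the standard GRPO recipe of standardizing rewards within the group of rollouts for the same prompt.

```python
import numpy as np

def combined_reward(fmt_ok, answer_correct, reasoning_sim,
                    w_fmt=0.1, w_acc=1.0, w_reason=0.5):
    """Hypothetical weighted sum of the three reward terms:
    format compliance, answer accuracy, and clean/perturbed
    reasoning similarity (reasoning_sim in [0, 1])."""
    return (w_fmt * float(fmt_ok)
            + w_acc * float(answer_correct)
            + w_reason * reasoning_sim)

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: standardize rewards
    within the group of rollouts sampled for one prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```

Rollouts from both the clean and perturbed branches can be scored with the same reward, so a response that answers correctly but reasons inconsistently across branches receives a lower group-relative advantage.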

PVRBench: Perturbed Video Reasoning Benchmark

Table 1. Comparison with existing benchmarks: PVRBench is the only one covering synthetic, spatial, and temporal perturbations across 27 scene categories.
Figure 3. Example perturbations in PVRBench across the four categories: lighting, camera motion, occlusion, and weather.

Main Results on PVRBench

Model        | Size | Light. | Occ. | Shake | Weather | Avg. Pert. | Clean | Reasoning↑
-------------|------|--------|------|-------|---------|------------|-------|-----------
GPT-4o       | –    | 0.47   | 0.50 | 0.52  | 0.51    | 0.50       | 0.59  | 3.82
Gemini-3-Pro | –    | 0.57   | 0.52 | 0.54  | 0.55    | 0.55       | 0.61  | 3.91
Video-R1     | 72B  | 0.51   | 0.45 | 0.49  | 0.49    | 0.49       | 0.58  | 3.68
Embodied-R   | 7B   | 0.45   | 0.38 | 0.42  | 0.43    | 0.42       | 0.54  | 3.45
ROVA (ours)  | 7B   | 0.52   | 0.46 | 0.49  | 0.51    | 0.50       | 0.55  | 3.58
Qwen2.5-VL   | 7B   | 0.35   | 0.28 | 0.34  | 0.34    | 0.33       | 0.51  | 3.41
+ ROVA       | 7B   | 0.48   | 0.43 | 0.47  | 0.49    | 0.47       | 0.53  | 3.52
ROVA (ours)  | 72B  | 0.57   | 0.53 | 0.56  | 0.56    | 0.56       | 0.59  | 3.72
Table 2. Accuracy under four perturbation types and reasoning quality (0–5 scale). ROVA consistently outperforms baselines.

Efficiency & Data Economy

Method     | Data  | GPUs   | GPU-h | Avg. Acc.
-----------|-------|--------|-------|----------
Video-R1   | 425K  | 8×A100 | 339.2 | 0.49
Naïve Dual | 32.5K | 4×A100 | 142.8 | 0.48
ROVA       | 32.5K | 4×A100 | 134.4 | 0.53
Table 3. ROVA achieves higher accuracy with 60% fewer GPU hours and roughly 13× less data (32.5K vs. 425K samples) than Video-R1.

Ablation and Analysis

Figure 5a. Component ablation: each ROVA component contributes, with the reasoning reward giving the largest gain.
Figure 5b. Mask-style generalization: structured masks generalize to unseen perturbations (red bars), outperforming random masking.
Figure 5c. Self-reflective evaluation dynamics: easy samples are increasingly discarded, while difficult samples are re-evaluated and re-classified as training progresses.

Generalization to Other Benchmarks

Figure 19. ROVA improves accuracy on VisBench and UrbanVideo under perturbations by +14.6% and +12.9% on average, and also boosts clean performance.

Citation

@article{he2026rova,
  title={Are Video Reasoning Models Ready to Go Outside?},
  author={He, Yangfan and Boo, Changgyu and Yoon, Jaehong},
  journal={arXiv preprint arXiv:2601.18577},
  year={2026}
}