Are Video Reasoning Models Ready to Go Outside?

1 Nanyang Technological University, Singapore 2 Korea University, South Korea

ROVA: Robust Video Alignment

Figure 2. Overview of the ROVA framework: (1) structured spatio-temporal corruption, (2) self-reflective difficulty-aware training, and (3) dual-branch alignment with GRPO.

Abstract

In real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion. Under such conditions, their understanding and reasoning degrade substantially — up to 35% accuracy drop and 28% reasoning quality drop in our evaluation. We propose ROVA, a training framework that improves robustness via a robustness-aware consistency reward under spatio-temporal corruptions. ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability. We also introduce PVRBench, a new benchmark with 12 realistic perturbation types across 27 scene categories. ROVA boosts relative accuracy by at least 24% and reasoning by over 9% compared to baselines, while also improving performance on clean data.

Motivation: VLMs Fail Under Realistic Perturbations

Figure 1. Failure cases of Qwen2.5-VL under occlusion (left) and fog (right): the model incorrectly predicts “Turn Left” or “Turn Right” instead of the ground-truth “Go Ahead”.

Framework Components

(1) Structured Corruption

We inject temporally coherent, spatially grounded perturbations (weather, lighting, camera motion, occlusion) via learnable masks.
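A minimal sketch of what a temporally coherent, spatially grounded corruption could look like. This is a hypothetical illustration, not the paper's implementation: the mask center follows a smoothed random walk so the perturbation moves consistently across frames rather than being resampled i.i.d. per frame, and a fog-like blend stands in for the learnable perturbation.

```python
import numpy as np

def corrupt_video(frames, severity=0.5, seed=0):
    """Apply a temporally coherent, spatially localized 'fog' perturbation.

    Hypothetical sketch of structured corruption: the mask center drifts
    smoothly over time, so occlusion is consistent across frames.
    frames: uint8 array of shape (T, H, W, C).
    """
    rng = np.random.default_rng(seed)
    T, H, W, C = frames.shape
    center = np.array([H / 2, W / 2], dtype=float)
    velocity = rng.normal(0, 2, size=2)
    out = frames.astype(float).copy()
    ys, xs = np.mgrid[0:H, 0:W]
    sigma = 0.25 * min(H, W)
    for t in range(T):
        # Smooth random-walk trajectory for the mask center.
        velocity = 0.9 * velocity + rng.normal(0, 0.5, size=2)
        center = np.clip(center + velocity, [0, 0], [H - 1, W - 1])
        # Gaussian spatial mask around the drifting center.
        dist2 = (ys - center[0]) ** 2 + (xs - center[1]) ** 2
        mask = np.exp(-dist2 / (2 * sigma ** 2))[..., None]
        # Blend toward a uniform gray "fog" where the mask is strong.
        out[t] = (1 - severity * mask) * out[t] + severity * mask * 200.0
    return out.astype(frames.dtype)
```

In the actual framework the masks are learnable rather than hand-crafted, but the key property illustrated here is the same: corruption that is consistent in both space and time.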

(2) Self-Reflective Difficulty

The model evaluates its own predictions on corrupted samples, discarding easy ones, storing hard ones in a memory buffer, and training on informative samples.
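The buffering logic above can be sketched as follows. The class name, threshold, and interface are hypothetical, not from the paper: the model scores its own prediction on each corrupted sample, confidently-correct ("easy") samples are discarded, and hard samples enter a bounded replay buffer from which training batches are drawn hardest-first.

```python
from collections import deque

class DifficultyBuffer:
    """Hypothetical sketch of difficulty-aware sample selection."""

    def __init__(self, easy_threshold=0.9, capacity=1000):
        self.easy_threshold = easy_threshold
        self.buffer = deque(maxlen=capacity)  # oldest entries evicted first

    def observe(self, sample, self_score):
        # self_score: the model's own correctness estimate in [0, 1].
        # Easy samples (high self-score) are discarded; hard ones are kept.
        if self_score < self.easy_threshold:
            self.buffer.append((sample, self_score))

    def next_batch(self, k):
        # Prioritize the hardest samples (lowest self-score).
        ranked = sorted(self.buffer, key=lambda item: item[1])
        return [sample for sample, _ in ranked[:k]]
```

Because the buffer is bounded and re-ranked each draw, the training distribution tracks the model's evolving capability: samples it has mastered stop entering, and the hardest remaining ones surface first.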

(3) Dual-Branch Alignment

Clean and perturbed branches are aligned via GRPO using format, accuracy, and reasoning-similarity rewards.
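A minimal sketch of the reward side of this step. The weights and the weighted-sum form are assumptions for illustration (the paper only names the three reward terms); the group-relative normalization is the standard GRPO recipe of standardizing rewards within the group of rollouts for the same prompt.

```python
import numpy as np

def combined_reward(fmt_ok, answer_correct, reasoning_sim,
                    w_fmt=0.1, w_acc=1.0, w_reason=0.5):
    """Hypothetical weighted sum of the three reward terms:
    format compliance, answer accuracy, and clean/perturbed
    reasoning similarity (reasoning_sim in [0, 1])."""
    return (w_fmt * float(fmt_ok)
            + w_acc * float(answer_correct)
            + w_reason * reasoning_sim)

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: standardize rewards
    within the group of rollouts sampled for one prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```

Rollouts from both the clean and perturbed branches can be scored with the same reward, so a response that answers correctly but reasons inconsistently across branches receives a lower group-relative advantage.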

PVRBench: Perturbed Video Reasoning Benchmark

Table 1. Comparison with existing benchmarks: PVRBench is the only one covering synthetic, spatial, and temporal perturbations across 27 scene categories.
Figure 3. Example perturbations in PVRBench across the four categories: lighting, camera motion, occlusion, and weather.

Main Results on PVRBench

Model        | Size | Light. | Occ. | Shake | Weather | Avg. Pert. | Clean | Reasoning↑
-------------|------|--------|------|-------|---------|------------|-------|-----------
GPT-4o       | –    | 0.47   | 0.50 | 0.52  | 0.51    | 0.50       | 0.59  | 3.82
Gemini-3-Pro | –    | 0.57   | 0.52 | 0.54  | 0.55    | 0.55       | 0.61  | 3.91
Video-R1     | 72B  | 0.51   | 0.45 | 0.49  | 0.49    | 0.49       | 0.58  | 3.68
Embodied-R   | 7B   | 0.45   | 0.38 | 0.42  | 0.43    | 0.42       | 0.54  | 3.45
ROVA (ours)  | 7B   | 0.52   | 0.46 | 0.49  | 0.51    | 0.50       | 0.55  | 3.58
Qwen2.5-VL   | 7B   | 0.35   | 0.28 | 0.34  | 0.34    | 0.33       | 0.51  | 3.41
+ ROVA       | 7B   | 0.48   | 0.43 | 0.47  | 0.49    | 0.47       | 0.53  | 3.52
ROVA (ours)  | 72B  | 0.57   | 0.53 | 0.56  | 0.56    | 0.56       | 0.59  | 3.72
Table 2. Accuracy under four perturbation types and reasoning quality (0–5 scale). ROVA consistently outperforms baselines.

Efficiency & Data Economy

Method     | Data  | GPUs   | GPU-h | Avg. Acc.
-----------|-------|--------|-------|----------
Video-R1   | 425K  | 8×A100 | 339.2 | 0.49
Naïve Dual | 32.5K | 4×A100 | 142.8 | 0.48
ROVA       | 32.5K | 4×A100 | 134.4 | 0.53
Table 3. ROVA achieves higher accuracy with 60% fewer GPU hours and roughly 13× less data (32.5K vs. 425K samples) than Video-R1.

Ablation and Analysis

Figure 5a. Component ablation: each ROVA component contributes, with the reasoning reward giving the largest gain.
Figure 5b. Mask-style generalization: structured masks generalize to unseen perturbations (red bars), outperforming random masking.
Figure 5c. Self-reflective evaluation dynamics: easy samples are increasingly discarded, while difficult samples are re-evaluated and re-classified as training progresses.

Generalization to Other Benchmarks

Figure 19. ROVA improves accuracy on VisBench and UrbanVideo under perturbations by +14.6% and +12.9% on average, and also boosts clean performance.

Citation

@article{he2026rova,
  title={Are Video Reasoning Models Ready to Go Outside?},
  author={He, Yangfan and Boo, Changgyu and Yoon, Jaehong},
  journal={arXiv preprint arXiv:2601.18577},
  year={2026}
}