Are Video Reasoning Models Ready to Go Outside?

ROVA — Robust Video Alignment for Real-World Video Reasoning
1 Nanyang Technological University, Singapore; 2 Korea University, South Korea
Corresponding author

TL;DR

In real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion. Under such conditions, their understanding and reasoning degrade substantially. We propose ROVA, a training framework that improves robustness via a robustness-aware consistency reward under spatio-temporal corruptions. ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability. We also introduce PVRBench, a new benchmark with 12 realistic perturbation types across 27 scene categories for evaluating accuracy and reasoning quality under realistic disturbances.

24%+ Accuracy Boost
9%+ Reasoning Gain
60% Fewer GPU Hours

ROVA: Robust Video Alignment

Figure 2: ROVA Framework Overview

Figure 2. ROVA consists of three stages: (1) Structured spatio-temporal corruption generates realistic perturbations; (2) Self-reflective difficulty-aware training continuously re-estimates sample difficulty via a memory buffer; (3) Dual-branch alignment uses clean and corrupted paths with GRPO to optimize robustness-aware consistency, format, and accuracy rewards.

VLMs Fail Under Realistic Perturbations

Figure 1: Failure cases under occlusion and fog

Figure 1. Failure cases of state-of-the-art VLMs (e.g., Qwen2.5-VL) under occlusion (left) and fog (right). The model incorrectly predicts “Turn Left” or “Turn Right” instead of the ground-truth “Go Ahead”, demonstrating how visual corruptions drastically impair video reasoning capability.

Framework Components

🌧️

① Structured Corruption

Injects temporally coherent, spatially grounded perturbations — weather effects (rain, snow, hail, storm), photometric shifts (dusk, night, overexposure), camera shake, and random occlusion — to simulate realistic visual disturbances.
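To make "temporally coherent, spatially grounded" concrete, here is a minimal NumPy sketch: one occluding box drifts smoothly across frames instead of being resampled per frame, and a single global gain darkens the whole clip. It is an illustration only, not the ROVA corruption pipeline. The function names `drifting_occlusion` and `darken`, the box size, the velocity, and the gain are all hypothetical.

```python
import numpy as np

def drifting_occlusion(video, box=(16, 16), velocity=(1, 2), seed=0):
    """Black out a box that drifts smoothly across frames.

    video: uint8 array of shape (T, H, W, C). Returns a corrupted copy.
    All parameters are illustrative assumptions, not values from ROVA.
    """
    rng = np.random.default_rng(seed)
    T, H, W, _ = video.shape
    bh, bw = box
    # Sample the starting corner once so the mask is coherent over time.
    y0 = int(rng.integers(0, H - bh + 1))
    x0 = int(rng.integers(0, W - bw + 1))
    out = video.copy()
    for t in range(T):
        # Move linearly and wrap so the box always stays inside the frame.
        y = (y0 + velocity[0] * t) % (H - bh + 1)
        x = (x0 + velocity[1] * t) % (W - bw + 1)
        out[t, y:y + bh, x:x + bw] = 0
    return out

def darken(video, gain=0.4):
    """Apply one global brightness gain to every frame (a crude 'dusk' shift)."""
    scaled = video.astype(np.float32) * gain
    return np.clip(scaled, 0, 255).astype(np.uint8)

# Toy clip: 8 uniform gray frames of size 64x64 with 3 channels.
video = np.full((8, 64, 64, 3), 200, dtype=np.uint8)
corrupted = darken(drifting_occlusion(video))
```

Because the box position is a deterministic function of the frame index, neighboring frames stay consistent with each other, which is the property that separates a structured corruption from per-frame random masking.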

🧠

② Self-Reflective Difficulty

The model evaluates its own predictions on corrupted samples. Easy samples are discarded; hard ones are stored in a memory buffer and periodically re-evaluated, ensuring training focuses on the most informative examples.
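The discard/store/re-evaluate loop described above can be sketched as a small buffer class. This is a simplified illustration under stated assumptions: the class name `DifficultyBuffer`, the threshold `easy_threshold`, the schedule `recheck_every`, and the idea of a scalar self-evaluation score in [0, 1] are all hypothetical and not taken from the ROVA implementation.

```python
from dataclasses import dataclass, field

@dataclass
class DifficultyBuffer:
    """Keeps hard samples around and re-checks them later.

    `easy_threshold` and `recheck_every` are hypothetical knobs.
    """
    easy_threshold: float = 0.9
    recheck_every: int = 100
    hard: dict = field(default_factory=dict)  # sample_id -> last score

    def triage(self, sample_id, score):
        """Route one sample given the model's own score on its corrupted version."""
        if score >= self.easy_threshold:
            self.hard.pop(sample_id, None)  # easy now: drop it from the buffer
            return "discard"
        self.hard[sample_id] = score        # still hard: remember it for later
        return "keep"

    def recheck(self, step, score_fn):
        """Every `recheck_every` steps, re-score buffered samples with the current model."""
        if step % self.recheck_every != 0:
            return []
        dropped = []
        for sample_id in list(self.hard):   # copy keys: triage may mutate the dict
            if self.triage(sample_id, score_fn(sample_id)) == "discard":
                dropped.append(sample_id)
        return dropped

buf = DifficultyBuffer()
buf.triage("clip_a", 0.95)   # already easy, so it is discarded
buf.triage("clip_b", 0.30)   # hard, so it is stored
# Later in training the (hypothetical) model scores everything at 0.95.
dropped = buf.recheck(step=100, score_fn=lambda sample_id: 0.95)
```

Re-scoring with the current model, rather than freezing the first difficulty estimate, is what lets the curriculum follow the model's evolving capability.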

🔀

③ Dual-Branch Alignment

Clean and perturbed branches share weights and are jointly optimized via GRPO using format, accuracy, and robustness-aware consistency rewards to minimize reasoning divergence.
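A rough sketch of how three reward terms could feed a GRPO-style update: rewards for a group of sampled responses are summed per response and then standardized within the group. The within-group standardization is the usual GRPO advantage; everything else is an assumption made for illustration, including the equal weights, the binary reward terms, and the use of exact answer matching between branches as a stand-in for the robustness-aware consistency reward.

```python
import numpy as np

def total_reward(fmt_ok, correct, consistent, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of three binary reward terms (weights are assumptions)."""
    w_f, w_a, w_c = weights
    return w_f * float(fmt_ok) + w_a * float(correct) + w_c * float(consistent)

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize rewards within one sampled group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# One group of 4 responses sampled for the same perturbed video and question.
ground_truth = "Go Ahead"
clean_answer = "Go Ahead"  # answer produced by the clean branch
perturbed_answers = ["Go Ahead", "Turn Left", "Go Ahead", "Turn Right"]

rewards = [
    total_reward(
        fmt_ok=True,
        correct=(answer == ground_truth),
        consistent=(answer == clean_answer),  # stand-in consistency check
    )
    for answer in perturbed_answers
]
advantages = group_relative_advantages(rewards)
```

Responses that stay correct and agree with the clean branch end up with positive advantages, while those that flip to "Turn Left" or "Turn Right" under perturbation are pushed down.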

PVRBench: Perturbed Video Reasoning Benchmark

Table 1: Comparison with existing benchmarks

Table 1. Comparison with existing video benchmarks. PVRBench is the first to provide synthetic, spatial, and temporal perturbations across 27 scene categories with 9K videos and 52K QA pairs, enabling comprehensive robustness evaluation.

Main Results on PVRBench

Table 2: Main results on PVRBench

Table 2. Answer accuracy and reasoning quality across four perturbation categories on PVRBench. ROVA consistently outperforms both proprietary models (GPT-4o, Gemini-3-Pro, Claude-3.5-Sonnet) and open-source video reasoning models (Video-R1, Embodied-R, LLaVA-Video) at the 7B scale, while the 72B variant achieves state-of-the-art performance.

Training Efficiency & Data Economy

Table 3: Efficiency comparison

Table 3. ROVA achieves higher accuracy using only 32.5K training samples and 4×A100 GPUs (134 GPU-hours): about 60% fewer GPU-hours and roughly 13× less data than Video-R1 (425K samples, 8×A100, 339 GPU-hours).
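The GPU-hour saving follows directly from the two totals quoted in the caption:

```latex
\[
1 - \frac{134}{339} \approx 1 - 0.395 = 0.605
\quad \text{(about 60\% fewer GPU-hours)}
\]
```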

Ablation Studies

Figure 5a: Component ablation

Figure 5a. Contribution of each ROVA component. The robustness-aware consistency reward yields the largest individual gain, and combining all components achieves the best overall performance.

Figure 5b: Mask style generalization

Figure 5b. Structured masks generalize to unseen perturbation types (hatched bars), consistently outperforming random masking across all corruption categories.

Figure 5c: Self-reflective evaluation dynamics

Figure 5c. Dynamics of self-reflective evaluation throughout training. As the model improves, easy samples are progressively discarded while difficult samples are periodically re-evaluated and re-classified, ensuring the training curriculum continuously adapts to the model's evolving capability.

Cross-Benchmark Generalization

Figure 19: VisBench and UrbanVideo results

Figure 19. ROVA generalizes beyond PVRBench: under perturbations, accuracy improves by 14.6% on VisBench and 12.9% on UrbanVideo, with further gains on clean (unperturbed) evaluation. Both 7B and 13B configurations improve consistently across all perturbation types.

Case Study: Reasoning Under Rain

Case Study: Navigation and Trajectory Planning under Heavy Rain

Case Study. Two qualitative examples of ROVA reasoning under heavy rain perturbation. Left: Navigation under heavy rain — the model correctly identifies the path from a park to a store through occluded urban scenes. Right: Trajectory planning — the model reasons about spatial relationships in rain-degraded video frames to determine the correct approach direction toward a tower crane. Both examples demonstrate coherent question understanding, accurate answer generation, and faithful chain-of-thought reasoning despite severe visual degradation.

Citation

If you find this work useful in your research, please consider citing:

@article{he2026rova,
  title   = {Are Video Reasoning Models Ready to Go Outside?},
  author  = {He, Yangfan and Boo, Changgyu and Yoon, Jaehong},
  journal = {arXiv preprint arXiv:2603.10652},
  year    = {2026}
}