[SPATIALUNCERTAIN]

Seeing Isn’t Knowing:
Do VLMs Know When Not to Answer Spatial Questions (and Why)?

Yue Zhang1, Zun Wang1, Han Lin1, Yonatan Bitton2, Idan Szpektor2, Mohit Bansal1

1UNC Chapel Hill · 2Google Research

From a 3D scene, we render multiple 2D observations and construct three conditions: (a) Clean View — sufficient information, answerable; (b) Occlusion — missing information, requires abstention; (c) Perspective Ambiguity — unreliable information, requires abstention and informative-view selection.

Abstract

Visual observations are inherently limited views of a 3D world: occlusion hides objects and perspective makes geometry unreliable. Yet existing spatial reasoning benchmarks assume that observations are sufficient and evaluate only answer correctness, not whether models know when a question cannot be reliably answered. We introduce SPATIALUNCERTAIN, a controlled 3D benchmark with two observation challenges, occlusion (hiding target information) and perspective ambiguity (misleading visual cues), where each question is answerable under clean views but requires abstention under the challenge condition. Across frontier VLMs (GPT-4o, GPT-5.4, Gemini-3.0-Flash, Qwen2.5-VL, InternVL), we find two consistent failure modes: overconfident answering (~30% accuracy on unanswerable cases under occlusion, <10% under perspective ambiguity) and near-random performance on selecting viewpoints that would resolve the ambiguity. Structured prompting partially improves abstention but reduces accuracy on answerable questions, while fine-tuning on diverse ambiguity conditions yields more robust handling of observational uncertainty, suggesting that this capability is learnable but requires exposure to varied uncertainty signals.

Constructing SPATIALUNCERTAIN

Occlusion Pipeline

Step 1

3D Scene

🎯 Target: Full-length mirror

A clean 3D indoor scene with the target object fully visible.

Step 2

+ Occluder Injected

🎯 Target: Full-length mirror 🧱 Occluder: Wardrobe

An occluder object is added between the camera and the target.
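
As a rough illustration of this step, the sketch below places an occluder on the camera-to-target line and checks that it actually blocks the line of sight. It is not the benchmark's pipeline: the midpoint placement, the box-shaped occluder, and the helper names `place_occluder`, `box_around`, and `segment_hits_box` are all assumptions made for the example, and all coordinates are made up.

```python
def place_occluder(camera, target, t=0.5):
    """Return the point a fraction t of the way from camera to target.

    Assumption: the occluder is centred on the line of sight (t=0.5 is the midpoint).
    """
    return tuple(c + t * (g - c) for c, g in zip(camera, target))


def box_around(center, half_extents):
    """Axis-aligned bounding box (min corner, max corner) around a centre point."""
    box_min = tuple(c - h for c, h in zip(center, half_extents))
    box_max = tuple(c + h for c, h in zip(center, half_extents))
    return box_min, box_max


def segment_hits_box(p0, p1, box_min, box_max):
    """Slab test: does the segment p0 -> p1 intersect the axis-aligned box?"""
    t_enter, t_exit = 0.0, 1.0  # parameter range of the segment still inside all slabs
    for axis in range(3):
        d = p1[axis] - p0[axis]
        if abs(d) < 1e-12:
            # Segment is parallel to this slab: it must already lie inside it.
            if p0[axis] < box_min[axis] or p0[axis] > box_max[axis]:
                return False
            continue
        t0 = (box_min[axis] - p0[axis]) / d
        t1 = (box_max[axis] - p0[axis]) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_enter = max(t_enter, t0)
        t_exit = min(t_exit, t1)
        if t_enter > t_exit:
            return False
    return True


# Hypothetical scene: the target 4 m in front of the camera, occluder at the midpoint.
camera, target = (0.0, 1.5, 0.0), (0.0, 1.5, 4.0)
occluder_center = place_occluder(camera, target)
box_min, box_max = box_around(occluder_center, half_extents=(0.5, 1.0, 0.3))

front_blocked = segment_hits_box(camera, target, box_min, box_max)           # True
side_blocked = segment_hits_box((3.0, 1.5, 0.0), target, box_min, box_max)   # False
```

In this toy scene the same occluder hides the target from the front camera but not from a camera moved 3 m to the side, which is the kind of contrast between an occluded view and a clean view that the pipeline relies on.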

Step 3

Views & QA Generation

clean rendered view

Clean: answerable → ""

occluded rendered view

Occluded: abstain → "Cannot Determine"

Q:

Perspective Pipeline

Step 1

Selected Object Pair

🖼️ Selected pair: Two artworks

A pair of comparable objects (e.g. two artworks) is chosen in the 3D scene.

Step 2 · Task 1

QA: Clean vs. Ambiguous

clean equidistant view

Clean view:

perspective-distorted ambiguous view

Ambiguous view:

Q:
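
To see why a single view can make a size comparison unreliable, consider a minimal pinhole-camera sketch. This is only an illustration of perspective foreshortening, not the benchmark's renderer; the function name `projected_height` and the object sizes and depths are hypothetical.

```python
def projected_height(real_height, depth, focal_length=1.0):
    """Apparent height of an object under a pinhole camera model.

    Image size scales with real size and shrinks with distance from the camera.
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    return focal_length * real_height / depth


# Two hypothetical artworks: the second is twice as tall as the first.
small_height, large_height = 1.0, 2.0

# Ambiguous view: the larger artwork is twice as far away, so both look the same size.
ambiguous = (projected_height(small_height, depth=2.0),
             projected_height(large_height, depth=4.0))

# Equidistant view: both artworks are at the same depth, so apparent size
# preserves the true ordering.
equidistant = (projected_height(small_height, depth=3.0),
               projected_height(large_height, depth=3.0))

ambiguous_looks_equal = ambiguous[0] == ambiguous[1]
equidistant_ordering_ok = equidistant[0] < equidistant[1]
```

Under these assumptions, the ambiguous view gives no basis for deciding which artwork is larger, so abstaining is the appropriate response, whereas the equidistant view supports a definite answer.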

Step 3 · Task 2 & 3

View Selection

ViewSel — one-stage

Directly pick the view that best answers the question.

Candidate views 1-5 (✓ View 1 selected)

AbstainViewSel — two-stage

Stage 1: abstain → "Cannot Determine"

ambiguous view

Stage 2: which view can answer this question?

Candidate views: one resolving view and two extra views
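
The two viewpoint tasks can be scored along the following lines. This is a sketch under stated assumptions: the page does not give the exact scoring rule, so the joint criterion for the two-stage task (abstain first, then pick the resolving view) and the function names are hypothetical. The option counts used in the baseline call are also assumptions, chosen because they reproduce the random baselines reported for ViewSel and AbstainViewSel (20.0 and 4.0).

```python
ABSTAIN = "Cannot Determine"


def score_viewsel(predicted_view, resolving_view):
    """One-stage ViewSel: correct if the chosen view is the resolving view."""
    return predicted_view == resolving_view


def score_abstain_viewsel(stage1_answer, predicted_view, resolving_view):
    """Two-stage AbstainViewSel (assumed rule): credit only if the model
    abstains in stage 1 and then selects the resolving view in stage 2."""
    return stage1_answer == ABSTAIN and predicted_view == resolving_view


def random_baselines(num_answer_options, num_views):
    """Chance accuracy for uniform random guessing in each task."""
    one_stage = 1.0 / num_views
    two_stage = (1.0 / num_answer_options) * (1.0 / num_views)
    return one_stage, two_stage


# Assumed counts: 5 answer options (including the abstain option) and 5 candidate views.
one_stage, two_stage = random_baselines(num_answer_options=5, num_views=5)
```

With five options at each stage, chance accuracy is 20% for the one-stage task and about 4% for the two-stage task. Other option counts are possible; this is only one combination consistent with the reported baselines.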

What We Find

Frontier VLMs answer confidently but rarely recognize when a question cannot be answered: accuracy on unanswerable cases typically falls well below accuracy on answerable ones, and viewpoint selection is often near random.

Model               Occlusion               Perspective Ambiguity   Viewpoint
                    Ans.    Unans.  All     Ans.    Unans.  All     ViewS   AbsViewS
Random              32.3    23.3    30.0    25.0    25.0    25.0    20.0    4.0
Open-source
Qwen2.5-VL-7B       51.1    39.3    48.0    62.4    41.5    57.8    24.6    8.6
Qwen2.5-VL-32B      51.7    40.0    48.6    69.0    21.7    58.5    20.7    4.6
InternVL3-38B       61.7    7.3     47.5    70.4    1.1     55.1    18.5    0.0
Closed-source
GPT-4o              53.9    32.8    48.4    35.2    36.3    35.4    39.3    22.1
GPT-5-mini          64.7    7.8     49.9    76.1    15.2    62.2    53.7    18.0
GPT-5.4             58.2    19.5    48.1    69.5    22.6    59.2    70.9    22.6
Gemini-2.5-Flash    56.1    45.0    53.2    66.4    2.4     52.2    18.5    6.7
Gemini-3.0-Flash    61.7    44.1    57.1    64.0    6.3     51.3    50.3    2.4

Table 1. Performance under occlusion and perspective ambiguity. Ans. = accuracy on answerable questions; Unans. = accuracy on unanswerable questions, i.e. how often the model correctly abstains; All = accuracy over both. ViewS / AbsViewS = ViewSel and AbstainViewSel (viewpoint selection without and with the abstention stage, respectively). Bold = best in column.
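
The three accuracy columns can be computed from per-question records roughly as follows. This is a minimal sketch, not the authors' evaluation code: the `Record` type is hypothetical, the toy records are invented for illustration, and treating "All" as pooled accuracy over answerable and unanswerable questions together is an assumption (it is consistent with "All" lying between "Ans." and "Unans." in every row of the table).

```python
from dataclasses import dataclass

ABSTAIN = "Cannot Determine"


@dataclass
class Record:
    gold: str        # gold answer; ABSTAIN marks an unanswerable question
    prediction: str  # the model's answer


def accuracy(records):
    """Percentage of records whose prediction matches the gold answer."""
    if not records:
        return float("nan")
    return 100.0 * sum(r.prediction == r.gold for r in records) / len(records)


def split_accuracy(records):
    """Accuracy on answerable items, unanswerable items, and all items pooled."""
    answerable = [r for r in records if r.gold != ABSTAIN]
    unanswerable = [r for r in records if r.gold == ABSTAIN]
    return {
        "Ans.": accuracy(answerable),
        "Unans.": accuracy(unanswerable),
        "All": accuracy(records),
    }


# Toy records (made up): an overconfident model that never abstains.
records = [
    Record(gold="A", prediction="A"),
    Record(gold="B", prediction="B"),
    Record(gold="C", prediction="A"),
    Record(gold=ABSTAIN, prediction="A"),
]
scores = split_accuracy(records)
```

In this toy case the model gets two of three answerable questions right but scores zero on the unanswerable one, which mirrors the pattern of high "Ans." and low "Unans." seen for several models in the table.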

Figure 1. Model accuracy across question types under occlusion (top) and perspective ambiguity (bottom). Blue / orange backgrounds mark answerable vs. unanswerable conditions; dashed lines are the random baseline. Bold lines highlight the strongest closed-source models in each setting.

BibTeX

@article{zhang2025spatialuncertain,
  title   = {Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?},
  author  = {Zhang, Yue and Wang, Zun and Lin, Han and Bitton, Yonatan and Szpektor, Idan and Bansal, Mohit},
  journal = {arXiv preprint arXiv:2605.30557},
  year    = {2026},
}