LostBench — Ty Pham-Swann

Introduction

LostBench is built to assess and improve spatial-reasoning capabilities in frontier models by measuring performance on diverse, long-horizon spatial reasoning tasks. Each task has been completed by a human on the first attempt. The benchmark is far from saturated — the scores below are on the easiest tasks, in the easiest setting, at the shortest horizons. Powering the benchmark are scalable data pipelines that integrate open-source data and allow trivial creation of tasks with orders-of-magnitude longer time horizons. It's available as a Harbor-compatible repository at typhamswann/lostbench.

Get started

Contact phamswannty@gmail.com for the full RL environment — 6,324 real-world Street-View tasks, many with time horizons and difficulty orders of magnitude beyond the public benchmark.

Leaderboard

Twelve-task validation subset of the 57-task public benchmark, default-mode, haversine scoring (closer to goal = higher). Each model received the same observation channel: a 1024 × 768 viewport image, one-line HUD, sliding-window image history of 4. Per-task turn budget = 3 × the computed optimal path. Models ran three times per task in their native harnesses; see cross-harness analysis below.

Mean path-progress across the 12-task subset (1.0 = arrived at the goal). Bars show the human ceiling and each model in its respective harness. Error bars are 95% confidence intervals from the three seeds.

Overview

While frontier LLMs have made significant progress in coding, reasoning, and complex tasks in recent years, visual reasoning capabilities still lack robustness compared to other domains. One especially challenging subset of visual reasoning for current models is spatial reasoning, which typically requires understanding many parts of an image at a given time and maintaining that understanding across multiple scenes or images.

People excel at these types of tasks and use interfaces like Street View, Autodesk, and Matterport in everyday work quite naturally.

Due to cost constraints, a representative subset of twelve benchmark tasks was run to produce the performance graphed above. To fund a full assessment, access the full RL environment with thousands of long-horizon tasks, or hire me, contact phamswannty@gmail.com.

Custom interface with controllable primitives

LostBench varies task difficulty in two modes through the exposure of different primitives within the interface. By default, the agent has access to an image snapshot of their viewport (a still in the 360° panorama they are currently located in), an interactive map — which shows the starting point, endpoint, and their current position and direction — and a compass. In strict mode, the agent receives the same primitives except no compass, and the map lacks direction and location tracking — requiring higher levels of visual reasoning to use street signs and visual landmarks to orient itself.
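The difference between the two modes can be summarized as a small configuration sketch. The class and field names below are hypothetical, chosen only to mirror the primitives described above; they are not taken from the LostBench repository.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ModeConfig:
    """Which interface primitives a mode exposes (names are illustrative)."""
    viewport_image: bool   # still from the current 360° panorama
    map_start_end: bool    # map shows the starting point and endpoint
    map_position: bool     # map tracks the agent's current position
    map_direction: bool    # map shows the agent's current direction
    compass: bool          # compass widget


# Default mode exposes everything; strict mode drops the compass and
# the map's position and direction tracking, as described above.
DEFAULT_MODE = ModeConfig(True, True, True, True, True)
STRICT_MODE = ModeConfig(True, True, False, False, False)


def hidden_in_strict() -> list[str]:
    """Primitives available in default mode but withheld in strict mode."""
    return [
        f.name
        for f in fields(ModeConfig)
        if getattr(DEFAULT_MODE, f.name) and not getattr(STRICT_MODE, f.name)
    ]
```

Calling `hidden_in_strict()` lists exactly the three primitives an agent has to replace with street signs and visual landmarks.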

An agent interacts with the world through tools to move the mouse (indicating an angle and magnitude), click the mouse up or down, and zoom in or out. These tools correspond with zooming and panning on the panoramic world and map, depending on which view is open at any given time. The agent can click a valid location on the road to move to that panorama. When confident they have reached the end position, they submit a guess.
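As a rough illustration of a mouse-move tool that takes an angle and a magnitude, the sketch below converts that polar offset into a new cursor position. The angle convention (0° points right, 90° points up), the pixel units, and the clamping to the viewport are all assumptions; the only figure taken from this page is the 1024 × 768 viewport size.

```python
import math
from dataclasses import dataclass

# Viewport size quoted in the leaderboard setup.
VIEWPORT_W, VIEWPORT_H = 1024, 768


@dataclass(frozen=True)
class Cursor:
    x: float
    y: float


def move_mouse(cursor: Cursor, angle_deg: float, magnitude_px: float) -> Cursor:
    """Move the cursor by a polar offset and clamp it to the viewport.

    Assumed convention: 0 degrees points right, 90 degrees points up,
    and screen y grows downward (hence the minus sign on the sine term).
    """
    rad = math.radians(angle_deg)
    x = cursor.x + magnitude_px * math.cos(rad)
    y = cursor.y - magnitude_px * math.sin(rad)
    x = min(max(x, 0.0), VIEWPORT_W - 1)
    y = min(max(y, 0.0), VIEWPORT_H - 1)
    return Cursor(x, y)
```

For example, `move_mouse(Cursor(512, 384), 0, 100)` shifts the cursor 100 pixels to the right under these assumptions.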

Distance-based scoring

The agent can submit exactly one guess per task, which concludes the task. Their score is based on the haversine distance between their location and the endpoint:

pp = clip(1 − final_haversine_m / initial_haversine_m, 0, 1)

Higher scores mean the agent guessed when they were closer to the target. This continuous scoring allows greater granularity in assessing model capability than binary success.
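A minimal sketch of the formula above, assuming coordinates are given as (latitude, longitude) pairs in degrees and using the conventional mean Earth radius of 6,371 km. The radius constant and the function names are assumptions; the benchmark's own implementation may differ in detail.

```python
import math

# Mean Earth radius in metres (assumed; the benchmark's constant is not stated).
EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def path_progress(start, goal, final) -> float:
    """pp = clip(1 - final_haversine_m / initial_haversine_m, 0, 1)."""
    initial = haversine_m(*start, *goal)
    if initial == 0:
        raise ValueError("start and goal coincide; path-progress is undefined")
    remaining = haversine_m(*final, *goal)
    return min(max(1 - remaining / initial, 0.0), 1.0)
```

Submitting at the goal yields 1.0, submitting at the start yields 0.0, and the clip means that ending farther from the goal than the start also scores 0.0 rather than going negative.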

Each task has been tested by a real human, who was able to complete each one efficiently on their first attempt. In 48 of the 57 recorded sessions, the human guessed exactly on the endpoint (0.0 meters from target), and the overall mean distance from target was 1.50 meters. For a person, it is trivial to guess the exact location once they can access it.

Novel tasks from a custom-built graph

Each task is 100% distinct. While the panorama and geospatial data used to build the worlds are open source, they are integrated in a way that prevents contamination. Each task world is consistent between runs, but uses data from many different panoramic captures algorithmically bridged to create a continuous and coherent world. Start and end locations are picked based on the desired task difficulty and time horizon, enabling significant scaling and uniqueness even within a single geographic location.

Tasks in the public benchmark set span 50 cities across 57 tasks, ensuring diverse scene characteristics. The private RL environment spans 1,122 cities across 6,324 tasks, and there is scant overlap between any given pair of tasks within the same city due to the variety of length and size of each task area.

Play a task

Try it yourself — this is the same task the models were given: navigate from the start to the goal, then submit when you think you've arrived. You get one attempt, and you're scored by how close you get. Click the street to walk (the further toward the horizon you click, the further you travel); drag to look around; scroll to zoom. Open the map to see the start (green), the goal (red), and the area you must stay inside — your location is the blue dot, and the compass shows which way you're facing.

Note that the UI is choppy in the simulation to emulate what the model sees in the environment, as opposed to the human advantage of seeing continuous frames.

Methodology

World Development

Maps are constructed by carefully stitching together 360° panorama imagery sourced via Mapillary with geospatial data from OpenStreetMap. Each task has its own playable area which is entirely or almost entirely traversable. A single task is a route from a starting point to an endpoint, with the optimal routes all guaranteed to be traversable.

In selecting map locations, density and continuity were prioritized. All map areas have over 90% surface coverage. Maps with key gaps between panorama locations were filtered out of the benchmark and environment set. Human verification of each map was done to ensure complete traversability.

Quality Assurance

Each map within the benchmark set was played once to confirm all tasks are completable. Prior to that, the map-creation process went through numerous tests and iterations to ensure each was fully traversable and that actions produced intuitive results. A number of key issues were identified and fixed.

Most notable were issues in traversal. Each map is a graph of discrete data (panoramic images), so ensuring connections spanned the entire map area was a significant challenge that required substantial work on identifying gaps and aligning points properly with road data. Intersections proved another key challenge: aligning what is visible in the images with the road geography was difficult. The current logic is still imperfect and somewhat rigid in creating bridges between nodes at intersections, resulting in some clicks that fail to route properly. Each intersection does have at least one valid bridge, though, which the human tester was able to find in practice.
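One way to detect the kind of traversal gap described above is a breadth-first search over the panorama graph that reports any node the agent could never reach from the start. This is a generic sketch, not the pipeline's actual check; the adjacency-dictionary representation and the node names are hypothetical.

```python
from collections import deque


def unreachable_nodes(adjacency: dict[str, set[str]], start: str) -> set[str]:
    """Return the panorama nodes that cannot be reached from `start`.

    `adjacency` maps each node id to the set of nodes one click away.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return set(adjacency) - seen


# Toy map: p4 has no bridge to the rest of the graph.
toy_map = {
    "p1": {"p2"},
    "p2": {"p1", "p3"},
    "p3": {"p2"},
    "p4": set(),
}
gaps = unreachable_nodes(toy_map, "p1")
```

An empty result means every panorama in the task area is connected to the start; a non-empty result points at the nodes that still need a bridge.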

The scoring system was also modified during QA, switching from Dijkstra path distance to haversine distance because of the granularity of the in-game map and the instructions provided to the user. In testing, guesses that appeared very close to the goal scored poorly because of the graph structure, which was judged overly penalizing for the task.
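The toy example below illustrates why a graph-based distance can penalize a guess that looks close on the map. It uses planar coordinates in metres as a stand-in for haversine distance over a short span, and the U-shaped street layout and node names are invented for illustration.

```python
import heapq
import math


def dijkstra_m(edges: dict, source: str, target: str) -> float:
    """Shortest path length in metres over a weighted graph (Dijkstra)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        if d > dist.get(node, math.inf):
            continue  # stale heap entry
        for nxt, weight in edges.get(node, ()):
            nd = d + weight
            if nd < dist.get(nxt, math.inf):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return math.inf


# Hypothetical U-shaped street: the guess is 20 m from the goal in a
# straight line, but the only road between them loops around the block.
positions = {"goal": (0.0, 0.0), "a": (0.0, 100.0), "b": (20.0, 100.0), "guess": (20.0, 0.0)}
links = [("goal", "a"), ("a", "b"), ("b", "guess")]

edges: dict = {}
for u, v in links:
    w = math.dist(positions[u], positions[v])
    edges.setdefault(u, []).append((v, w))
    edges.setdefault(v, []).append((u, w))

straight_line_m = math.dist(positions["guess"], positions["goal"])
along_graph_m = dijkstra_m(edges, "guess", "goal")
```

Here the straight-line distance is 20 m while the path distance is 220 m, so a path-based score would treat a visually near-perfect guess as a large miss.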

Harness Selection & Cost Limitations

The benchmark is structured as a Harbor-compatible repository and ships with a simple custom completion-loop harness. The reported benchmark results, however, come from each model's own native harness. This choice was made for fairness, following the cross-harness analysis in the Robustness section.

The scores above represent performance on twelve selected tasks, split 4 easy / 4 medium / 4 hard. This was due to cost constraints; a run on all 57 tasks would provide a stronger baseline. Additionally, runs were conducted only in default mode, not strict. Small-scale tests indicate that frontier models remain incapable of completing all but the simplest tasks in strict mode. On all runs, turn budgets were set at 3 × the computed optimal path. Longer exploration may have increased performance, but cost constraints again proved a limiting factor.
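To make the turn budget concrete, here is a hedged sketch of an episode loop that combines it with the sliding-window image history of 4 from the leaderboard setup. It assumes the optimal path is measured in turns, that an episode simply ends without a guess when the budget runs out, and that `policy` and `observe` are caller-supplied callables; none of these details are confirmed by this page, and the function names are hypothetical.

```python
from collections import deque


def turn_budget(optimal_turns: int, multiplier: int = 3) -> int:
    """Turn budget as a multiple of the computed optimal path (assumed to be in turns)."""
    return multiplier * optimal_turns


def run_episode(policy, observe, optimal_turns: int, history_len: int = 4):
    """Run one episode; return the turn on which the agent submitted, or None.

    `observe(turn)` returns the current viewport observation and
    `policy(history)` returns an action string; "submit" ends the task.
    """
    history = deque(maxlen=history_len)  # only the most recent images are kept
    for turn in range(1, turn_budget(optimal_turns) + 1):
        history.append(observe(turn))
        if policy(list(history)) == "submit":
            return turn
    return None  # budget exhausted without a guess (handling is an assumption)
```

With an optimal path of 3 turns, the agent in this sketch gets 9 turns, and on any turn it sees at most the 4 most recent observations.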

Robustness

Cross-Harness Testing

Recent benchmark results have indicated that the harness used in an assessment can significantly affect model performance. As such, each model was tested on both its native harness and the simple completion-loop harness created for these tasks, across three randomly selected tasks, twice per harness.

Mean path-progress for each model on its native harness versus the custom completion loop, over the three-task cross-harness slice. Solid bar = native harness; lighter bar = custom loop; whiskers are 95% confidence intervals (frontier n=12, open models n=6–15).

The results indicate that for LostBench tasks, frontier models are quite robust across harnesses. In contrast, the open-source models appear to be quite fragile, with performance degrading substantially outside of their recommended scaffolding. Given the results of this analysis, models were run in the production benchmark within their own native harnesses. GLM was included in this analysis but ultimately struck from the full benchmark due to endpoint instability.

Reasoning Level

Models were also tested at low and high reasoning levels, twice per harness, across the same three tasks.

Mean path-progress at low versus high reasoning effort on the native harness, over the same three-task slice. Solid bar = low effort; lighter bar = high effort; whiskers are 95% confidence intervals (n=6). The intervals overlap heavily — the reasoning effect sits within the noise.

Reasoning level did not appear to have a significant impact on performance in these tasks. Although the differences were minimal and below the noise threshold, all models did score higher at high reasoning effort on their native harnesses, with mixed results on the custom completion loop, potentially indicating that the benefits of higher reasoning are tied to the harness the model was trained on. Further exploration of this question would be interesting.
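With only a handful of runs per bar, a 95% interval is wide, which is why small differences sit within the noise. The sketch below shows one common way to compute such an interval, a Student's t interval on the mean, and to check whether two intervals overlap. The page does not state which interval method was used, so treat this as an assumption; the critical values are the standard two-sided 95% t values for 1 to 5 degrees of freedom.

```python
import math
import statistics

# Two-sided 95% Student's t critical values, keyed by degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}


def mean_ci95(scores: list[float]) -> tuple[float, float, float]:
    """Return (mean, lower, upper) for a t-based 95% confidence interval."""
    df = len(scores) - 1
    if df not in T_CRIT_95:
        raise ValueError("this sketch only covers 2 to 6 samples")
    mean = statistics.fmean(scores)
    half_width = T_CRIT_95[df] * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - half_width, mean + half_width


def intervals_overlap(a: tuple, b: tuple) -> bool:
    """True if two (mean, lower, upper) intervals share any values."""
    return a[1] <= b[2] and b[1] <= a[2]
```

Because path-progress is bounded between 0 and 1, an interval computed this way can extend past those bounds for small samples; clipping it for display is a presentation choice rather than part of the method.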

Contamination Testing

Models were tested for contamination with a simple probe: each model was shown panoramas from different tasks and asked to identify the exact coordinates.

Model	Median error	Best guess	Within 50 km
GLM-5V-Turbo	15.8 km	3.21 km	67%
Claude Opus 4.8	19.0 km	1.16 km	67%
GPT-5.5	169 km	0.94 km	45%
Qwen3.7-Plus	219 km	1.80 km	50%

Geolocation error when each model is shown a clean panorama (no map, no coordinates) and asked for the exact coordinates, over 12 probe panoramas (GPT-5.5 returned 11). Median error is tens-to-hundreds of kilometres — city/region level at best, never the exact location. Gemini 3.1 Pro returned no successful geolocations and is omitted.

These results are consistent with other benchmarks' findings that frontier models are adept at identifying general locations from photos. However, the imprecision of the guesses offers some evidence that the models were less likely to have been trained on the exact panorama and location data found in the benchmark. More importantly, the task itself is fairly robust to contamination: models must navigate to reach a final location rather than simply submit a guess, so it assesses spatial reasoning even if they have seen the locations before.
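The three columns in the table above can be derived from a list of per-panorama errors, as in the sketch below. The function name is hypothetical, and treating "within 50 km" as an inclusive threshold over only the successfully returned guesses is an assumption.

```python
import statistics


def probe_summary(errors_km: list[float], threshold_km: float = 50.0) -> dict:
    """Summarize geolocation errors (km) for one model's returned guesses."""
    if not errors_km:
        raise ValueError("no successful geolocations to summarize")
    return {
        "median_km": statistics.median(errors_km),
        "best_km": min(errors_km),
        # Share of guesses at or under the threshold (inclusive by assumption).
        "within_threshold": sum(e <= threshold_km for e in errors_km) / len(errors_km),
    }


# Illustrative errors only; these are not the benchmark's measurements.
example = probe_summary([1.0, 10.0, 100.0])
```

Under this reading, a model that returned 11 guesses with 5 of them within 50 km would be reported at 45%.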

Reproducibility

Every run — model, harness, provider route, reasoning level, and temperature — is recorded below, with the full per-rollout trajectory, transcript, and config captured in the run manifest. The provider route in particular differs by harness: the production benchmark drives each model through its native CLI on that vendor's own route, while the custom completion loop calls each provider's API directly (Anthropic via AWS Bedrock).

Production benchmark — native harnesses (12 tasks × 3 seeds)

Model	Harness	Provider route	Reasoning	Temperature
Claude Opus 4.8	Claude Code	Claude Code	high	1.0
Claude Opus 4.7	Claude Code	Claude Code	high	1.0
Claude Sonnet 4.6	Claude Code	Claude Code	high	1.0
GPT-5.5	Codex	Codex	high	1.0
Gemini 3.1 Pro	Antigravity	Antigravity	high	1.0
Gemini 3.5 Flash	Antigravity	Antigravity	high	1.0
Qwen3.7-Plus	Qwen Code	OpenRouter — `qwen/qwen3.7-plus`	n/a	1.0

Robustness studies — custom completion loop (cross-harness & reasoning, 3 tasks)

Model	Harness	Provider route	Reasoning	Temperature
Claude Opus 4.8	Custom completion loop	AWS Bedrock	low & high	1.0
Claude Opus 4.7	Custom completion loop	AWS Bedrock	low & high	1.0
Claude Sonnet 4.6	Custom completion loop	AWS Bedrock	low & high	1.0
GPT-5.5	Custom completion loop	OpenAI (direct)	low & high	1.0
Gemini 3.1 Pro	Custom completion loop	Google (Gemini API)	low & high	1.0
Gemini 3.5 Flash	Custom completion loop	Google (Gemini API)	low & high	1.0
Qwen3.7-Plus	Custom completion loop	OpenRouter — `qwen/qwen3.7-plus`	n/a	1.0
GLM-5V-Turbo	Custom completion loop	OpenRouter — `z-ai/glm-5v-turbo`	n/a	1.0

No run overrode temperature, so every run used the provider default of 1.0 (the value reasoning and extended-thinking modes require in any case). Reasoning is the effort level passed to the harness — high in production; Qwen Code exposes no reasoning knob — while the custom loop swept low and high for the reasoning study. GLM-5V-Turbo appears only in the studies; it was struck from the production benchmark for endpoint instability.

Full Benchmark Tasks Assessed (12): cand_0046_national_easy_02, cand_0060_national2_easy_01, cand_0118_national2_easy_02, cand_0233_national_easy_02, cand_0030_national2_medium_02, cand_0071_national_medium_01, cand_0182_national2_medium_01, cand_0279_national2_medium_02, cand_0046_national_hard_01, cand_0161_national2_hard_02, cand_0196_national_hard_02, cand_0233_national_hard_02

Robustness Testing Tasks Assessed (3): cell_new_00236_easy_02, cand_0030_national2_medium_02, cand_0196_national_hard_02

Contamination Testing Panoramas Assessed (12): cand_0030_national2_medium_02, cand_0071_national_medium_01, cand_0196_national_hard_02, cand_0279_national2_medium_02, cand_0362_national2_hard_02, cand_0465_national_medium_01, cand_0597_national2_easy_01, cand_0637_national2_easy_01, cell_new_00134_medium_02, cell_new_00198_medium_01, cell_new_00306_easy_02, cell_new_00357_medium_02

Manifest: manifest.json

Results

Mean path-progress across the 12-task subset (1.0 = arrived at the goal). Bars show the human ceiling and each model in its respective harness. Error bars are 95% confidence intervals from the three seeds.

Claude Opus 4.8 proved to be the most successful by far. As noted above, it was run using the Claude Code harness. In limited tests, all models failed to achieve any successful guesses in strict mode, which prevents use of the compass and hides the agent's location and direction on the map. Tasks in this setting, which humans are also able to solve in testing, are likely more difficult for frontier agents due to higher visual and spatial reasoning requirements. Agents in strict mode cannot track themselves on the map — a UI task that's likely well represented in their training distribution — and instead must find visual landmarks and street signs to orient.

Turn efficiency

Turns taken per task, as a multiple of the human tester. The dashed line marks human pace (1×). Turn counts span 0.7–2.0× the human's; the more accurate models are the more economical.

Turn counts ranged from 0.74× the human's (GPT-5.5) to 1.98× (Qwen3.7-Plus), averaging 1.24× across the agents. The pattern is that the more accurate models are also the more turn-economical: GPT-5.5 and Claude Opus 4.8 actually finished in fewer turns than the human tester (0.74× and 0.94×), while the weakest model wandered for nearly twice as many. That said, the lowest turn counts partly reflect earlier guessing rather than pure efficiency — the human spent more turns but closed to within a meter, whereas the faster models committed to a guess sooner and less accurately. On the hardest tasks several models were still guessing on one of their final allowed turns, frequently travelling far down a wrong path before correcting.

Discussion

Frontier models still have a significant way to go in spatial reasoning, where capabilities lag in comparison to other reasoning domains like coding. Claude Opus 4.8 using the Claude Code harness was the most accurate by far, but still wasn't perfectly accurate on the easier assessment set, and was meaningfully less efficient than the human tester. Further, isolated tests indicated that all harnesses struggled to complete tasks in strict mode, which requires greater spatial reasoning across longer time horizons.

It's clear the frontier models benefited significantly from more information on the map, which is likely more closely aligned with the UI visual reasoning tasks they're familiar with from training. The lags in efficiency and overall performance indicate that models still lack the visual intuition that's innate in people, and that there is significant progress still to be made in long-horizon visual reasoning.

Future Work

This project was significantly budget-constrained, which prevented running the full 57 tasks, running the benchmark in strict mode, and running it with higher turn limits (or without any cap). The full RL environment contains many more diverse tasks with higher difficulty and longer time horizons, and the data pipeline allows trivially creating tasks with orders-of-magnitude longer time horizons. As frontier models improve, it will be interesting to track their progress on the benchmark.

To fund a full assessment, access the full RL environment with thousands of long-horizon tasks, or hire me, contact phamswannty@gmail.com.