Post-training a 27B model to beat frontier models on scientific data tasks

Scientific Data Tasks

CactusBench is built to assess the ability of frontier models to complete complex quality-assurance tasks with a very high precision bar on real scientific data collected by biologists at Saguaro National Park. These tasks involve complex visual reasoning, accurate OCR of messy handwriting, analyzing field photographs, and identifying mistakes and anomalies in the data, in order to carry out the work of a biologist maintaining the real cactus dataset.

Specifically, a model must accurately transcribe measurements from two handwritten datasheets, then reason about how the arms should be mapped across years using their measurements and identify any anomalous measurement errors and correct them.

0 / 10

Methods

Due to the shape and complexity of the task, naive fine-tuning and RL efforts failed to improve performance. CactusBench tasks are limited by both perception (in this case, the ability to read what a field biologist wrote down) and visual reasoning (the ability to match data across years based on measurements). In practice, training these together would degrade performance, so instead an approach was taken to isolate each capability by training in two stages:

Start Qwen3.6-27B open base model

Stage 1 · RLVR Perception isolated transcription, per-cell reward (GRPO)

Stage 2 · SFT Reasoning & format verified agentic trajectories: transcribe‑then‑match

Each capability is trained in isolation: RLVR sharpens perception on a pure transcription task, then SFT installs the in-loop transcribe-then-reason behavior that elicits it on the agentic task.

To target perception, Qwen3.6-27B was trained with RLVR on a separately constructed task involving just reading and transcribing measurement sheets. Then, to elicit this perception boost within the task and improve reasoning, the RL'd model was fine-tuned on a custom-built dataset of real output trajectories sampled from a deterministically assembled base with corrections to reasoning and answers by a frontier model. Together, this produced a model that was better at both visual perception and visual reasoning within the agentic workflow on CactusBench tasks.

Improving perception with RLVR

The first stage of the post-training process involved improving the model's visual perception, where most of the errors on the evaluation originated from, via RLVR on a separate transcription task. The dataset was 90 tasks assembled by taking the existing sheets and ground truth and asking the model to only transcribe the data sheet (no arm matching or error correction like in the original task). This was in a simple transcription prompt instead of within the agentic harness and trained via GRPO, with per-cell accuracy used as the reward. A total of 200 steps with eight rollouts per task were run with tasks with low variance rollouts filtered out.

Training vastly improved the model's visual perception on the isolated transcription task but didn't yield gains on overall performance on the agentic task. This was somewhat expected given the differences in the harness and added complexity of the multi-turn rollouts of CactusBench tasks. A closer analysis on another RL'd model on the tasks gives greater insight into how perception performance gains were lost as the task and harness were incrementally transformed into the full agentic version. This analysis ultimately motivated the subsequent SFT rollout in order to best elicit the model's improved perception capabilities from the RL run.

Supervised Fine-Tuning

From the above analysis, it seemed clear that separating perception from reasoning would aid in reducing transcription errors by allowing the model to transcribe early in the context window where the task was most similar to the pure transcription task it was trained on rather than throughout the rollout. This would also potentially allow reasoning to be more coherent by letting the model trust the initial transcript and focus on the arm matching task later in the context window of the agentic loop. As such, the goals of the supervised fine-tuning were to teach the model the behaviors of transcribing directly after viewing images of the sheets and then doing a separate explicit reasoning step to match arms between years and identify errors.

Where the perception gain leaks away. The same RLVR’d Qwen3.5-4B reads a clean single sheet at 0.994, but accuracy falls as each piece of the agentic format is added — and most of the loss comes from the tool/output format (−0.060) and the multi-turn loop (−0.113), not from reading itself. Isolating perception inside the real prompt is what the SFT sets out to fix.

To carry out these goals, a custom-built dataset was created that demonstrated the desired behaviors. First, a base prompt was built containing the system prompt and a template built from typical rollout behavior of the first few actions where the two datasheets were viewed. A response was appended with the ground truth directly after each tool-call to view each datasheet. Additionally, a line was appended to the system prompt telling it to explicitly reason about the arm mapping.

SYSTEMYou are an autonomous agent working inside a Unix workspace at /workspace/. You complete the task in the first user message by calling the provided tools. Issue exactly ONE tool call per turn. Inspect the datasheets and photos, then call write_submission once with the final cleaned table as a JSON array string. After you have viewed and transcribed both datasheets, reason explicitly about how each sheet’s paper arm numbers map to the canonical numbers: the 2023 numbers are canonical; match each 2026 arm to the 2023 arm at the same bearing; any arm with no prior-year match is appended as new.steering line appended to the system promptUSER# Curate this saguaro — two biologists measured it (2023 and 2026). Produce one cleaned spreadsheet: arms matched across years, canonical arm numbers, every measurement re-keyed into the canonical schema…ASSISTANT<function=list_dir> path: /workspace/datasheetssheet_A.png sheet_B.pngASSISTANT<function=view_image> path: /workspace/datasheets/sheet_A.pngImage /workspace/datasheets/sheet_A.png (1280×989): [image]ASSISTANTI have viewed datasheets/sheet_A.png. Transcribing its arm rows top-to-bottom, exactly as written — paper arm #, direction, then A–E — before I reason:
[{"arm":"1","direction":40,"A":2.26,"B":0.98,"C":3.13,"D":0.99,"E":0.6}, {"arm":"2","direction":90,"A":2.24,"B":0.98,"C":2.95,"D":1.02,"E":0.5}, {"arm":"3","direction":180,"A":2.26,"B":0.96,"C":2.82,"D":0.98,"E":0.5}, {"arm":"4","direction":330,"A":2.21,"B":0.93,"C":3.24,"D":0.915,"E":0.6}]ground-truth transcription appended right after the view<function=view_image> path: /workspace/datasheets/sheet_B.pngImage /workspace/datasheets/sheet_B.png (1280×989): [image]ASSISTANTI have viewed datasheets/sheet_B.png. Transcribing its arm rows top-to-bottom, exactly as written — paper arm #, direction, then A–E — before I reason:
[{"arm":"1","direction":50,"A":2.28,"B":1.07,"C":3.41,"D":1.07,"E":0.6}, {"arm":"2","direction":77,"A":2.23,"B":1.04,"C":3.12,"D":1.09,"E":0.6}, {"arm":"3","direction":182,"A":2.25,"B":1.03,"C":2.97,"D":1.04,"E":0.5}, {"arm":"4","direction":336,"A":2.33,"B":0.97,"C":3.44,"D":1,"E":0.6}]appended transcription<function=list_dir> path: /workspace/photosASSISTANTNow reasoning about the arm matching. 2023 (sheet_A): arm 1 40°, arm 2 90°, arm 3 180°, arm 4 330°. 2026 (sheet_B): arm 1 50°, arm 2 77°, arm 3 182°, arm 4 336°.
Matching by bearing: 2026 arm 1 (50°) → 2023 arm 1 (40°) = canonical 1; arm 2 (77°) → (90°) = 2; arm 3 (182°) → (180°) = 3; arm 4 (336°) → (330°) = 4. All four match — no new arms.<function=write_submission> [{"saguaro_id":"15-101","year":2023,"arm":"1","direction":40,"A":2.26,"B":0.98,"C":3.13,"D":0.99,"E":0.6,"note":""}, … 8 rows …]

One SFT target, built from a real rollout (saguaro 15-101). Highlighted spans are injected: the steering line, and the correct per-arm transcription appended right after each view_image. The model learns to transcribe each sheet, then reason the cross-year match, under the real prompt.

The model was then run within the normal agentic harness from this base across multiple rollouts. If the task had a successful rollout, it was saved as-is to the fine-tuning set. If it had errors, GPT-5.5 was provided the full rollout and ground truth and was prompted to respond with a corrected rollout with modified responses and reasoning traces to add to the set. The model was then trained on the 90-task set across three epochs.

Harness

The harness used was a modified version of the standard CactusBench completion loop. It was specifically adapted to handle failures related to how the Qwen-3.6 family of models handles tool calling. Frontier baselines were established by running them in their respective native harnesses. For more information, read CactusBench.

Results

Qwen3.6-27B (this work) frontier

Mean per-cell table accuracy on the held-out set (6 tasks × 3 rollouts; notes excluded). The post-trained open 27B (RLVR + SFT) lands second, ahead of Gemini 3.1 Pro and both Claude Opus models, behind only GPT-5.5. X-axis starts at 0.90 to make the tight top-cluster differences legible. Small-sample set — differences within the top cluster are not statistically separated.

The RLVR + SFT model performed better than both the base Qwen3.6-27B and a number of frontier models on the holdout set (six tasks, three rollouts each). Analysis of the error types across the models show that the SFT run lowered both perception errors and arm matching (reasoning) errors significantly, even though it was only targeting reasoning. Essentially, the format change was required to unlock the gains in perception gained from reinforcement learning.