When you stop a recording, Claude Scope does more than save a video. It runs a multi-stage pipeline that extracts meaningful UI states, inspects your app’s accessibility structure, and merges everything into a single, structured system prompt. Understanding this pipeline helps you get better prompts and debug unexpected results.

The pipeline

1

Recording

The browser’s MediaRecorder API captures your selected tab as a WebM video stream (VP9 codec when available, with a fallback to baseline WebM). Recording is chunked every second and assembled into a single blob when you stop. The default auto-stop limit is 30 seconds, which is enough to capture most UI interaction sequences without producing a video that is expensive to process. The video is uploaded to Claude Scope for processing; the raw video file is used only for frame extraction and discarded immediately afterward — it is never stored permanently.
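The chunk-and-assemble behavior described above can be sketched as follows. This is an illustrative model only — the real recorder runs in the browser via the MediaRecorder API, and the class and constant names here are hypothetical:

```python
# Sketch of the recording loop: 1-second chunks, 30-second auto-stop.
# Hypothetical names; the real tool uses the browser's MediaRecorder API.

CHUNK_INTERVAL_S = 1    # MediaRecorder timeslice: one chunk per second
AUTO_STOP_LIMIT_S = 30  # default auto-stop limit

class ChunkedRecorder:
    def __init__(self):
        self.chunks = []   # one entry per 1-second chunk
        self.elapsed = 0

    def on_chunk(self, data: bytes) -> bool:
        """Append a chunk; return False once the auto-stop limit is hit."""
        self.chunks.append(data)
        self.elapsed += CHUNK_INTERVAL_S
        return self.elapsed < AUTO_STOP_LIMIT_S

    def stop(self) -> bytes:
        """Assemble all buffered chunks into a single blob."""
        return b"".join(self.chunks)

recorder = ChunkedRecorder()
keep_going = True
second = 0
while keep_going:
    keep_going = recorder.on_chunk(f"chunk{second}".encode())
    second += 1

video_blob = recorder.stop()
print(len(recorder.chunks))  # 30 chunks: auto-stop fired at the 30-second limit
```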
2

Frame extraction

Rather than analyzing every video frame, Claude Scope uses SSIM-based frame differencing to extract only the frames where the UI meaningfully changed. SSIM (Structural Similarity Index Measure) compares the structural content of consecutive frames and discards any frame that is too similar to the previous one. This keeps the number of frames small, which reduces Vision API cost and produces a cleaner, more readable timeline in the output prompt.
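The idea can be illustrated with a simplified, global SSIM over flat grayscale frames. This is a sketch only: a production pipeline would typically use a windowed SSIM from an image library and tuned thresholds:

```python
# Simplified global SSIM over flat grayscale frames (pixel values 0-255).
# Illustrative sketch of SSIM-based frame differencing, not the real implementation.

def ssim(a, b, c1=6.5025, c2=58.5225):
    """Global SSIM between two equal-length grayscale pixel lists."""
    n = len(a)
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((x - mu_b) ** 2 for x in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )

def extract_keyframes(frames, threshold=0.98):
    """Keep a frame only if it differs enough from the last kept frame."""
    kept = [frames[0]]  # always keep the first frame
    for frame in frames[1:]:
        if ssim(kept[-1], frame) < threshold:
            kept.append(frame)
    return kept

# Two identical dark frames, then a frame where the UI changed dramatically:
frames = [[0] * 16, [0] * 16, [255] * 16]
print(len(extract_keyframes(frames)))  # 2 — the duplicate frame is discarded
```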
3

Vision lane

Each extracted frame is sent to Anthropic’s Vision AI (the model configured in your Model Access settings). For every frame, the Vision lane identifies:
  • Buttons — interactive controls and their labels
  • Inputs — text fields, checkboxes, selects, and their current state
  • Headings — structural landmarks and page hierarchy
  • Links — navigation targets and their text
  • Other elements — any additional components the model identifies
The results are assembled into a visual timeline that describes how your UI changed across the recording.
The Vision lane requires a valid Anthropic API key with access to a vision-capable model. If the key is missing or all frames fail analysis, processing stops with an error. Configure your key in Model Access settings.
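Assembling per-frame results into a visual timeline can be sketched as below. The dict shape and field names are illustrative assumptions, not the actual Vision API schema:

```python
# Sketch of assembling the visual timeline from per-frame Vision results.
# The result shape here is a hypothetical simplification.

def build_timeline(frame_results):
    """frame_results: one dict per frame, e.g.
    {"buttons": [...], "inputs": [...], "headings": [...], "links": [...]}"""
    timeline = []
    for i, result in enumerate(frame_results):
        summary = ", ".join(
            f"{len(items)} {kind}" for kind, items in result.items() if items
        )
        timeline.append(f"Frame {i}: {summary or 'no elements detected'}")
    return timeline

frames = [
    {"buttons": ["Save"], "inputs": ["email"], "headings": [], "links": []},
    {"buttons": ["Save", "Cancel"], "inputs": ["email"], "headings": ["Settings"], "links": []},
]
for line in build_timeline(frames):
    print(line)
```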
4

Playwright lane

In parallel with the Vision lane, a headless Playwright browser loads your seed URL and captures a full ARIA accessibility snapshot. This snapshot includes:
  • Every interactive element by ARIA role (button, textbox, link, checkbox, etc.)
  • Accessible names and labels
  • Counts of each element type on the page
  • The full accessible name tree for the loaded DOM
Unlike the Vision lane, which works from pixels, the Playwright lane works from the actual DOM. This gives Claude Scope ground truth about what elements exist, how they are named for assistive technology, and how they are structured — independent of how they look.
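Tallying element counts from an accessibility tree can be sketched as a simple recursive walk. The nested-dict shape here is an assumption for illustration; Playwright’s real snapshot format differs:

```python
# Sketch of counting element types from an ARIA snapshot tree.
# The nested-dict node shape is a hypothetical simplification.
from collections import Counter

def count_roles(node, counts=None):
    """Recursively tally ARIA roles across the accessibility tree."""
    counts = counts if counts is not None else Counter()
    counts[node["role"]] += 1
    for child in node.get("children", []):
        count_roles(child, counts)
    return counts

snapshot = {
    "role": "main",
    "children": [
        {"role": "heading", "name": "Sign in"},
        {"role": "textbox", "name": "Email"},
        {"role": "textbox", "name": "Password"},
        {"role": "button", "name": "Submit"},
    ],
}
print(count_roles(snapshot))  # Counter with 2 textboxes, 1 button, 1 heading, 1 main
```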
5

Synthesis

Once both lanes complete, the Synthesis stage merges the visual timeline and the Playwright accessibility snapshot into a single structured system prompt. The format of this prompt depends on the agent target you selected:
  • Claude Code — Full system prompt with an inline ARIA tree, screenshot bundle references, and a visual state changelog
  • Codex — Compact, diff-focused prompt optimized for GPT-4o completions
  • Cursor — Formatted for the Cursor composer’s context window
  • Raw — Unformatted merged output for use in any other tool
If you want a prompt for a different agent than you originally selected, you can change the target before copying. The analysis does not re-run — only the formatting changes.
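That separation — analyze once, format per target — can be sketched as a dispatch over formatters. The template strings below are illustrative placeholders, not the actual prompt templates:

```python
# Sketch of target-specific formatting in the Synthesis stage: the merged
# analysis is built once, and only the formatter differs per agent target.
# Template contents are illustrative, not the real prompt formats.

def synthesize(timeline, aria_tree, target="claude-code"):
    merged = {"timeline": timeline, "aria": aria_tree}  # computed once
    formatters = {
        "claude-code": lambda m: f"# UI Context\n{m['aria']}\n## Changes\n{m['timeline']}",
        "codex":       lambda m: f"Diff summary: {m['timeline']}",
        "cursor":      lambda m: f"<context>{m['aria']}</context>\n{m['timeline']}",
        "raw":         lambda m: f"{m['aria']}\n{m['timeline']}",
    }
    return formatters[target](merged)

# Switching targets only swaps the formatter; the analysis is not re-run.
print(synthesize("Frame 0: 1 button", "button 'Save'", target="raw"))
```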
6

Output

The final prompt is stored alongside the session and displayed in the recording review view. It includes:
  • A visual timeline summarizing UI changes frame-by-frame
  • An inline ARIA tree from the Playwright snapshot
  • A raw DOM diff comparing element counts between frames
  • An optional screenshot bundle (base64-encoded frame thumbnails)
Copy the prompt and paste it into your AI coding agent to start debugging.

Vision lane vs. Playwright lane

The two analysis lanes are complementary, not redundant. Each contributes something the other cannot.
  • Data source: the Vision lane works from pixel-level frames of your recording; the Playwright lane works from the live DOM loaded in a headless browser.
  • What it captures: visual appearance, UI states over time, and element labels as rendered (Vision) vs. ARIA roles, accessible names, and structural element counts (Playwright).
  • Temporal coverage: every extracted frame across the full recording (Vision) vs. a single snapshot of the seed URL at inspection time (Playwright).
  • Animations and transitions: the Vision lane captures intermediate states; the Playwright snapshot is taken after page load and does not.
  • API key: the Vision lane requires an Anthropic API key; the Playwright lane requires none.
  • SPAs and dynamic content: the Vision lane covers whatever states the recording includes; the Playwright lane only sees what has rendered before the snapshot is taken.
Both lanes are required for processing. If either lane fails, the pipeline stops and reports an error attributed to the failing lane.

How frames are stored

After synthesis, each extracted frame is saved alongside its Vision analysis results and a diff summary. The diff summary counts elements added and removed relative to the previous frame, which is how the timeline shows you what changed at each step. The original video file is deleted from temporary storage after extraction. Only the extracted frames (as base64 PNG thumbnails) and their analysis metadata are stored.
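The diff summary amounts to a set difference over the elements detected in consecutive frames. The sketch below simplifies elements to label strings; the real diff likely keys on richer element descriptors:

```python
# Sketch of the per-frame diff summary: elements added and removed
# relative to the previous frame. Elements simplified to label strings.

def diff_summary(prev_elements, curr_elements):
    prev, curr = set(prev_elements), set(curr_elements)
    return {"added": sorted(curr - prev), "removed": sorted(prev - curr)}

print(diff_summary(["Save"], ["Save", "Cancel", "Toast: saved"]))
# {'added': ['Cancel', 'Toast: saved'], 'removed': []}
```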