The pipeline
Recording
The browser’s MediaRecorder API captures your selected tab as a WebM video stream (VP9 codec when available, with a fallback to baseline WebM). Recording is chunked every second and assembled into a single blob when you stop. The default auto-stop limit is 30 seconds, which is enough to capture most UI interaction sequences without producing a video that is expensive to process.

The video is uploaded to Claude Scope for processing. The raw video file is used only for frame extraction and discarded immediately afterward — it is never stored permanently.

Frame extraction
Rather than analyzing every video frame, Claude Scope uses SSIM-based frame differencing to extract only the frames where the UI meaningfully changed. SSIM (Structural Similarity Index Measure) compares the structural content of consecutive frames and discards frames that are too similar to the previous one.

This keeps the number of frames small, which reduces Vision API cost and produces a cleaner, more readable timeline in the output prompt.
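The differencing step can be sketched in Python. This is a minimal global-SSIM computation over flat grayscale frames — an illustrative sketch, not Claude Scope's actual implementation — and the 0.95 threshold is an example value:

```python
def ssim(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Global SSIM between two equal-length grayscale pixel sequences (0-255)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((p - mx) ** 2 for p in x) / n
    vy = sum((p - my) ** 2 for p in y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def extract_key_frames(frames, threshold=0.95):
    """Keep a frame only if its SSIM against the last kept frame drops
    below the threshold, i.e. the UI meaningfully changed."""
    kept = [frames[0]]
    for frame in frames[1:]:
        if ssim(kept[-1], frame) < threshold:
            kept.append(frame)
    return kept
```

Identical frames score 1.0 and are dropped; a frame that differs sharply from the last kept one scores low and is retained for the timeline.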
Vision lane
Each extracted frame is sent to Anthropic’s Vision AI (the model configured in your Model Access settings). For every frame, the Vision lane identifies:
- Buttons — interactive controls and their labels
- Inputs — text fields, checkboxes, selects, and their current state
- Headings — structural landmarks and page hierarchy
- Links — navigation targets and their text
- Other elements — any additional components the model identifies
The Vision lane requires a valid Anthropic API key with access to a vision-capable model. If the key is missing or all frames fail analysis, processing stops with an error. Configure your key in Model Access settings.
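A per-frame request to the Anthropic Messages API can be assembled along these lines. The model name below is a placeholder (the real one comes from your Model Access settings) and the prompt wording is illustrative:

```python
import base64

def build_frame_request(frame_png: bytes, model="claude-3-5-sonnet-latest") -> dict:
    """Build an Anthropic Messages API request body asking the model to
    inventory the UI elements visible in one extracted frame."""
    return {
        "model": model,
        "max_tokens": 1024,
        "messages": [{
            "role": "user",
            "content": [
                # Image content block: the frame, base64-encoded.
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/png",
                            "data": base64.b64encode(frame_png).decode("ascii")}},
                # Text content block: the per-frame analysis instruction.
                {"type": "text",
                 "text": "List the buttons, inputs, headings, links, and other "
                         "elements visible in this frame, with their labels."},
            ],
        }],
    }
```

Each extracted frame gets its own request, so a recording that produces few key frames is proportionally cheaper to analyze.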
Playwright lane
In parallel with the Vision lane, a Playwright-driven headless browser loads your seed URL and captures a full ARIA accessibility snapshot. This snapshot includes:
- Every interactive element by ARIA role (button, textbox, link, checkbox, etc.)
- Accessible names and labels
- Counts of each element type on the page
- The full accessible name tree for the loaded DOM
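The element-type counts can be derived from the snapshot tree with a small recursive tally. The node shape here (role / name / children) mirrors a Playwright accessibility snapshot; the function itself is an illustrative sketch rather than Claude Scope's actual code:

```python
from collections import Counter

def count_roles(node, counts=None):
    """Tally how many elements of each ARIA role appear in an accessibility
    snapshot tree of nodes shaped like {"role": ..., "name": ..., "children": [...]}."""
    counts = counts if counts is not None else Counter()
    counts[node["role"]] += 1
    for child in node.get("children", []):
        count_roles(child, counts)
    return counts
```

For example, a page with two buttons and one textbox yields `{"button": 2, "textbox": 1}` alongside the container roles.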
Synthesis
Once both lanes complete, the Synthesis stage merges the visual timeline and the Playwright accessibility snapshot into a single structured system prompt. The format of this prompt depends on the agent target you selected:
- Claude Code — Full system prompt with an inline ARIA tree, screenshot bundle references, and a visual state changelog
- Codex — Compact, diff-focused prompt optimized for GPT-4o completions
- Cursor — Formatted for the Cursor composer’s context window
- Raw — Unformatted merged output for use in any other tool
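The merge step can be pictured as a dispatch on the agent target. The section labels, tags, and target keys below are illustrative, not the exact formats Claude Scope emits:

```python
def synthesize(target: str, timeline: list, aria_tree: str) -> str:
    """Merge the visual timeline and ARIA snapshot into one prompt,
    shaped per agent target (formats here are illustrative)."""
    merged = "\n".join(timeline) + "\n\n" + aria_tree
    if target == "claude-code":
        # Full prompt: visual state changelog plus inline ARIA tree.
        return ("# UI context\n\n## Visual state changelog\n"
                + "\n".join(timeline)
                + "\n\n## ARIA tree\n" + aria_tree)
    if target == "codex":
        # Compact, diff-focused: frame-to-frame changes only.
        return "\n".join(timeline)
    if target == "cursor":
        # Wrapped for the composer's context window.
        return "<context>\n" + merged + "\n</context>"
    return merged  # "raw": unformatted merged output
```

The raw target is the safest default when pasting into a tool the other formats don't cover.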
Output
The final prompt is stored alongside the session and displayed in the recording review view. It includes:
- A visual timeline summarizing UI changes frame-by-frame
- An inline ARIA tree from the Playwright snapshot
- A raw DOM diff comparing element counts between frames
- An optional screenshot bundle (base64-encoded frame thumbnails)
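The raw DOM diff in that output can be pictured as a per-role count comparison between consecutive frames — again an illustrative helper, not the exact implementation:

```python
def diff_counts(before: dict, after: dict) -> dict:
    """Net change in element counts between two frames.
    Positive values mean elements appeared; negative, disappeared."""
    roles = set(before) | set(after)
    return {r: after.get(r, 0) - before.get(r, 0)
            for r in roles
            if after.get(r, 0) != before.get(r, 0)}
```

A frame pair where a modal opens might diff as `{"button": 2, "textbox": 1}`: the modal's controls appeared while everything else held steady.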
Vision lane vs. Playwright lane
The two analysis lanes are complementary, not redundant. Each contributes something the other cannot.

| | Vision lane | Playwright lane |
|---|---|---|
| Data source | Pixel-level frames from your recording | Live DOM loaded in a headless browser |
| What it captures | Visual appearance, UI states over time, element labels as rendered | ARIA roles, accessible names, structural element counts |
| Temporal coverage | Every extracted frame across the full recording | Single snapshot of the seed URL at inspection time |
| Handles animations/transitions | Yes — captures intermediate states | No — snapshot is taken after page load |
| Requires API key | Yes (Anthropic) | No |
| Handles SPAs / dynamic content | Yes, if the recording covers those states | Partially — depends on what renders before snapshot |