# Evals

> Run your agent against a benchmark suite and have a judge score every output, then compare versions on a leaderboard.

An eval suite is a set of tasks (**cases**), each with binary, weighted **criteria**. You run one or more model/agent versions against the suite; for each run your harness ships what the agent produced (a screenshot and/or text) and a judge scores it against the criteria. A **leaderboard** compares versions by case and overall.

The engine is domain-agnostic: the suite owns the judging policy and criteria, so the same machinery grades a design-critique bench, a code-correctness bench, or anything else.

The artifact exporter is `@bryel/evals`: framework-agnostic, zero dependencies.

```bash bun
bun add @bryel/evals
```

```bash npm
npm i @bryel/evals
```

```bash pnpm
pnpm add @bryel/evals
```

```bash yarn
yarn add @bryel/evals
```

Also re-exported from [`@bryel/browser`](/sdk/browser) and `@bryel/vercel`, and available in [Python](/sdk/python) as `bryel.start_eval_session`.

## The flow

One eval task = one `sessionId`. You start the runs, your harness drives the agent under each session and ships the result, and the judge does the rest.

1. Kick off a suite against the versions you want to compare, either from the dashboard's **Evals** page or over MCP with `bryel_start_eval`. This creates one run per case × model and returns each run's `sessionId` and the case `prompt`.

2. For each run, drive **your** agent on the case `prompt`, tracing under the run's `sessionId`. The platform never runs your agent; it only defines what to evaluate.

3. When the agent finishes, capture a screenshot and/or its text output and ship it. The run flips to *judging*.

   ```ts
   import { startEvalSession } from "@bryel/evals";

   const session = startEvalSession(sessionId, { apiKey: "bkp_…" });

   // …run your agent, capture a screenshot…

   await session.export({
     images: [screenshotBase64], // base64 (raw or a data: URL)
     outputTexts: [finalAnswer], // optional
   });
   ```

4. The judge scores each run's artifact against the suite's criteria and writes a per-criterion pass/fail plus reasoning.

5. Compare versions by case and overall on the **Evals** page, or with `bryel_eval_results`.

The run must already exist for a `sessionId` (created in step 1); artifacts **attach** to a run and never create one.

## Putting it together

A harness that runs a whole suite:

```ts
import { startEvalSession } from "@bryel/evals";

// runs[] comes from bryel_start_eval / your API; each has { sessionId, prompt }
for (const r of runs) {
  const session = startEvalSession(r.sessionId, { apiKey: "bkp_…" });
  const out = await runYourAgent(r.prompt, { sessionId: r.sessionId }); // your agent
  await session.export({ images: [out.screenshotBase64], outputTexts: [out.text] });
}
```

## The judge

Each artifact is graded by a multimodal model (the suite picks it) against the case's criteria. A criterion is binary with a **signed weight**: positive means desired, negative means harmful (it passes when the harmful thing is *absent*). The score is the earned points over the sum of positive weights. The judge returns a pass/fail and a one-sentence, artifact-grounded reason per criterion, so a run detail shows exactly why it scored the way it did.

Images are optional: a text-only suite grades `outputTexts` alone. A suite that judges visuals can require an image, in which case a run with no screenshot fails rather than being graded blind.
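To make the scoring rule concrete, here is a minimal sketch of how signed weights could combine into a score. It is not the platform's implementation: the `JudgedCriterion` type and `scoreRun` function are hypothetical names, and it assumes that a failed negative-weight criterion (the harmful thing is present) subtracts its magnitude from the earned points, which the text above does not spell out.

```ts
// Hypothetical shape: one judged criterion with its signed weight.
type JudgedCriterion = {
  weight: number;  // > 0 desired, < 0 harmful
  passed: boolean; // for a harmful criterion, "passed" means the harm is absent
};

// Earned points over the sum of positive weights.
// Assumption: a failed harmful criterion contributes its (negative) weight.
function scoreRun(criteria: JudgedCriterion[]): number {
  let earned = 0;
  let positiveTotal = 0;

  for (const c of criteria) {
    if (c.weight > 0) {
      positiveTotal += c.weight;
      if (c.passed) earned += c.weight;
    } else if (c.weight < 0 && !c.passed) {
      earned += c.weight; // penalty for the harmful thing being present
    }
  }

  // Guard against a suite with no positive criteria.
  return positiveTotal === 0 ? 0 : earned / positiveTotal;
}

// 3 of 4 positive points earned, harmful criterion avoided.
const clean = scoreRun([
  { weight: 3, passed: true },
  { weight: 1, passed: false },
  { weight: -2, passed: true },
]); // 0.75

// All positive points earned, but the harmful criterion was triggered.
const penalized = scoreRun([
  { weight: 3, passed: true },
  { weight: 1, passed: true },
  { weight: -2, passed: false },
]); // 0.5
```

Under this reading a run can score below zero when penalties outweigh earned points; whether the real judge clamps the result is not stated here.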
## API

### `exportEvalSession(options): Promise<{ runId, images }>`

Ship an eval session's artifacts. Attaches them to the run and queues the judge.

### `startEvalSession(sessionId, options): EvalSession`

Bind `(apiKey, sessionId)` and return a handle whose `export(artifacts)` calls `exportEvalSession`. This is the ergonomic form.

### Options

- An ingest key: a **publishable** `bkp_…` in the browser (write-only, origin-locked) or any ingest key server-side.
- The eval task's session id. It must match a run you created.
- Screenshots as base64 (raw or a `data:` URL). The app encodes; the SDK ships bytes.
- Text outputs the agent produced.
- Ingest endpoint. Override for self-hosting.

In the browser, use a publishable `bkp_` key, never a secret `bk_` key.
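Because the app is responsible for encoding screenshots, a harness usually needs a small helper that turns raw image bytes into base64 or a `data:` URL before exporting. The following is a minimal Node-side sketch; `toBase64` and `toDataUrl` are hypothetical helper names and are not part of `@bryel/evals`.

```ts
import { Buffer } from "node:buffer";

// Encode raw image bytes as a plain base64 string.
function toBase64(bytes: Uint8Array): string {
  return Buffer.from(bytes).toString("base64");
}

// Wrap the base64 payload in a data: URL. The MIME type defaults to PNG,
// which is an assumption about how the screenshot was captured.
function toDataUrl(bytes: Uint8Array, mime: string = "image/png"): string {
  return `data:${mime};base64,${toBase64(bytes)}`;
}
```

Either form should be usable wherever the docs above accept a screenshot, since both raw base64 and `data:` URLs are listed as valid. In a browser, where `Buffer` is not available, an equivalent helper would have to be built on `btoa` or `FileReader` instead.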