> ## Documentation Index
> Fetch the complete documentation index at: https://docs.bryel.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Evals

> Run your agent against a benchmark suite and have a judge score every output, then compare versions on a leaderboard.

An eval suite is a set of tasks (**cases**), each with binary, weighted
**criteria**. You run one or more model/agent versions against the suite; for
each run your harness ships what the agent produced — a screenshot and/or text —
and a judge scores it against the criteria. A **leaderboard** compares versions
by case and overall.

The engine is domain-agnostic: the suite owns the judging policy and criteria, so
the same machinery grades a design-critique bench, a code-correctness bench, or
anything else. Exporting artifacts is `@bryel/evals` — framework-agnostic, zero
dependencies.

<CodeGroup>
  ```bash bun theme={"theme":{"light":"github-light","dark":"github-dark"}}
  bun add @bryel/evals
  ```

  ```bash npm theme={"theme":{"light":"github-light","dark":"github-dark"}}
  npm i @bryel/evals
  ```

  ```bash pnpm theme={"theme":{"light":"github-light","dark":"github-dark"}}
  pnpm add @bryel/evals
  ```

  ```bash yarn theme={"theme":{"light":"github-light","dark":"github-dark"}}
  yarn add @bryel/evals
  ```
</CodeGroup>

<Note>Also re-exported from [`@bryel/browser`](/sdk/browser) and `@bryel/vercel`, and available in [Python](/sdk/python) as `bryel.start_eval_session`.</Note>

## The flow

One eval task = one `sessionId`. You start the runs, your harness drives the
agent under each session and ships the result, and the judge does the rest.

<Steps>
  <Step title="Create the runs">
    Kick off a suite against the versions you want to compare — from the
    dashboard's **Evals** page, or over MCP with `bryel_start_eval`. This creates
    one run per case × model and returns each run's `sessionId` and the case
    `prompt`.
  </Step>

  <Step title="Run your agent, per session">
    For each run, drive **your** agent on the case `prompt`, tracing under the
    run's `sessionId`. The platform never runs your agent — it only defines what
    to evaluate.
  </Step>

  <Step title="Export what it produced">
    When the agent finishes, capture a screenshot and/or its text output and ship
    it. The run flips to *judging*.

    ```ts theme={"theme":{"light":"github-light","dark":"github-dark"}}
    import { startEvalSession } from "@bryel/evals";

    const session = startEvalSession(sessionId, { apiKey: "bkp_…" });
    // …run your agent, capture a screenshot…
    await session.export({
      images: [screenshotBase64], // base64 (raw or a data: URL)
      outputTexts: [finalAnswer], // optional
    });
    ```
  </Step>

  <Step title="Read the leaderboard">
    The judge scores each run's artifact against the suite's criteria and writes
    per-criterion pass/fail + reasoning. Compare versions by case and overall on
    the **Evals** page, or with `bryel_eval_results`.
  </Step>
</Steps>

<Note>The run must already exist for a `sessionId` (created in step 1); artifacts **attach** to a run — they never create one.</Note>

## Putting it together

A harness that runs a whole suite:

```ts theme={"theme":{"light":"github-light","dark":"github-dark"}}
import { startEvalSession } from "@bryel/evals";

// runs[] comes from bryel_start_eval / your API — each has { sessionId, prompt }
for (const r of runs) {
  const session = startEvalSession(r.sessionId, { apiKey: "bkp_…" });
  const out = await runYourAgent(r.prompt, { sessionId: r.sessionId }); // your agent
  await session.export({ images: [out.screenshotBase64], outputTexts: [out.text] });
}
```

## The judge

Each artifact is graded by a multimodal model (the suite picks it) against the
case's criteria. A criterion is binary with a **signed weight** — positive means
desired, negative means harmful (it passes when the harmful thing is *absent*).
The score is the earned points over the sum of positive weights. The judge
returns a pass/fail and a one-sentence, artifact-grounded reason per criterion,
so a run detail shows exactly why it scored the way it did.

Images are optional — a text-only suite grades `outputTexts` alone. A suite that
judges visuals can require an image, in which case a run with no screenshot fails
rather than being graded blind.

## API

### `exportEvalSession(options): Promise<{ runId, images }>`

Ship an eval session's artifacts. Attaches them to the run and queues the judge.

### `startEvalSession(sessionId, options): EvalSession`

Bind `(apiKey, sessionId)` and return a handle whose `export(artifacts)` calls
`exportEvalSession` — the ergonomic form.

### Options

<ParamField path="apiKey" type="string" required>An ingest key — a **publishable** `bkp_…` in the browser (write-only, origin-locked) or any ingest key server-side.</ParamField>
<ParamField path="sessionId" type="string" required>The eval task's session id — must match a run you created.</ParamField>
<ParamField path="images" type="string[]">Screenshots as base64 (raw or a `data:` URL). The app encodes; the SDK ships bytes.</ParamField>
<ParamField path="outputTexts" type="string[]">Text outputs the agent produced.</ParamField>
<ParamField path="endpoint" type="string" default="https://ingest.eu.bryel.ai/v1/evals/artifacts">Ingest endpoint. Override for self-hosting.</ParamField>

<Warning>In the browser, use a publishable `bkp_` key — never a secret `bk_` key.</Warning>
