An eval suite is a set of tasks (cases), each with binary, weighted criteria. You run one or more model/agent versions against the suite; for each run, your harness ships what the agent produced (a screenshot and/or text) and a judge scores it against the criteria. A leaderboard compares versions by case and overall. The engine is domain-agnostic: the suite owns the judging policy and criteria, so the same machinery grades a design-critique bench, a code-correctness bench, or anything else. Artifact export is handled by @bryel/evals, which is framework-agnostic and has zero dependencies.
bun add @bryel/evals
It is also re-exported from @bryel/browser and @bryel/vercel, and is available in Python as bryel.start_eval_session.

The flow

One eval task = one sessionId. You start the runs, your harness drives the agent under each session and ships the result, and the judge does the rest.
1. Create the runs

Kick off a suite against the versions you want to compare — from the dashboard’s Evals page, or over MCP with bryel_start_eval. This creates one run per case × model and returns each run’s sessionId and the case prompt.
2. Run your agent, per session

For each run, drive your agent on the case prompt, tracing under the run’s sessionId. The platform never runs your agent — it only defines what to evaluate.
3. Export what it produced

When the agent finishes, capture a screenshot and/or its text output and ship it. The run flips to judging.
import { startEvalSession } from "@bryel/evals";

const session = startEvalSession(sessionId, { apiKey: "bkp_…" });
// …run your agent, capture a screenshot…
await session.export({
  images: [screenshotBase64], // base64 (raw or a data: URL)
  outputTexts: [finalAnswer], // optional
});
4. Read the leaderboard

The judge scores each run’s artifact against the suite’s criteria and writes a per-criterion pass/fail with reasoning. Compare versions by case and overall on the Evals page, or with bryel_eval_results.
The run must already exist for a sessionId (created in step 1); artifacts attach to a run — they never create one.
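
The export step expects each screenshot as a base64 string, either raw or as a data: URL. If your capture tool returns raw bytes instead, a conversion along these lines may help. This is a minimal sketch: bytesToBase64 and toDataUrl are hypothetical helper names, not part of @bryel/evals, and the code assumes a runtime with a global btoa (browsers, Bun, recent Node).

```typescript
// Hypothetical helpers, not part of @bryel/evals.
// Assumes a global btoa (browsers, Bun, Node 16+).

/** Encode raw bytes as a base64 string. */
function bytesToBase64(bytes: Uint8Array): string {
  let binary = "";
  // Build a binary string one byte at a time; btoa expects char codes 0..255.
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

/** Wrap raw base64 in a data: URL. The image/png default is an assumption. */
function toDataUrl(base64: string, mime: string = "image/png"): string {
  return `data:${mime};base64,${base64}`;
}
```

Either form, bytesToBase64(bytes) or toDataUrl(bytesToBase64(bytes)), matches the "raw or a data: URL" shape that images accepts.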

Putting it together

A harness that runs a whole suite:
import { startEvalSession } from "@bryel/evals";

// runs[] comes from bryel_start_eval / your API — each has { sessionId, prompt }
for (const r of runs) {
  const session = startEvalSession(r.sessionId, { apiKey: "bkp_…" });
  const out = await runYourAgent(r.prompt, { sessionId: r.sessionId }); // your agent
  await session.export({ images: [out.screenshotBase64], outputTexts: [out.text] });
}
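
The loop above stops at the first run that throws. A variant that records failures and keeps going might look like the sketch below. It is illustrative only: EvalRun mirrors the { sessionId, prompt } shape mentioned above, while AgentOutput, SuiteReport, runSuite, and the injected runAgent and exportArtifacts callbacks are hypothetical names. The agent and the export call are passed in so the loop itself has no dependencies.

```typescript
// A sketch, not part of @bryel/evals. All names below are hypothetical.
interface EvalRun { sessionId: string; prompt: string; }
interface AgentOutput { screenshotBase64?: string; text?: string; }
interface Artifacts { images?: string[]; outputTexts?: string[]; }

type RunAgent = (prompt: string, sessionId: string) => Promise<AgentOutput>;
type ExportArtifacts = (sessionId: string, artifacts: Artifacts) => Promise<void>;

interface SuiteReport {
  exported: string[];                              // sessionIds that shipped artifacts
  failed: { sessionId: string; error: string }[];  // sessionIds that threw, with the message
}

async function runSuite(
  runs: EvalRun[],
  runAgent: RunAgent,
  exportArtifacts: ExportArtifacts,
): Promise<SuiteReport> {
  const report: SuiteReport = { exported: [], failed: [] };
  for (const run of runs) {
    try {
      const out = await runAgent(run.prompt, run.sessionId);
      // Only include the artifacts the agent actually produced.
      await exportArtifacts(run.sessionId, {
        images: out.screenshotBase64 ? [out.screenshotBase64] : [],
        outputTexts: out.text ? [out.text] : [],
      });
      report.exported.push(run.sessionId);
    } catch (err) {
      // Record the failure and continue with the next run.
      report.failed.push({
        sessionId: run.sessionId,
        error: err instanceof Error ? err.message : String(err),
      });
    }
  }
  return report;
}
```

In a real harness, exportArtifacts would presumably wrap startEvalSession(sessionId, { apiKey }).export(artifacts). The runs are processed sequentially here; whether running them concurrently is appropriate depends on your agent and is left open.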

The judge

Each artifact is graded by a multimodal model (the suite picks it) against the case’s criteria. A criterion is binary with a signed weight — positive means desired, negative means harmful (it passes when the harmful thing is absent). The score is the earned points over the sum of positive weights. The judge returns a pass/fail and a one-sentence, artifact-grounded reason per criterion, so a run detail shows exactly why it scored the way it did. Images are optional — a text-only suite grades outputTexts alone. A suite that judges visuals can require an image, in which case a run with no screenshot fails rather than being graded blind.
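
The scoring rule can be sketched in a few lines. This is an illustration under stated assumptions, not the platform's implementation: the CriterionResult shape and scoreRun are hypothetical names, and the sketch assumes that a failed negative-weight criterion (the harmful thing is present) subtracts its magnitude from the earned points. Whether the real score is clamped at zero is not stated above, so this sketch does not clamp.

```typescript
// Illustrative only: field names and the handling of negative criteria are assumptions.
interface CriterionResult {
  weight: number;  // signed: > 0 means desired, < 0 means harmful
  passed: boolean; // for a negative weight, true means the harmful thing is absent
}

function scoreRun(results: CriterionResult[]): number {
  let earned = 0;
  let positiveTotal = 0;
  for (const r of results) {
    if (r.weight > 0) {
      positiveTotal += r.weight;
      if (r.passed) earned += r.weight; // desired behavior present
    } else if (!r.passed) {
      earned += r.weight; // assumption: harmful behavior present subtracts |weight|
    }
  }
  // A suite with no positive criteria has no denominator; return 0 in that case.
  return positiveTotal === 0 ? 0 : earned / positiveTotal;
}
```

For example, with weights 3 (passed), 2 (failed), and -1 (failed), the earned points are 3 - 1 = 2 over a positive total of 5, so scoreRun returns 0.4.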

API

exportEvalSession(options): Promise<{ runId, images }>

Ship an eval session’s artifacts. Attaches them to the run and queues the judge.

startEvalSession(sessionId, options): EvalSession

Bind (apiKey, sessionId) and return a handle whose export(artifacts) calls exportEvalSession — the ergonomic form.
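
The relationship between the two functions can be pictured as a small closure. The sketch below is not the library's source: bindSession, ExportFn, BoundOptions, and Artifacts are hypothetical stand-ins that only illustrate the "bind once, export later" pattern described above.

```typescript
// Hypothetical stand-ins; not the actual @bryel/evals implementation.
interface BoundOptions { apiKey: string; endpoint?: string; }
interface Artifacts { images?: string[]; outputTexts?: string[]; }
type ExportFn<R> = (
  options: BoundOptions & Artifacts & { sessionId: string },
) => Promise<R>;

function bindSession<R>(exportFn: ExportFn<R>, sessionId: string, options: BoundOptions) {
  return {
    sessionId,
    // Merge the bound credentials and session id with the per-call artifacts.
    export: (artifacts: Artifacts): Promise<R> =>
      exportFn({ ...options, sessionId, ...artifacts }),
  };
}
```

The handle saves repeating apiKey and sessionId on every call, which is convenient when a harness exports more than once per session or passes the handle around.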

Options

apiKey (string, required)
An ingest key: a publishable bkp_… key in the browser (write-only, origin-locked), or any ingest key server-side.

sessionId (string, required)
The eval task’s session id. It must match a run you created.

images (string[])
Screenshots as base64 (raw or a data: URL). The app encodes; the SDK ships bytes.

outputTexts (string[])
Text outputs the agent produced.

endpoint (string, default: "https://ingest.eu.bryel.ai/v1/evals/artifacts")
Ingest endpoint. Override for self-hosting.

In the browser, use a publishable bkp_ key — never a secret bk_ key.