Evals - bryel

An eval suite is a set of tasks (cases), each with binary, weighted criteria. You run one or more model/agent versions against the suite; for each run your harness ships what the agent produced — a screenshot and/or text — and a judge scores it against the criteria. A leaderboard compares versions by case and overall. The engine is domain-agnostic: the suite owns the judging policy and criteria, so the same machinery grades a design-critique bench, a code-correctness bench, or anything else. Exporting artifacts is @bryel/evals — framework-agnostic, zero dependencies.

bun add @bryel/evals

Also re-exported from @bryel/browser and @bryel/vercel, and available in Python as bryel.start_eval_session.

The flow

One eval task = one sessionId. You start the runs, your harness drives the agent under each session and ships the result, and the judge does the rest.

Create the runs

Kick off a suite against the versions you want to compare — from the dashboard’s Evals page, or over MCP with bryel_start_eval. This creates one run per case × model and returns each run’s sessionId and the case prompt.

Run your agent, per session

For each run, drive your agent on the case prompt, tracing under the run’s sessionId. The platform never runs your agent — it only defines what to evaluate.

Export what it produced

When the agent finishes, capture a screenshot and/or its text output and ship it. The run flips to judging.

import { startEvalSession } from "@bryel/evals";

const session = startEvalSession(sessionId, { apiKey: "bkp_…" });
// …run your agent, capture a screenshot…
await session.export({
  images: [screenshotBase64], // base64 (raw or a data: URL)
  outputTexts: [finalAnswer], // optional
});

Read the leaderboard

The judge scores each run’s artifact against the suite’s criteria and writes per-criterion pass/fail + reasoning. Compare versions by case and overall on the Evals page, or with bryel_eval_results.

The run must already exist for a sessionId (created in step 1); artifacts attach to a run — they never create one.

Putting it together

A harness that runs a whole suite:

import { startEvalSession } from "@bryel/evals";

// runs[] comes from bryel_start_eval / your API — each has { sessionId, prompt }
for (const r of runs) {
  const session = startEvalSession(r.sessionId, { apiKey: "bkp_…" });
  const out = await runYourAgent(r.prompt, { sessionId: r.sessionId }); // your agent
  await session.export({ images: [out.screenshotBase64], outputTexts: [out.text] });
}

The judge

Each artifact is graded by a multimodal model (the suite picks it) against the case’s criteria. A criterion is binary with a signed weight — positive means desired, negative means harmful (it passes when the harmful thing is absent). The score is the earned points over the sum of positive weights. The judge returns a pass/fail and a one-sentence, artifact-grounded reason per criterion, so a run detail shows exactly why it scored the way it did. Images are optional — a text-only suite grades outputTexts alone. A suite that judges visuals can require an image, in which case a run with no screenshot fails rather than being graded blind.

API

`exportEvalSession(options): Promise<{ runId, images }>`

Ship an eval session’s artifacts. Attaches them to the run and queues the judge.

`startEvalSession(sessionId, options): EvalSession`

Bind (apiKey, sessionId) and return a handle whose export(artifacts) calls exportEvalSession — the ergonomic form.

Options

apiKey

string

required

An ingest key — a publishable bkp_… in the browser (write-only, origin-locked) or any ingest key server-side.

sessionId

string

required

The eval task’s session id — must match a run you created.

images

string[]

Screenshots as base64 (raw or a data: URL). The app encodes; the SDK ships bytes.

outputTexts

string[]

Text outputs the agent produced.

endpoint

string

default:"https://ingest.eu.bryel.ai/v1/evals/artifacts"

Ingest endpoint. Override for self-hosting.

In the browser, use a publishable bkp_ key — never a secret bk_ key.

​The flow

​Putting it together

​The judge

​API

​exportEvalSession(options): Promise<{ runId, images }>

​startEvalSession(sessionId, options): EvalSession

​Options

The flow

Putting it together

The judge

API

`exportEvalSession(options): Promise<{ runId, images }>`

`startEvalSession(sessionId, options): EvalSession`

Options