@bryel/evals: framework-agnostic, zero dependencies.
Also re-exported from @bryel/browser and @bryel/vercel, and available in Python as bryel.start_eval_session.
The flow
One eval task = one sessionId. You start the runs, your harness drives the agent under each session and ships the result, and the judge does the rest.
Create the runs
Kick off a suite against the versions you want to compare, either from the dashboard’s Evals page or over MCP with bryel_start_eval. This creates one run per case × model and returns each run’s sessionId and the case prompt.
Run your agent, per session
For each run, drive your agent on the case prompt, tracing under the run’s sessionId. The platform never runs your agent; it only defines what to evaluate.
Export what it produced
When the agent finishes, capture a screenshot and/or its text output and ship it. The run flips to judging.
The run must already exist for a sessionId (created in the first step, "Create the runs"); artifacts attach to an existing run and never create one.
Putting it together
A harness that runs a whole suite:
The judge
Each artifact is graded by a multimodal model (the suite picks it) against the case’s criteria. A criterion is binary with a signed weight: positive means desired, negative means harmful (it passes when the harmful thing is absent). The score is the earned points over the sum of positive weights. The judge returns a pass/fail and a one-sentence, artifact-grounded reason per criterion, so a run detail shows exactly why it scored the way it did.
Images are optional: a text-only suite grades outputTexts alone. A suite that judges visuals can require an image, in which case a run with no screenshot fails rather than being graded blind.
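As a worked illustration of the scoring rule above, here is a minimal TypeScript sketch. It is not the platform's judge code: the Criterion and Verdict shapes and the scoreRun name are hypothetical. It also assumes that a failed negative criterion subtracts the magnitude of its weight while a passed one contributes nothing, that a missing verdict counts as a failure, and that scores are not clamped; the text above does not spell out any of these details.

```typescript
// Hypothetical shapes for illustration; not the platform's actual types.
interface Criterion {
  id: string;
  weight: number; // > 0: desired behaviour, < 0: harmful behaviour
}

interface Verdict {
  criterionId: string;
  passed: boolean; // for a negative criterion, "passed" means the harm is absent
}

// Score = earned points / sum of positive weights.
function scoreRun(criteria: Criterion[], verdicts: Verdict[]): number {
  const passedById = new Map<string, boolean>();
  for (const v of verdicts) passedById.set(v.criterionId, v.passed);

  let earned = 0;
  let positiveTotal = 0;
  for (const c of criteria) {
    // Assumption: a criterion with no verdict is treated as failed.
    const passed = passedById.get(c.id) ?? false;
    if (c.weight > 0) {
      positiveTotal += c.weight;
      if (passed) earned += c.weight;
    } else if (c.weight < 0 && !passed) {
      // Assumption: a harmful behaviour that is present costs |weight| points.
      earned += c.weight;
    }
  }
  // Avoid dividing by zero when a suite has no positive criteria.
  return positiveTotal === 0 ? 0 : earned / positiveTotal;
}

const exampleCriteria: Criterion[] = [
  { id: "renders-chart", weight: 3 },
  { id: "has-legend", weight: 1 },
  { id: "leaks-api-key", weight: -2 },
];

// Chart rendered, legend missing, no leak: 3 earned out of 4 positive points.
const exampleScore = scoreRun(exampleCriteria, [
  { criterionId: "renders-chart", passed: true },
  { criterionId: "has-legend", passed: false },
  { criterionId: "leaks-api-key", passed: true },
]); // 0.75
```

Under these assumptions, the same run with the leak present would score (3 - 2) / 4 = 0.25, which shows how a single harmful criterion can outweigh a passed desired one.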
API
exportEvalSession(options): Promise<{ runId, images }>
Ship an eval session’s artifacts. Attaches them to the run and queues the judge.
startEvalSession(sessionId, options): EvalSession
Bind (apiKey, sessionId) and return a handle whose export(artifacts) calls
exportEvalSession — the ergonomic form.
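The relationship between the two functions can be pictured as a small closure. The sketch below is not the SDK's source. ExportFn, bindSession, the images and outputTexts artifact field names, and the placeholder key are all assumptions for illustration, and a fake exporter stands in for the network call so the snippet runs on its own.

```typescript
// Hypothetical shapes; the real SDK's option and artifact types may differ.
interface Artifacts {
  images?: string[];      // assumed field name for base64 screenshots
  outputTexts?: string[]; // text outputs the agent produced
}

interface ExportOptions extends Artifacts {
  apiKey: string;
  sessionId: string;
}

// Generic over the result so the sketch does not guess the real return shape.
type ExportFn<R> = (options: ExportOptions) => Promise<R>;

// Bind the key and session once; export() then only needs the artifacts.
function bindSession<R>(exportFn: ExportFn<R>, sessionId: string, apiKey: string) {
  return {
    sessionId,
    export: (artifacts: Artifacts): Promise<R> =>
      // Bound values are spread last so artifacts cannot override them.
      exportFn({ ...artifacts, apiKey, sessionId }),
  };
}

// Stand-in exporter that records calls instead of hitting the network.
const sentOptions: ExportOptions[] = [];
const fakeExport: ExportFn<{ runId: string }> = async (options) => {
  sentOptions.push(options);
  return { runId: `run-for-${options.sessionId}` };
};

// "bkp_example" is a placeholder, not a real key.
const boundSession = bindSession(fakeExport, "session-123", "bkp_example");
```

With a handle like this, a harness that loops over many runs creates one handle per sessionId and calls export once per run, rather than repeating the key and session id in every export call.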
Options
An ingest key: a publishable bkp_… in the browser (write-only, origin-locked) or any ingest key server-side.
The eval task’s session id; must match a run you created.
Screenshots as base64 (raw or a data: URL). The app encodes; the SDK ships bytes.
Text outputs the agent produced.
Ingest endpoint. Override for self-hosting.
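Because screenshots may be supplied either as raw base64 or as a data: URL, a harness that collects images from mixed sources may want to normalise them to one form before shipping. The helper below is a hypothetical sketch (toRawBase64 is not part of the SDK) that strips a base64 data: URL prefix when one is present and passes raw base64 through unchanged.

```typescript
// Hypothetical helper, not part of @bryel/evals.
function toRawBase64(input: string): string {
  const trimmed = input.trim();
  if (!trimmed.startsWith("data:")) return trimmed; // treated as raw base64

  // A data: URL looks like "data:<mediatype>;base64,<payload>".
  const comma = trimmed.indexOf(",");
  if (comma === -1) throw new Error("Malformed data: URL (no comma)");

  const header = trimmed.slice("data:".length, comma); // e.g. "image/png;base64"
  if (!header.split(";").includes("base64")) {
    throw new Error("data: URL is not base64-encoded");
  }
  return trimmed.slice(comma + 1);
}

const fromDataUrl = toRawBase64("data:image/png;base64,iVBORw0KGgo=");
const fromRaw = toRawBase64("iVBORw0KGgo=");
// Both values are now the same raw base64 string: "iVBORw0KGgo="
```

The sketch does not validate that the payload is well-formed base64 or a real image; it only handles the two input forms named above.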