Evaluations

Memory systems should be evaluated by the behavior they improve. Memory Layer includes a repeatable evaluation harness for testing whether memory changes agent outcomes, retrieval quality, cost, and latency.

What the eval harness protects against

  • Overclaiming from a demo.
  • Confusing retrieval success with autonomous coding success.
  • Ignoring token and latency cost.
  • Treating stale or wrong memories as harmless.

Run an evaluation

Always dry-run first, then run the real suite only after reviewing scripts and fixtures.

# dry run
memory eval run --suite evals/examples/memory-smoke \
  --condition full-memory --profile offline --dry-run

# paired run: no-memory vs full-memory
memory eval run --suite evals/suites/memory-improvement-v1 \
  --condition no-memory --condition full-memory --allow-shell --repeat 5

# compare
memory eval compare \
  --baseline  'target/memory-evals/*no-memory*.json' \
  --candidate 'target/memory-evals/*full-memory*.json' --text

Use --allow-shell only after reviewing suite scripts and fixtures. Shell-executing evals are code execution inputs, not passive data files.
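
One way to make that review step concrete is to list every file a suite could execute or read, with a content hash, before enabling --allow-shell. The sketch below is only an illustration and is not part of the memory CLI: the helper name list_reviewable_files is hypothetical, and it assumes the suite keeps its optional scripts/ and fixtures/ directories directly under the suite directory.

```python
import hashlib
from pathlib import Path


def list_reviewable_files(suite_dir):
    """Return (relative_path, sha256_hex) for every file under scripts/ and fixtures/.

    Hypothetical helper: assumes both directories sit directly under the suite.
    """
    suite = Path(suite_dir)
    found = []
    for sub in ("scripts", "fixtures"):
        root = suite / sub
        if not root.is_dir():
            continue  # both directories are optional in a suite
        for path in sorted(root.rglob("*")):
            if path.is_file():
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
                found.append((path.relative_to(suite).as_posix(), digest))
    return found


if __name__ == "__main__":
    # Print a review checklist for one of the suites used in the commands above.
    for rel_path, digest in list_reviewable_files("evals/suites/memory-improvement-v1"):
        print(f"{digest[:12]}  {rel_path}")
```

Recording the hashes next to the review makes it easier to notice when a script changes after it was last read.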

Suite structure

An eval suite is a directory with:

File                  Purpose
suite.toml            Suite name, project, fixture, default profile, repeat count, label status, and minimum item count.
items.jsonl           One evaluation item per line.
scripts/ (optional)   Commands used by command or agent-build items.
fixtures/ (optional)  Source workspaces, seeded data, or static inputs.
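
Because items.jsonl holds one item per line, a malformed line is easy to catch before a run. The following is a minimal sketch of such a check, not the harness's own loader: the function name load_items and the min_items parameter are hypothetical, and the only assumption about the file is that each non-blank line is a JSON object. The field names inside an item are not documented on this page and are not checked here.

```python
import json


def load_items(lines, min_items=1):
    """Parse JSONL lines into dicts; raise ValueError on malformed input.

    Hypothetical helper: min_items mirrors the idea of a minimum item count,
    but how suite.toml expresses it is not assumed here.
    """
    items = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            item = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {number}: invalid JSON ({exc.msg})") from exc
        if not isinstance(item, dict):
            raise ValueError(f"line {number}: expected a JSON object")
        items.append(item)
    if len(items) < min_items:
        raise ValueError(f"found {len(items)} items, need at least {min_items}")
    return items
```

In practice the lines would come from an open file handle, for example load_items(open(path, encoding="utf-8")).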

Common item types:

Type                  Measures
retrieval_qa          Whether expected memories are returned.
grounded_answer       Whether answers include required facts and avoid forbidden claims.
resume_quality        Whether briefings cover expected context.
command_task          Whether a command succeeds.
agent_build_task      Whether an agent completes a fixture task under controlled conditions.
agent_build_sequence  Whether an agent maintains continuity across ordered work on one workspace.
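
To illustrate what a grounded_answer check involves, here is a rough sketch that scores an answer against required facts and forbidden claims. It assumes plain case-insensitive substring matching, which is a simplification chosen for the example; this page does not describe how the harness actually matches facts, and the function name score_grounded_answer is hypothetical.

```python
def score_grounded_answer(answer, required, forbidden):
    """Score one answer with case-insensitive substring matching (an assumption).

    Returns the share of required facts found, any forbidden claims present,
    and a pass flag that needs every required fact and no forbidden claim.
    """
    text = answer.casefold()
    found = [fact for fact in required if fact.casefold() in text]
    violations = [claim for claim in forbidden if claim.casefold() in text]
    recall = len(found) / len(required) if required else 1.0
    return {
        "assertion_recall": recall,
        "violations": violations,
        "passed": recall == 1.0 and not violations,
    }
```

Substring matching is brittle against paraphrase, which is one reason the labels behind an item should be reviewed before its scores are published.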

External retrievers

Plug in your own retrieval backend for comparison:

memory eval run --suite evals/suites/memory-improvement-v1 \
  --condition full-memory --retriever-cmd './my-retriever' --allow-shell

Ablation tests

Compare no-memory and memory-enabled variants item by item. Pair variants on the same suite, commit, and model to isolate what memory contributes.
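
The item-by-item pairing can be sketched as follows. This is an illustration rather than the logic behind memory eval compare: it assumes results have already been reduced to (item_id, success) pairs, one per repeat, and the function name paired_deltas is hypothetical. The layout of the raw JSON outputs is not assumed.

```python
from collections import defaultdict
from statistics import mean


def paired_deltas(baseline, candidate):
    """Return {item_id: candidate_rate - baseline_rate} for items in both runs.

    baseline and candidate are iterables of (item_id, success) pairs,
    with one pair per repeat of each item.
    """
    def success_rates(runs):
        by_item = defaultdict(list)
        for item_id, ok in runs:
            by_item[item_id].append(1.0 if ok else 0.0)
        return {item_id: mean(values) for item_id, values in by_item.items()}

    base_rates = success_rates(baseline)
    cand_rates = success_rates(candidate)
    # Only items present under both conditions form a valid pair.
    shared = sorted(base_rates.keys() & cand_rates.keys())
    return {item_id: cand_rates[item_id] - base_rates[item_id] for item_id in shared}
```

Items that appear under only one condition are dropped rather than counted as wins or losses, since they have no baseline to compare against.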

Metrics

Metric            Meaning
Success rate      Share of runs in which the task met its expected outcome.
Recall@K          Share of the relevant items that appear in the top K results.
MRR               Mean reciprocal rank: how early the first relevant result appears, averaged over queries.
nDCG              Normalized discounted cumulative gain: whether useful results rank near the top.
Assertion recall  Share of expected factual assertions that were recovered.
Token cost        Model context or generation cost used by a run.
Latency           How long retrieval, answer generation, or eval work took.

Metric improvement is evidence for a bounded claim about the suite, model, and configuration used. It is not universal proof that every future agent task will improve.
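
For readers who want the ranking metrics in the table above spelled out, the sketch below uses their standard textbook definitions with binary relevance. It is not taken from the harness, whose exact formulas may differ, and it assumes each ranked list contains no duplicate IDs. The function names are hypothetical.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Share of relevant items that appear in the first k ranked results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant result, or 0.0 if none is found.

    MRR is the mean of this value over all queries.
    """
    relevant = set(relevant)
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG: discounted gain divided by the best possible gain."""
    relevant = set(relevant)
    dcg = sum(
        1.0 / math.log2(position + 1)
        for position, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 1) for position in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

For a ranked list m3, m1, m7, m2 (made-up IDs) where m1 and m2 are relevant, recall_at_k with k=3 is 0.5 and reciprocal_rank is 0.5, because only one of the two relevant items is in the top three and the first hit is at position two.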

Reproducibility

Tie evaluation claims to artifacts, suite version, commit, model/provider, and configuration. Keep raw JSON outputs under target/memory-evals/ and compare item-level results before making claims.

Current benchmark claim

The latest checked-in Memory improvement report is intentionally bounded. It shows strong improvement for hidden memory-only retrieval and grounded-answer tasks, plus token reduction in that suite. It does not claim universal autonomous coding improvement for every future repository or model.

When publishing a claim, include:

  • suite name and version
  • git commit
  • model and provider
  • conditions compared
  • repeat count
  • item types
  • success, retrieval, token, and latency deltas
  • known shortcomings
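
The checklist above can also be kept as a small structured record so that an incomplete claim is caught before it is published. This is a sketch under stated assumptions: the ClaimRecord type, its field names, and missing_fields are all hypothetical and do not correspond to any file format the harness writes.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ClaimRecord:
    """One published claim; field names are illustrative, mirroring the checklist."""
    suite_name: str
    suite_version: str
    git_commit: str
    model: str
    provider: str
    conditions: tuple
    repeat_count: int
    item_types: tuple
    deltas: dict  # e.g. keys for success, retrieval, token, and latency deltas
    known_shortcomings: tuple


def missing_fields(record):
    """Return the names of fields left empty (empty string, tuple, dict, or zero)."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]
```

If a claim has no known shortcomings, stating that explicitly (for example with a single entry saying so) keeps the field non-empty and the omission deliberate.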

Common mistakes

Mistake                                     Why it weakens the result
Running only one condition                  There is no baseline.
Using one lucky repeat                      LLM variance is hidden.
Ignoring token cost                         Memory may improve quality while increasing cost, or the reverse.
Publishing unreviewed labels                The scoring target may be wrong.
Allowing shell without review               The suite can execute arbitrary code.
Treating retrieval metrics as coding proof  Finding a memory is not the same as completing a software task.
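
The "one lucky repeat" row is easier to appreciate with numbers. The sketch below summarizes pass/fail outcomes from several repeats of one item as a success rate and a sample standard deviation. It is a generic illustration using Python's statistics module, not output from the harness, and summarize_repeats is a hypothetical name.

```python
from statistics import mean, stdev


def summarize_repeats(outcomes):
    """Summarize pass/fail outcomes from repeated runs of one item.

    The spread is None for a single repeat, because one run says
    nothing about run-to-run variance.
    """
    values = [1.0 if ok else 0.0 for ok in outcomes]
    if not values:
        raise ValueError("at least one repeat is required")
    spread = stdev(values) if len(values) > 1 else None
    return {"repeats": len(values), "success_rate": mean(values), "stdev": spread}
```

A single passing run reports a success rate of 1.0 with no spread at all, which is exactly the case where variance stays hidden.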

Next

Read CLI eval reference, How it works, or Operations.

© 2026 Olivier Van Acker (3vilM33pl3). Memory Layer is AGPL-3.0-or-later with commercial licensing available.
