SedaSoft Ltd. · Versioned & dated · Reproducible

Benchmarks

Every time we update SiteEngine AI and re-run the benchmarks, we publish a new dated version. Historical versions stay available so anyone can trace the trajectory of the numbers as the system evolves.

Three benchmarking tracks are operational: RAG retrieval against FinQA, LegalBench, NarrativeQA and QASPER; memory against the Agent Memory Benchmark (PersonaMem, LoComo); and production robustness measured on real workloads.

"The metrics cited were gathered in production, against real workloads, with real users. The benchmarks are external and reproducible. The claims are empirically supported or explicitly qualified."

- Seamus Waldron, SedaSoft Ltd.

Latest release: Version 2026-04-08

Three tracks. One system.

The 8 April 2026 release is the first fully public, three-track benchmark run. All three tracks use the production code path with no benchmark-specific accommodations.

Track 1

RAG retrieval

A 4-layer scoring framework. The gap between token-level F1 and LLM-judge scoring was empirically reproduced on live responses.

Pipeline speedup: 5.3-7.9×
Token F1 vs LLM-judge: 5% vs 64%
Datasets: 4 validated
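
For context on the 5% vs 64% gap: token-level (SQuAD-style) F1 only rewards lexical overlap with a gold answer, so a correct but fuller conversational response can score near zero. The sketch below is an illustrative reimplementation of that standard metric, not SedaSoft's scoring code, and the example strings are hypothetical.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-level F1: bag-of-tokens overlap between a
    generated answer and a gold answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# A correct, fuller answer shares few tokens with a terse gold answer,
# so lexical F1 collapses even when an LLM judge rates it highly.
print(token_f1("the company reported revenue of 4.2 billion dollars",
               "4.2 billion"))  # prints 0.4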
Track 2

Memory (AMB)

Two full Agent Memory Benchmark datasets. At or above the Supermemory leader range on LoComo.

LoComo (1,540 questions): 79.2%
Temporal subset: 87.9%
PersonaMem (589 questions): 47.4%
Track 3

Production robustness

247 production-length turns. Zero errors. Local voice path below the "feels instant" perceptual threshold.

Error rate: 0 / 247
p95 latency: 9.9 s
Local voice TTFB: 341 ms
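
The p95 figure is the 95th-percentile turn latency across the 247 recorded turns. The nearest-rank sketch below shows how such a percentile is typically computed; it is illustrative only, with hypothetical variable names, and is not SedaSoft's measurement code.

```python
import math

def p95(latencies_s: list[float]) -> float:
    """Nearest-rank 95th percentile over recorded per-turn latencies (seconds)."""
    if not latencies_s:
        raise ValueError("no latency samples recorded")
    ordered = sorted(latencies_s)
    rank = math.ceil(0.95 * len(ordered))  # smallest rank covering >= 95% of turns
    return ordered[max(rank, 1) - 1]

# Example (hypothetical data): with 247 per-turn latencies, this returns the
# value reported above as "p95 latency".
# print(f"p95 = {p95(turn_latencies):.1f} s")
```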
Read the 2026-04-08 release in full
Version history

Every release. Dated. Preserved.

Each entry links to a full, self-contained report for that release. Earlier versions are not overwritten. When the numbers change, the change is visible on the record - not rewritten on top of it.

About this work

These benchmarks were not written to support a funding round or to create the impression of academic credibility. They were written because we won't start to build something without understanding it thoroughly - and writing a proper account of it is part of that process.

The metrics cited were gathered in production, against real workloads, with real users. The benchmarks are external and reproducible. The claims are empirically supported or explicitly qualified.

If you are a researcher, an academic institution, or an organisation working on adjacent problems and you find this work interesting, we would welcome a conversation.

Versioning policy

  • Every release is dated
  • Historical versions stay live
  • No silent overwrites
  • External, reproducible datasets where available

Want to dig into the numbers?

We are happy to share the methodology, raw run data, and benchmark adapters. All code, adapters, and thesis appendices needed to reproduce the results are available.

Start a conversation