Every time we update SiteEngine AI and re-run the benchmarks, we publish a new dated version. Historical versions stay available so anyone can trace the trajectory of the numbers as the system evolves.
Three benchmarking tracks are operational: RAG retrieval against FinQA, LegalBench, NarrativeQA and QASPER; memory against the Agent Memory Benchmark (PersonaMem, LoComo); and production robustness measured on real workloads.
"The metrics cited were gathered in production, against real workloads, with real users. The benchmarks are external and reproducible. The claims are empirically supported or explicitly qualified."
- Seamus Waldron, SedaSoft Ltd.
The 8 April 2026 release is the first fully public, three-track benchmark run. All three tracks use the production code path with no benchmark-specific accommodations.
RAG retrieval: 4-layer scoring framework; token F1 compared against LLM-judge critique, empirically reproduced on live responses (token-F1 sketch below).
Memory: two full Agent Memory Benchmark datasets; at or above the Supermemory leader range on LoComo.
Production robustness: 247 production-length turns with zero errors; local voice path below the "feels instant" perceptual threshold.
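Token F1 here is read as the standard SQuAD-style token-level F1. The sketch below is a minimal illustration with a plain whitespace tokeniser; the function name and example strings are ours, not the production scoring code.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a gold answer (SQuAD-style)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)  # multiset overlap of tokens
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Illustrative strings only: partial overlap between a live response and a gold answer.
print(token_f1("net revenue rose 12 percent", "revenue rose 12 percent year on year"))
```

A production framework would typically normalise punctuation and articles before tokenising; this sketch skips that step for brevity.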
RAG latency 5.3-7.9× speedup; LoComo memory 79.2% (at/above Supermemory leader range); production stress 0 errors over 247 turns, p95 9.9 s. Three full tracks published together for the first time.
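The p95 figure is the 95th-percentile turn latency: 95% of turns complete at or below it. A minimal nearest-rank sketch, using placeholder timings rather than the actual 247-turn run data:

```python
import math

def p95_latency(turn_latencies_s: list[float]) -> float:
    """95th-percentile latency via the nearest-rank method, in seconds."""
    ordered = sorted(turn_latencies_s)
    rank = math.ceil(0.95 * len(ordered))  # smallest rank covering 95% of turns
    return ordered[rank - 1]

# Placeholder timings for illustration only; not the production run data.
sample = [0.8, 1.1, 1.4, 2.0, 2.3, 3.1, 3.7, 4.2, 4.9, 5.5,
          6.0, 6.4, 7.1, 7.8, 8.3, 8.9, 9.2, 9.6, 10.4, 11.8]
print(f"p95 = {p95_latency(sample):.1f} s")
```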
Versioning policy
Each entry links to a full, self-contained report for that release. Earlier versions are not overwritten. When the numbers change, the change is visible on the record - not rewritten on top of it.
Next release
New versions will be published here each time SiteEngine AI is updated and the benchmarks are re-run. Previous versions will remain at their original dated URLs.
These benchmarks were not written to support a funding round or to create the impression of academic credibility. They were written because we won't start to build something without understanding it thoroughly - and writing a proper account of it is part of that process.
If you are a researcher, an academic institution, or an organisation working on adjacent problems and you find this work interesting, we would welcome a conversation.
We are happy to share methodology, raw run data, and the benchmark adapters; every result can be reproduced from the code, benchmark adapters and thesis appendices.
Start a conversation