Every time we update SiteEngine AI and re-run the benchmarks, we publish a new dated version. Historical versions stay available so anyone can trace the trajectory of the numbers as the system evolves.
Three benchmarking tracks are operational: RAG retrieval against FinQA, LegalBench, NarrativeQA and QASPER; memory against the Agent Memory Benchmark (PersonaMem, LoComo); and production robustness measured on real workloads.
"The metrics cited were gathered in production, against real workloads, with real users. The benchmarks are external and reproducible. The claims are empirically supported or explicitly qualified."
- Seamus Waldron, SedaSoft Ltd.
The 8 April 2026 release is the first fully public, three-track benchmark run. All three tracks use the production code path with no benchmark-specific accommodations.
RAG retrieval: 4-layer scoring framework; token F1 compared against LLM-judge critique, empirically reproduced on live responses (token-F1 sketch below).
Memory: two full Agent Memory Benchmark datasets; at or above the Supermemory leader range on LoComo.
Production robustness: 247 production-length turns with zero errors; local voice path below the "feels instant" perceptual threshold.
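Token F1 here is read as the standard SQuAD-style token-level F1. The sketch below is a minimal illustration with a plain whitespace tokeniser; the function name and example strings are ours, not the production scoring code.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a gold answer (SQuAD-style)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)  # multiset overlap of tokens
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Illustrative strings only: partial overlap between a live response and a gold answer.
print(token_f1("net revenue rose 12 percent", "revenue rose 12 percent year on year"))
```

A production framework would typically normalise punctuation and articles before tokenising; this sketch skips that step for brevity.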
RAG latency 5.3-7.9× speedup; LoComo memory 79.2% (at/above Supermemory leader range); production stress 0 errors over 247 turns, p95 9.9 s. Three full tracks published together for the first time.
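The p95 figure is the 95th-percentile turn latency: 95% of turns complete at or below it. A minimal nearest-rank sketch, using placeholder timings rather than the actual 247-turn run data:

```python
import math

def p95_latency(turn_latencies_s: list[float]) -> float:
    """95th-percentile latency via the nearest-rank method, in seconds."""
    ordered = sorted(turn_latencies_s)
    rank = math.ceil(0.95 * len(ordered))  # smallest rank covering 95% of turns
    return ordered[rank - 1]

# Placeholder timings for illustration only; not the production run data.
sample = [0.8, 1.1, 1.4, 2.0, 2.3, 3.1, 3.7, 4.2, 4.9, 5.5,
          6.0, 6.4, 7.1, 7.8, 8.3, 8.9, 9.2, 9.6, 10.4, 11.8]
print(f"p95 = {p95_latency(sample):.1f} s")
```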
Versioning policy
Each entry links to a full, self-contained report for that release. Earlier versions are not overwritten. When the numbers change, the change is visible on the record - not rewritten on top of it.
Next release
New versions will be published here each time SiteEngine AI is updated and the benchmarks are re-run. Previous versions will remain at their original dated URLs.
These benchmarks were not written to support a funding round or to create the impression of academic credibility. They were written because we won't start to build something without understanding it thoroughly - and writing a proper account of it is part of that process.
If you are a researcher, an academic institution, or an organisation working on adjacent problems and you find this work interesting, we would welcome a conversation.
We are happy to share methodology, raw run data, and the benchmark adapters; every result can be reproduced from the code, benchmark adapters and thesis appendices.
Start a conversation