SedaSoft Ltd. · In production

The Work

Approximately 60,000 lines of Go. Five functional pillars. A 17-stage hybrid RAG pipeline. Core architecture refined continuously since 1998. So what was actually built? Find out how SiteEngine AI works, why it was designed that way, and where to go to dig deeper.

~60k

Lines of Go

17

Pipeline stages

7 of 8

Stages run locally

46

MCP tools

Platform

Go · Multi-tenant · Production

SiteEngine AI

SiteEngine AI is a production-grade, multi-tenant retrieval-augmented generation (RAG) platform. It is built in Go - not assembled from Python libraries - and deployed as compiled binaries. It handles the full cycle from document ingestion to response generation, across multiple tenants, with strict data isolation at every layer.

The central architectural decision is the distribution of work across the pipeline. The platform uses local ONNX-compiled neural networks for every stage that does not require language generation - classification, routing, embedding, semantic compression, entity extraction, reranking. Only the final generation step calls a cloud LLM. The consequence is a 75% reduction in API calls and a 30-50% reduction in token consumption compared to standard RAG architectures.
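
As an illustration of that split - a minimal Go sketch with hypothetical interface and field names, not code from the SiteEngine AI codebase - every stage before generation is a local call, and only the final step produces network traffic.

package pipeline

import (
	"context"
	"fmt"
)

// LocalStage is any step backed by a local ONNX model: classification,
// routing, embedding, semantic compression, entity extraction, reranking.
type LocalStage interface {
	Run(ctx context.Context, q *Query) error
}

// Query carries the state accumulated as it moves through the pipeline.
type Query struct {
	Text    string
	Class   string // assigned by the classifier
	Context string // assembled, compressed context
}

// CloudLLM is the only component that produces API traffic.
type CloudLLM interface {
	Generate(ctx context.Context, prompt string) (string, error)
}

// Answer runs every local stage in order, then makes exactly one cloud call.
func Answer(ctx context.Context, q *Query, stages []LocalStage, llm CloudLLM) (string, error) {
	for _, s := range stages {
		if err := s.Run(ctx, q); err != nil {
			return "", fmt.Errorf("local stage failed: %w", err)
		}
	}
	// Only the generation step leaves the machine.
	return llm.Generate(ctx, q.Context+"\n\n"+q.Text)
}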

The platform also integrates 46 Model Context Protocol tools for Claude Desktop, giving direct programmatic access to the knowledge graph, pipeline controls, and retrieval system from within the Claude environment.
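
A hedged sketch of what one tool registration can look like. The types, the registry, and the graph_lookup example below are illustrative only - they are not the platform's actual MCP surface or any particular SDK's API - but they show the shape of a tool listing: a name, a description, and a JSON Schema for its input.

package mcp

// Tool is a hypothetical descriptor for one MCP tool.
type Tool struct {
	Name        string         `json:"name"`
	Description string         `json:"description"`
	InputSchema map[string]any `json:"inputSchema"`
}

// Registry holds the tool set advertised to Claude Desktop.
type Registry struct {
	tools map[string]Tool
}

func NewRegistry() *Registry { return &Registry{tools: map[string]Tool{}} }

func (r *Registry) Register(t Tool) { r.tools[t.Name] = t }

// ExampleGraphTool registers an illustrative knowledge-graph lookup tool.
func ExampleGraphTool(r *Registry) {
	r.Register(Tool{
		Name:        "graph_lookup",
		Description: "Traverse the tenant knowledge graph from a named entity.",
		InputSchema: map[string]any{
			"type": "object",
			"properties": map[string]any{
				"entity": map[string]any{"type": "string"},
				"depth":  map[string]any{"type": "integer"},
			},
			"required": []string{"entity"},
		},
	})
}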

The Pipeline

How it works: ingestion to response

Stage 1 - Ingestion

Baibelfish Document Engine

LOCAL · 12 formats

Baibelfish is the document ingestion engine - the system that transforms raw content into a form the platform can retrieve from. It processes 12 input formats (PDF, HTML, DOCX, CSV, and others) using format-specific extraction pipelines. The critical design decision is content-aware chunking: rather than splitting documents by size or delimiter, Baibelfish analyses document structure, semantic density, and heading hierarchy to produce chunks that are optimised for retrieval accuracy rather than processing convenience.
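
A simplified sketch of structure-first chunking. The Section and Chunk types and the length heuristic standing in for semantic density are assumptions for illustration, not Baibelfish internals - the point is that boundaries follow headings and paragraphs, never a byte count.

package chunker

import "strings"

// Section is a structural unit recovered from a parsed document:
// a heading, its depth in the heading hierarchy, and its body text.
type Section struct {
	Heading string
	Depth   int
	Body    string
}

// Chunk is a retrieval unit whose boundaries follow document structure.
type Chunk struct {
	Heading string
	Text    string
}

// ChunkByStructure splits on headings first, then only subdivides a section
// when its size (a crude stand-in for semantic density here) suggests it
// covers more than one retrievable idea. The threshold is illustrative.
func ChunkByStructure(secs []Section, maxRunes int) []Chunk {
	var out []Chunk
	for _, s := range secs {
		body := strings.TrimSpace(s.Body)
		if len([]rune(body)) <= maxRunes {
			out = append(out, Chunk{Heading: s.Heading, Text: body})
			continue
		}
		// Oversized section: split on paragraph boundaries, never mid-paragraph.
		for _, p := range strings.Split(body, "\n\n") {
			if p = strings.TrimSpace(p); p != "" {
				out = append(out, Chunk{Heading: s.Heading, Text: p})
			}
		}
	}
	return out
}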

After chunking, the engine runs entity extraction, cross-document relationship mapping, and local embedding generation - building both a vector index and a Dgraph-based knowledge graph. RAPTOR hierarchical summarisation creates multi-level abstractions that support both precise retrieval and broad thematic queries.
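
In outline, each enriched chunk is written to two retrieval surfaces. The interfaces below are illustrative stand-ins - the Dgraph client and the ONNX embedding model are abstracted away - but they show the dual write: a dense vector for similarity search, plus extracted entities and relations for graph traversal.

package ingest

import "context"

// Embedder produces a local embedding for a chunk.
type Embedder interface {
	Embed(ctx context.Context, text string) ([]float32, error)
}

// VectorIndex and GraphStore are the two retrieval surfaces built at ingestion.
type VectorIndex interface {
	Upsert(ctx context.Context, id string, vec []float32) error
}

type GraphStore interface {
	AddEntity(ctx context.Context, name, kind string) error
	AddRelation(ctx context.Context, from, to, predicate string) error
}

// IndexChunk writes one enriched chunk to both surfaces.
func IndexChunk(ctx context.Context, id, text string, entities map[string]string,
	relations [][3]string, emb Embedder, vi VectorIndex, gs GraphStore) error {

	vec, err := emb.Embed(ctx, text)
	if err != nil {
		return err
	}
	if err := vi.Upsert(ctx, id, vec); err != nil {
		return err
	}
	for name, kind := range entities {
		if err := gs.AddEntity(ctx, name, kind); err != nil {
			return err
		}
	}
	for _, r := range relations {
		if err := gs.AddRelation(ctx, r[0], r[1], r[2]); err != nil {
			return err
		}
	}
	return nil
}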

Benchmarked against

HotpotQA, SQuAD, Natural Questions, FinQA, MultiFieldQA - comparative evaluation vs LangChain, LlamaIndex, and Unstructured.io.

Baibelfish thesis

Stage 2 - Staging & Promotion

DeepThought Architecture

LOCAL · Transactional

DeepThought introduces the staging-and-promotion paradigm to RAG ingestion. Rather than processing documents in a single undifferentiated pass, it moves content through defined stages - raw, parsed, chunked, enriched, validated, expert-promoted - with transactional transitions between each. Stage transitions are atomic: a document that fails validation stays at its current stage and does not corrupt the retrieval index. Rollback is possible. Audit trails are maintained.
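
The stage names below come from the description above; the transition logic is a minimal sketch of the atomic, audited advance, not DeepThought's implementation. Validation runs before anything is recorded, so a failing document stays where it is.

package staging

import (
	"errors"
	"fmt"
)

// Stage is the document's position in the ingestion lifecycle.
type Stage int

const (
	Raw Stage = iota
	Parsed
	Chunked
	Enriched
	Validated
	ExpertPromoted
)

var ErrValidationFailed = errors.New("validation failed; document held at current stage")

// Doc tracks the current stage plus an audit trail of transitions.
type Doc struct {
	ID    string
	Stage Stage
	Audit []string
}

// Advance moves a document exactly one stage forward. The validate hook runs
// before the transition is recorded; on failure nothing changes, so a bad
// document never reaches the retrieval index.
func (d *Doc) Advance(next Stage, validate func(*Doc) error) error {
	if next != d.Stage+1 {
		return fmt.Errorf("illegal transition %d -> %d", d.Stage, next)
	}
	if err := validate(d); err != nil {
		return fmt.Errorf("%w: %v", ErrValidationFailed, err)
	}
	d.Audit = append(d.Audit, fmt.Sprintf("%d -> %d", d.Stage, next))
	d.Stage = next
	return nil
}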

Expert promotion identifies documents of particularly high retrieval value - primary sources, authoritative references, high-confidence extractions - and gives them elevated weight in the retrieval layer. The evaluation demonstrates measurable retrieval quality uplift from promotion, quantified across 35,000+ document sections.

DeepThought thesis

Stage 3 - Query Classification

7-Type Local Classifier

LOCAL · 10-30ms

Every incoming query is classified before retrieval begins. The classifier - an ONNX-compiled neural network running on local hardware - assigns one of seven query types: factual, analytical, comparative, temporal, multi-hop, ambiguous, or out-of-scope. The classification governs downstream routing: which retrieval strategy is used, whether the knowledge graph is queried alongside the vector index, whether RAPTOR summaries are relevant, and what citation standard applies.

Classification takes 10-30ms and produces no API traffic. Out-of-scope queries are handled at this stage - returning a boundary response without an LLM call - which alone accounts for a significant fraction of the token savings on real-world workloads.
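
The seven labels are the platform's own; the routing plan below is an illustrative sketch of how a classification can drive those downstream decisions, including the local short-circuit for out-of-scope queries. Field names and strategy mappings are assumptions, not the production routing table.

package routing

// QueryType is the classifier's output.
type QueryType int

const (
	Factual QueryType = iota
	Analytical
	Comparative
	Temporal
	MultiHop
	Ambiguous
	OutOfScope
)

// Plan describes the retrieval decisions driven by classification.
type Plan struct {
	UseGraph    bool // query the knowledge graph alongside the vector index
	UseRAPTOR   bool // include hierarchical summaries
	SkipLLM     bool // answer with a boundary response, no API call
	CitationStd string
}

// Route maps a query type to a retrieval plan.
func Route(t QueryType) Plan {
	switch t {
	case MultiHop, Comparative:
		return Plan{UseGraph: true, UseRAPTOR: true, CitationStd: "strict"}
	case Analytical, Temporal:
		return Plan{UseGraph: true, CitationStd: "standard"}
	case Ambiguous:
		return Plan{UseRAPTOR: true, CitationStd: "standard"}
	case OutOfScope:
		// Handled entirely locally: a boundary response, no cloud tokens spent.
		return Plan{SkipLLM: true}
	default: // Factual
		return Plan{CitationStd: "standard"}
	}
}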

Stage 4 - Retrieval

Hybrid Vector + Knowledge Graph Search

LOCAL · Dgraph

Retrieval combines dense vector search with traversal of a Dgraph-based knowledge graph. The vector index returns semantically similar chunks; the knowledge graph adds entity relationships, cross-document connections, and provenance chains that vector similarity alone cannot surface. For multi-hop queries - where the answer requires connecting information across multiple documents - graph traversal is essential.

Retrieved candidates are reranked using a local neural network before context assembly. The reranking model is domain-tuned per tenant where training data is available.
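
A condensed sketch of the hybrid flow - vector hits, optional graph expansion for multi-hop queries, then local reranking. All three interfaces are hypothetical stand-ins rather than the platform's actual retrieval API.

package retrieval

import (
	"context"
	"sort"
)

// Candidate is a retrieved chunk with a relevance score.
type Candidate struct {
	ID    string
	Text  string
	Score float64
}

type VectorSearch interface {
	Search(ctx context.Context, query string, k int) ([]Candidate, error)
}

type GraphTraverse interface {
	Expand(ctx context.Context, query string, hops int) ([]Candidate, error)
}

type Reranker interface {
	Score(ctx context.Context, query string, c Candidate) (float64, error)
}

// Retrieve merges dense hits with graph-expanded hits, then reranks locally.
func Retrieve(ctx context.Context, q string, multiHop bool,
	vs VectorSearch, gt GraphTraverse, rr Reranker, k int) ([]Candidate, error) {

	cands, err := vs.Search(ctx, q, k)
	if err != nil {
		return nil, err
	}
	if multiHop {
		more, err := gt.Expand(ctx, q, 2)
		if err != nil {
			return nil, err
		}
		cands = append(cands, more...)
	}
	for i := range cands {
		s, err := rr.Score(ctx, q, cands[i])
		if err != nil {
			return nil, err
		}
		cands[i].Score = s
	}
	sort.Slice(cands, func(i, j int) bool { return cands[i].Score > cands[j].Score })
	if len(cands) > k {
		cands = cands[:k]
	}
	return cands, nil
}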

Stage 5 - Prompt Assembly

Semantic Compression & Context Assembly

LOCAL · 16×-64× compression

Before any content reaches the LLM, it passes through a semantic compression stage. Retrieved chunks are compressed - removing redundancy, low-information passages, and content already represented in the existing context - at ratios between 16× and 64× depending on document type. Only the minimum context required to answer the query is assembled into the prompt.

The citation system assigns one of three citation types to each piece of retrieved content: direct quote, paraphrase with confidence score, or inferred relationship. These citations survive into the generated response - the LLM is given the citation structure, not asked to construct it from memory.
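
A minimal sketch of the two ideas above - the three citation types attached to retrieved content, and a compression budget bounded to the 16×-64× range. Type names and thresholds are illustrative, not the platform's own.

package prompt

// CitationType is attached to every piece of retrieved content before it
// reaches the LLM, so the model reproduces citations rather than inventing them.
type CitationType int

const (
	DirectQuote CitationType = iota
	Paraphrase  // carries a confidence score
	InferredRelationship
)

type Citation struct {
	Type       CitationType
	SourceID   string
	Confidence float64 // used for Paraphrase
}

// ContextBlock pairs compressed text with its citation structure.
type ContextBlock struct {
	Text     string
	Citation Citation
}

// TargetLength computes the compression budget for a retrieved span,
// clamping the ratio to the 16×-64× range described above.
func TargetLength(originalRunes, ratio int) int {
	if ratio < 16 {
		ratio = 16
	}
	if ratio > 64 {
		ratio = 64
	}
	n := originalRunes / ratio
	if n < 1 {
		n = 1
	}
	return n
}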

This stage is where the majority of the token savings materialise. In a standard RAG pipeline, the full retrieved context goes to the LLM unchanged. Here, only what is needed goes - and the structure of what is needed is explicitly defined.

Stage 6 - Conversation Context

Cognitive Memory & Character System

LOCAL · Psychological models

The conversation context system manages memory across sessions using Ebbinghaus forgetting curves - a mathematical model of how human memory decays over time. Information from earlier in a conversation fades at a calibrated rate; corrections and high-confidence new information are retained at higher weight. The system does not have a fixed context window; it has a decaying memory with active management.
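
The forgetting curve itself is compact. A sketch, assuming the standard Ebbinghaus form R = e^(-t/S), where t is elapsed time and S is the memory's strength: corrections and high-confidence information get a larger S, so they decay more slowly. The threshold is illustrative.

package memory

import "math"

// Retention applies the Ebbinghaus forgetting curve R = e^(-t/S).
func Retention(elapsedHours, strength float64) float64 {
	if strength <= 0 {
		return 0
	}
	return math.Exp(-elapsedHours / strength)
}

// Keep decides whether a memory item still participates in context assembly.
func Keep(elapsedHours, strength, threshold float64) bool {
	return Retention(elapsedHours, strength) >= threshold
}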

AI character state is modelled using the PAD emotional model (Pleasure, Arousal, Dominance) - a well-validated psychological framework for representing emotional state as a three-dimensional space. Emotional state shifts in response to conversational events and influences tone and response strategy. No biometrics are used; all emotional inference is from text.
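
A PAD state reduces to three bounded axes. The sketch below assumes a [-1, 1] normalisation and illustrative update deltas; it is not the platform's character model.

package character

// PAD represents emotional state as a point in Pleasure-Arousal-Dominance
// space, each axis normalised to [-1, 1].
type PAD struct {
	Pleasure  float64
	Arousal   float64
	Dominance float64
}

// Nudge shifts the state in response to a conversational event and clamps
// each axis back into range. The deltas are inferred from text only.
func (p *PAD) Nudge(dP, dA, dD float64) {
	p.Pleasure = clamp(p.Pleasure + dP)
	p.Arousal = clamp(p.Arousal + dA)
	p.Dominance = clamp(p.Dominance + dD)
}

func clamp(v float64) float64 {
	if v > 1 {
		return 1
	}
	if v < -1 {
		return -1
	}
	return v
}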

Communication Accommodation Theory governs how the system adapts to individual users over time - adjusting register, vocabulary, and pace. This is believed to be the first production deployment of Communication Accommodation Theory in an AI system.

Stages 7-8 - Efficiency & Generation

The Efficiency Engine + LLM

Stage 7: LOCAL · Stage 8: API

The Efficiency Engine is the cross-system layer that enforces token discipline across the entire pipeline. Before the assembled prompt reaches the LLM, the Efficiency Engine applies a final check: budget constraint, latency target, and carbon threshold. If the prompt exceeds defined limits, it triggers further compression or returns a structured partial response without an API call. This is the self-regulating health gating mechanism - the first known implementation in a production RAG system.

The Efficiency Engine also maintains the per-query carbon accounting framework. Using the Jegham et al. (arXiv:2505.09598) eco-efficiency benchmarks as a model-level foundation, it extends the measurement to the application layer - capturing the carbon cost of each query as a function of the processing path taken, not just the tokens sent to the model.
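
A minimal sketch of the gate, assuming illustrative limit names and units: if the estimate for the assembled prompt exceeds any threshold - token budget, latency target, or carbon - the pipeline compresses further or returns a structured partial response instead of making the API call.

package efficiency

import "errors"

// Limits are the gates applied before any API call.
type Limits struct {
	MaxPromptTokens int
	MaxLatencyMs    int
	MaxGramsCO2e    float64
}

// Estimate is the pipeline's prediction for the assembled prompt, including
// the carbon attributed to the processing path taken, not just tokens sent.
type Estimate struct {
	PromptTokens int
	LatencyMs    int
	GramsCO2e    float64
}

var ErrOverBudget = errors.New("prompt exceeds limits: compress further or return partial response")

// Gate is the final check before Stage 8. On failure, no API call is made.
func Gate(e Estimate, l Limits) error {
	if e.PromptTokens > l.MaxPromptTokens ||
		e.LatencyMs > l.MaxLatencyMs ||
		e.GramsCO2e > l.MaxGramsCO2e {
		return ErrOverBudget
	}
	return nil
}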

Stage 8 - generation - is the only stage that calls a cloud LLM. Everything before it has run locally. The LLM receives a semantically compressed prompt, a structured context block with citations, a character state injection, and a response format specification. It generates. The pipeline ends.

62%

Avg token reduction vs standard RAG

75%

Fewer cloud API calls

Per query

Carbon measurement at app layer

Efficiency Engine thesis

The Original Platform

In production since 1998 · Information management platform · Still running on the same architecture principles 27 years on

SiteEngine

Before the AI platform, there was SiteEngine - a declarative information management platform in continuous production since 1998. It was never built to be a CMS. It was built to manage any type of structured digital content the right way - with declarative page configuration, page inheritance, three-level processing and content-type abstraction, all present before most of the frameworks that would later claim these patterns as their own.

The codebase has been entirely rewritten in Go, but the architectural decisions that governed SiteEngine in 1998 still govern it today, 27 years on. The system works because it was built properly, not because it has been continuously patched.

The SiteEngine thesis documents this history - not as nostalgia, but because a platform that has operated without architectural replacement for 27 years is evidence that the original design decisions were correct.

Architecture

Declarative page configuration, page inheritance, three-level processing - all predating equivalent patterns in modern frameworks by approximately a decade.

Production record

Continuous operation since 1998 across multiple live sites. No architectural replacement. No framework migration. Same foundations, 27 years later.

Case studies

Management-Issues.com (7,200+ articles, 142 contributors, 25 years).

Twenty-five years in production

Management-Issues.com

The longest-running SiteEngine deployment is a live demonstration of what architectural durability looks like. Management-Issues, a management and leadership publication, has run continuously on SiteEngine since 2003 - 7,200+ articles, 142 contributors, and a 25-year archive without a single architectural migration. It celebrates its 25th year in 2026.

7,200+

Articles

142

Columnists

25 yrs

Continuous operation

management-issues.com

If this work interests you, let's talk.

We are always interested in exploring commercial or research collaborations, joint publications, and conversations with organisations working on similar problems.