Methods & architecture

What's actually running under the hood — the retrieval pipeline, the exact parameters, the live index numbers, and the literature each technique comes from. Every figure below is computed from the running index, not asserted.

Hybrid vector + keyword + graph RAG · provenance-aware

1. Ingest & chunk

Outlook threads, MeetGeek-transcribed Teams calls, and SharePoint PDFs/drawings are normalized and chunked into the retrieval corpus.

2. Embed (vector lane)

Terms pass through a domain synonym map and are hashed into a fixed-dimensional term-frequency (TF) vector, which is then L2-normalized. Query and chunk vectors are compared by cosine similarity.
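A minimal sketch of a hashed bag-of-terms embedding of this kind, under stated assumptions: the synonym entries, the stopword list, the tokenizer, and the choice of MD5 as the bucket hash are all illustrative and not taken from the running system. Only the dimension (96) comes from the scoring section.

```python
import hashlib
import math
import re

DIM = 96  # matches the "dim = 96" figure quoted in the scoring section

# Hypothetical synonym map: variant term -> canonical term (illustrative only).
SYNONYMS = {"dwg": "drawing", "drawings": "drawing", "spec": "specification"}
# Hypothetical stopword list (illustrative only).
STOPWORDS = {"the", "a", "of", "for", "and"}


def tokenize(text: str) -> list[str]:
    """Lowercase, split on non-alphanumerics, drop stopwords, apply synonyms."""
    terms = re.findall(r"[a-z0-9]+", text.lower())
    return [SYNONYMS.get(t, t) for t in terms if t not in STOPWORDS]


def embed(text: str, dim: int = DIM) -> list[float]:
    """Hash each term into one of `dim` buckets, count, then L2-normalize."""
    vec = [0.0] * dim
    for term in tokenize(text):
        # A stable hash (unlike built-in hash()) keeps vectors deterministic.
        bucket = int(hashlib.md5(term.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(u: list[float], v: list[float]) -> float:
    """Dot product; equals cosine similarity when both inputs are unit-length."""
    return sum(a * b for a, b in zip(u, v))
```

With this sketch, `embed("the dwg spec")` and `embed("drawing specification")` produce the same vector, because the synonym map canonicalizes the terms before hashing. Distinct terms can collide in the same bucket at 96 dimensions, which is one reason the lexical lane exists alongside this one.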

3. Lexical (keyword lane)

Okapi BM25 with IDF weighting and document-length normalization over the same corpus.
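The keyword lane can be sketched as follows. The `k1` and `b` values are the ones quoted in the scoring section; the IDF variant (the smoothed, non-negative form) is an assumption, since the page does not say which one it uses, and `bm25_scores` is a hypothetical helper name. The sketch assumes a non-empty corpus of pre-tokenized chunks.

```python
import math
from collections import Counter

K1, B = 1.5, 0.75  # parameters quoted in the scoring section


def bm25_scores(query: list[str], docs: list[list[str]]) -> list[float]:
    """Score every tokenized doc against the tokenized query with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of docs containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            # Assumed IDF variant (the +1 keeps it non-negative); the source
            # does not say which IDF form it uses.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            denom = tf[term] + K1 * (1 - B + B * len(d) / avgdl)
            score += idf * tf[term] * (K1 + 1) / denom
        scores.append(score)
    return scores
```

Sorting chunk ids by these scores in descending order gives the lane ranking that is later passed to fusion. Only the rank order matters downstream, not the raw score.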

4. Graph expansion (graph lane)

1-hop traversal across the knowledge graph: source ↔ project ↔ requirement ↔ conflict ↔ commitment ↔ person, plus supersession edges.
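A 1-hop expansion over such a graph might look like the sketch below. The node ids, edge labels, and the rule of ranking neighbors by how many seed nodes they touch are all assumptions made for illustration; the page does not describe how the graph lane orders its results.

```python
from collections import defaultdict

# Hypothetical edge list; node ids and the "supersedes" label are illustrative.
EDGES = [
    ("src:email-14", "req:R-7", "evidences"),
    ("req:R-7", "proj:P-1", "belongs_to"),
    ("req:R-7", "conflict:C-2", "involved_in"),
    ("src:pdf-3", "req:R-7", "evidences"),
    ("src:email-14", "src:pdf-3", "supersedes"),
]


def build_adjacency(edges):
    """Index edges in both directions so traversal is undirected (the ↔ in the text)."""
    adj = defaultdict(set)
    for a, b, _label in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def expand_one_hop(seeds, adj):
    """Return neighbors of the seed nodes that are not seeds themselves, ranked by
    how many seeds they touch (ties broken by node id for determinism)."""
    hits = defaultdict(int)
    for s in seeds:
        for nb in adj.get(s, ()):
            if nb not in seeds:
                hits[nb] += 1
    return sorted(hits, key=lambda n: (-hits[n], n))
```

Seeding with the two source nodes returns only the requirement they both evidence, which is the kind of shared context the other two lanes cannot surface from text alone.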

5. Reciprocal Rank Fusion

The three lane rankings are fused with RRF, so a result ranked highly in any one lane rises in the fused list. Because RRF uses only ranks, not raw scores, the fusion stays robust when a single lane is noisy.
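As a sketch, RRF over the lane rankings takes only a few lines. The `k = 60` constant is the one quoted in the scoring section; the function name and the tie-break by id are assumptions added for determinism.

```python
from collections import defaultdict

RRF_K = 60  # the k quoted in the scoring section


def rrf_fuse(rankings: list[list[str]], k: int = RRF_K) -> list[tuple[str, float]]:
    """Fuse ranked id lists; ranks are 1-based, absent docs contribute nothing."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

For example, `rrf_fuse([["a", "b", "c"], ["b", "a"], ["c", "b"]])` puts `b` first: it appears in all three lanes, so it collects three reciprocal-rank terms while `a` and `c` collect two each.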

6. Temporal provenance re-rank

Fused scores are multiplied by a recency decay and a supersession penalty so currently-governing evidence beats stale/superseded sources.
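The re-rank can be sketched directly from the constants in the scoring section. One assumption is labeled in the code: the penalty is taken to be 1.0 for sources that have not been superseded, which the page implies but does not state. The function names are illustrative.

```python
import math

RECENCY_FLOOR = 0.45
DECAY_DAYS = 30.0
SUPERSEDED_PENALTY = 0.34


def recency(age_days: float) -> float:
    """Exponential decay with a floor, per recency(d) in the scoring section."""
    return max(RECENCY_FLOOR, math.exp(-age_days / DECAY_DAYS))


def final_score(rrf: float, age_days: float, superseded: bool) -> float:
    """RRF x (0.55 + 0.45*recency) x penalty; the penalty is assumed to be 1.0
    for sources that are not superseded."""
    penalty = SUPERSEDED_PENALTY if superseded else 1.0
    return rrf * (0.55 + 0.45 * recency(age_days)) * penalty
```

Two properties follow from these constants. The decay halves at 30·ln 2 ≈ 20.8 days and hits the 0.45 floor at about 24 days, so the recency factor only ever ranges from 1.0 down to 0.55 + 0.45·0.45 = 0.7525. The supersession penalty is the stronger lever: a superseded source needs roughly three times the fused score of a governing source of the same age to outrank it.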

7. Grounded answer or abstain

Answers cite only governing evidence. When evidence coverage falls below a threshold, the model abstains rather than guessing (selective prediction).
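A minimal sketch of the abstention gate, with loud assumptions: the threshold value, the `Evidence` fields, and the way coverage is aggregated (best single item) are hypothetical, since the page gives no numbers or definitions for them. Only the cap of four citations comes from the figures on this page.

```python
from dataclasses import dataclass

COVERAGE_THRESHOLD = 0.6  # hypothetical value; the source does not state one
MAX_CITATIONS = 4         # matches the "≤ 4 citations / answer" figure


@dataclass
class Evidence:
    source_id: str
    governing: bool   # False when the source has been superseded
    coverage: float   # hypothetical 0..1 estimate of how well it supports the query


def answer_or_abstain(evidence: list[Evidence]) -> dict:
    """Cite up to MAX_CITATIONS governing items, or abstain if coverage is too low."""
    governing = sorted(
        (e for e in evidence if e.governing),
        key=lambda e: e.coverage,
        reverse=True,
    )[:MAX_CITATIONS]
    # Assumed aggregate: the best single coverage value among cited items.
    best = max((e.coverage for e in governing), default=0.0)
    if best < COVERAGE_THRESHOLD:
        return {"abstain": True, "citations": []}
    return {"abstain": False, "citations": [e.source_id for e in governing]}
```

Note that a superseded source never rescues an answer in this sketch: even with high coverage it is filtered out before the threshold check, so a query supported only by superseded material results in abstention.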

Scoring functions

BM25(q,d)   = Σ_{t∈q} idf(t)·(tf·(k1+1)) / (tf + k1·(1−b + b·|d|/avgdl))
              tf = frequency of t in d,  k1 = 1.5,  b = 0.75

cosine(q,d) = q·d / (‖q‖‖d‖)        dim = 96

RRF(d)      = Σ_lanes 1 / (k + rank_lane)
              k = 60

recency(d)  = max(0.45, e^(−Δdays/30))
              half-life ≈ 21d,  floor 0.45

final(d)    = RRF(d) × (0.55 + 0.45·recency(d)) × supersededPenalty(d)
              supersededPenalty = 0.34 if d is superseded, else 1

Lanes: 3 · Synonym clusters: 11 · Synonym terms: 65 · Stopwords: 54

Live index

Corpus chunks indexed

Tokens indexed: 2,164

Vocabulary (unique terms): 620

Avg chunk length (tokens)

Embedding dimensions

Knowledge-graph nodes

Knowledge-graph edges

Supersession edges

Evidence spans

Requirements tracked

Conflicts modeled

Projects / people: 20 / 23

Truth numbers — grounding behaviour

How disciplined the answers are, measured against the graph.

Grounding rate: 100% (requirements with a cited source)

Governing sources (currently authoritative)

Superseded (demoted by provenance)

Provenance coverage: 14% (corpus in a supersession chain)

Avg evidence conf.: 79% (extraction confidence)

Citations / answer: ≤ 4 (governing only · abstains if unsupported)

Techniques & literature

Retrieval-Augmented Generation

Lewis et al., 2020 — RAG for knowledge-intensive NLP

Okapi BM25

Robertson & Zaragoza, 2009 — The Probabilistic Relevance Framework

Reciprocal Rank Fusion

Cormack, Clarke & Büttcher, 2009 — outperforming Condorcet & individual rank learning

Dense Passage Retrieval

Karpukhin et al., 2020 — dense bi-encoder retrieval

Approximate NN (HNSW)

Malkov & Yashunin, 2018 — hierarchical navigable small-world graphs

Graph RAG

Microsoft Research, 2024 — graph-structured retrieval for global questions

Selective prediction / abstention

El-Yaniv & Wiener, 2010 — the reject option

Provenance-aware retrieval

temporal supersession re-ranking over an evidence graph

The implementation here is a deterministic, backend-free reference model of the pipeline above, built for demo fidelity. The production target swaps the hashed embedding for a dense bi-encoder with an ANN index, and the in-memory graph for a property-graph store, while preserving the same fusion and provenance-aware re-ranking contract.
