I. Static Knowledge vs. Dynamic Grounding
Large Language Models suffer from a temporal cutoff: their internal knowledge is frozen at training time. Lewis et al. (2020) introduced Retrieval-Augmented Generation (RAG) to address this by giving the model a "contextual window" onto the external world.
In this paradigm, the LLM is treated not as a database but as a reasoning engine. We retrieve relevant documents from a private or live dataset and inject them into the prompt, which reduces hallucinations by grounding answers in retrieved evidence and enables the use of proprietary data the model never saw during training.
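A minimal sketch of this retrieve-then-read loop in plain Python is below. The embed() function is a toy bag-of-words stand-in for a real embedding model, and llm-side completion is left out; both the corpus and the helper names are illustrative assumptions, not any particular library's API.

```python
# Retrieve-then-read: rank documents against the query, then inject the
# top hits into the prompt. embed() is a toy stand-in for a dense model.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; real systems use a dense encoder.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    context = "\n\n".join(docs)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

corpus = [
    "Our Q3 refund policy allows returns within 30 days.",
    "The API rate limit was raised to 500 requests/minute in June.",
    "Employee onboarding takes place every other Monday.",
]
question = "What is the current API rate limit?"
prompt = build_prompt(question, retrieve(question, corpus))
print(prompt)  # In production, this prompt is sent to the LLM.
```

Swapping the toy embed() for a dense embedding model and the linear scan for a vector index (see the HNSW and vector-database references below) turns this sketch into a production retriever.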
Data Freshness
Linking models to real-time APIs and document stores without re-training.
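As a sketch of that pattern, the snippet below fetches fresh data at query time and places it in the prompt, leaving the model's weights untouched. fetch_live_inventory() and its payload are hypothetical stand-ins for a real API or document store.

```python
# Grounding a prompt in live data rather than in model weights.
import json

def fetch_live_inventory() -> dict:
    # Hypothetical stand-in: a production system would call a live
    # endpoint here, with auth, timeouts, and error handling.
    return {"as_of": "2024-06-01T12:00:00Z",
            "low_stock": ["widget-a", "gasket-9"]}

def grounded_prompt(question: str, live_data: dict) -> str:
    # The model itself is unchanged; only the prompt carries fresh facts.
    return (f"Use only the data below, current as of this request.\n\n"
            f"Data: {json.dumps(live_data)}\n\n"
            f"Question: {question}\nAnswer:")

print(grounded_prompt("Which items are low on stock?", fetch_live_inventory()))
```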
Primary Sources & Further Reading
- Lewis et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
- Gao et al. (2022). Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE).
- Pinecone Engineering. Vector Databases: A Beginner's Guide.
- Malkov & Yashunin (2018). Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs (HNSW).
- Nomic AI (2024). Matryoshka Embeddings: Dynamic Dimensions.
- Hugging Face (2024). LangChain and LlamaIndex Architecture Guides.