Part 1: Foundational Concepts - RAG Assignment Solutions

Question 1: Define RAG and describe how it improves generative model responses compared to an LLM without retrieval

Definition of RAG

Retrieval-Augmented Generation (RAG) is an AI architecture that enhances Large Language Models (LLMs) by combining them with an external knowledge retrieval system. Instead of relying solely on the knowledge encoded in the model’s parameters during training, RAG dynamically retrieves relevant information from external sources (documents, databases, knowledge bases) and injects this context into the generation process.

Architecture Overview

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                        RAG Pipeline                             β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  User Query ──► Retriever ──► Relevant Documents ──► LLM ──► Answer
β”‚                    β”‚                                  β–²         β”‚
β”‚                    β–Ό                                  β”‚         β”‚
β”‚             Knowledge Base          Context Injection β”‚         β”‚
β”‚            (Vector Store)                             β”‚         β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

How RAG Improves Responses Compared to Plain LLMs

| Aspect | Plain LLM | RAG-Enhanced LLM |
|---|---|---|
| Knowledge Source | Static, frozen at training time | Dynamic, can access up-to-date information |
| Factual Accuracy | May generate plausible but incorrect facts (β€œhallucinations”) | Grounds responses in retrieved evidence |
| Domain Specificity | Limited to general training data | Can access specialized/proprietary knowledge bases |
| Verifiability | Cannot cite sources | Can reference specific documents |
| Update Mechanism | Requires expensive retraining | Simply update the knowledge base |
| Cost | Needs larger models for more knowledge | Knowledge scales with the database, not model size |

Key Improvements

  1. Knowledge Grounding: RAG grounds the LLM’s responses in actual retrieved documents, ensuring answers are based on real information rather than statistical patterns in training data.

  2. Reduced Hallucinations: By providing explicit context, the model is constrained to generate responses consistent with the retrieved information, significantly reducing the likelihood of fabricated facts.

  3. Up-to-Date Information: The knowledge base can be continuously updated without retraining the model, allowing the system to provide current information about recent events, new research, or changing data.

  4. Domain Expertise: Organizations can build RAG systems using their proprietary documents (manuals, research papers, internal wikis), enabling the LLM to answer questions about specialized domains it was never explicitly trained on.

  5. Transparency and Trust: RAG systems can provide citations and source documents, allowing users to verify the information and building trust in the system’s responses.
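
To make the contrast concrete, here is a minimal sketch of context injection, assuming LangChain-style llm and retriever objects like those used in the examples below (the refund-policy question is hypothetical):

# Plain LLM: answers from parametric knowledge alone (may hallucinate)
plain_answer = llm.invoke("What changed in our refund policy this year?")

# RAG: retrieved evidence is injected into the prompt before generation
docs = retriever.invoke("What changed in our refund policy this year?")
context = "\n\n".join(d.page_content for d in docs)
rag_answer = llm.invoke(
    f"Answer using ONLY this context:\n{context}\n\n"
    "Question: What changed in our refund policy this year?"
)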


Question 2: List and briefly explain the core stages of a RAG pipeline (indexing, retrieval, augmentation, generation)

The Four Core Stages

                    INDEXING (Offline)
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚  Documents β†’ Chunking β†’ Embedding β†’ Storage   β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           β”‚
                           β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚              Vector Database                  β”‚
    β”‚         (Chroma, Pinecone, FAISS)             β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           β”‚
                           β–Ό
                    RETRIEVAL (Online)
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚  Query β†’ Embed β†’ Similarity Search β†’ Top-K    β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           β”‚
                           β–Ό
                   AUGMENTATION (Online)
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚  Retrieved Docs + Query β†’ Prompt Construction β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           β”‚
                           β–Ό
                   GENERATION (Online)
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚      Augmented Prompt β†’ LLM β†’ Final Answer    β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Stage 1: Indexing

Purpose: Prepare documents for efficient semantic search.

Process:

  1. Document Collection: Gather raw text from various sources (PDFs, web pages, databases, documents)
  2. Text Chunking: Split documents into smaller, manageable pieces (typically 100-1000 tokens) because:
    • LLMs have context length limits
    • Smaller chunks enable more precise retrieval
    • Overlapping chunks preserve context across boundaries
  3. Embedding Generation: Convert each chunk into a dense vector representation using an embedding model (e.g., sentence-transformers/all-MiniLM-L6-v2)
  4. Vector Storage: Store embeddings in a vector database (Chroma, FAISS, Pinecone) with metadata for efficient similarity search

Example:

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# Chunking: split documents into overlapping pieces
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = splitter.split_documents(documents)

# Embedding + Storage: encode each chunk and store it for similarity search
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(chunks, embeddings)

Stage 2: Retrieval

Purpose: Find the most relevant document chunks for a given query.

Process:

  1. Query Embedding: Convert the user’s question into a vector using the same embedding model
  2. Similarity Search: Compare the query vector against all stored document vectors
  3. Top-K Selection: Return the K most similar chunks (typically K=3-10)
  4. Ranking: Optionally re-rank results using more sophisticated methods

Similarity Metrics:

  • Cosine Similarity: Measures angle between vectors (most common)
  • Euclidean Distance: Measures direct distance between vectors
  • Dot Product: Combines magnitude and direction
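
For illustration, the three metrics can be computed directly with NumPy (toy vectors, not real embeddings):

import numpy as np

q = np.array([0.2, 0.8, 0.1])   # toy query embedding
d = np.array([0.3, 0.7, 0.0])   # toy document embedding

cosine = np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d))  # angle only; 1.0 = same direction
euclidean = np.linalg.norm(q - d)                                # straight-line distance; 0.0 = identical
dot_product = np.dot(q, d)                                       # combines magnitude and direction
print(cosine, euclidean, dot_product)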

Example:

retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
relevant_docs = retriever.invoke("What is RAG?")

Stage 3: Augmentation

Purpose: Construct a context-enriched prompt for the LLM.

Process:

  1. Context Formatting: Combine retrieved chunks into a coherent context block
  2. Prompt Construction: Create a structured prompt that includes:
    • System instructions (role, constraints)
    • Retrieved context (evidence to use)
    • User question (what to answer)
  3. Context Ordering: Optionally order chunks by relevance or recency

Prompt Design Considerations:

  • Clearly separate context from question
  • Instruct model to use ONLY provided context
  • Include fallback instructions (β€œIf not in context, say β€˜I don’t know’”)

Example:

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
 
prompt = f"""You are an assistant that answers questions using ONLY the provided context.
 
Context:
{format_docs(retrieved_docs)}
 
Question: {user_question}
 
Answer:"""

Stage 4: Generation

Purpose: Produce the final answer using the augmented prompt.

Process:

  1. LLM Invocation: Send the augmented prompt to the language model
  2. Response Generation: Model generates answer grounded in the provided context
  3. Output Parsing: Extract and format the response
  4. Optional Post-Processing: Validate, filter, or enhance the response

Key Parameters:

  • Temperature: Controls randomness (0 for deterministic, >0 for creativity)
  • Max Tokens: Limits response length
  • Top-P / Top-K: Controls token sampling

Example:

from langchain_groq import ChatGroq

# temperature=0 makes the output as deterministic as possible
llm = ChatGroq(model="llama-3.1-8b-instant", temperature=0)
response = llm.invoke(augmented_prompt)
answer = response.content

Question 3: Explain why RAG can reduce β€œhallucinations” compared to plain generative models β€” and why it doesn’t eliminate them entirely

What Are Hallucinations?

Hallucinations in LLMs refer to generated content that is:

  • Factually incorrect
  • Fabricated (doesn’t exist in reality)
  • Inconsistent with reliable sources
  • Plausible-sounding but false

Example: An LLM might confidently state that β€œAlbert Einstein invented the telephone in 1895” – this sounds plausible but is completely false.

Why RAG Reduces Hallucinations

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    HALLUCINATION REDUCTION                      β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  Plain LLM:          Question ──────────────► Answer            β”‚
β”‚                                               (Unanchored)      β”‚
β”‚                                                                 β”‚
β”‚  RAG:                Question ──► Context ──► Answer            β”‚
β”‚                                     β–²         (Grounded)        β”‚
β”‚                                     β”‚                           β”‚
β”‚                              Evidence Base                      β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Mechanisms that Reduce Hallucinations:

  1. Explicit Evidence Grounding

    • The LLM is provided with specific text excerpts to reference
    • Answers are constrained by the retrieved content
    • Model has less β€œfreedom” to fabricate
  2. Prompt Instructions

    • System prompts explicitly instruct: β€œAnswer ONLY based on the provided context”
    • Fallback instructions: β€œIf the information is not in the context, say β€˜I don’t know’”
    • These instructions constrain the model’s behavior and reduce fabrication
  3. Reduced Reliance on Parametric Knowledge

    • Plain LLMs rely entirely on knowledge encoded in weights
    • RAG shifts reliance to external, verifiable sources
    • External sources can be audited and updated
  4. Source Attribution

    • RAG can cite which documents informed the answer
    • Users can verify claims against source material
    • Creates accountability and transparency
  5. Current Information

    • Training data has a cutoff date; RAG can access recent information
    • Reduces errors from outdated knowledge

Why RAG Doesn’t Eliminate Hallucinations Entirely

Despite its benefits, RAG is not a complete solution:

| Limitation | Explanation |
|---|---|
| Retrieval Failures | If relevant documents aren’t retrieved, the model may still hallucinate to fill gaps |
| Context Window Limits | Only K documents can be included; important information may be excluded |
| Irrelevant Retrieval | Semantic similarity doesn’t guarantee relevance; retrieved docs may be off-topic |
| Chunk Boundary Issues | Important information split across chunks may lose context |
| Model Behavior | LLMs may still override context with parametric knowledge, especially if confident |
| Context Ignoring | Models sometimes ignore or misinterpret provided context, especially in long prompts |
| Synthesis Errors | When combining information from multiple sources, the model may create incorrect syntheses |
| Outdated Knowledge Base | If the vector store contains outdated information, it will propagate to answers |
| Adversarial Content | Malicious or incorrect documents in the knowledge base will be retrieved and used |

Mitigation Strategies

  1. Improve Retrieval Quality

    • Use hybrid retrieval (BM25 + dense embeddings)
    • Implement re-ranking with cross-encoders (see the sketch after this list)
    • Increase K for broader context
  2. Better Prompting

    • Explicit uncertainty instructions
    • Request confidence levels
    • Chain-of-thought reasoning
  3. Advanced RAG Techniques

    • Rewrite-Retrieve-Read (RRR): Rewrite the user query before retrieval to improve matches
    • Self-RAG: The model critiques its own retrievals and generations and decides when to retrieve
    • CRAG (Corrective RAG): Evaluate retrieved documents and trigger corrective retrieval when they are weak
  4. Knowledge Base Curation

    • Regular updates and validation
    • Source quality control
    • Deduplication and conflict resolution
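
As a concrete illustration of the re-ranking idea from strategy 1, the sketch below assumes the sentence-transformers library, its publicly available ms-marco cross-encoder checkpoint, and a retriever like the one built in Stage 2 of Question 2:

from sentence_transformers import CrossEncoder

query = "What is RAG?"
chunks = [doc.page_content for doc in retriever.invoke(query)]

# The cross-encoder scores each (query, chunk) pair jointly; higher = more relevant
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, chunk) for chunk in chunks])

# Keep only the top-scoring chunks for the final prompt
reranked = [chunk for _, chunk in sorted(zip(scores, chunks), reverse=True)][:3]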

Question 4: Compare two retrieval methods (BM25 vs Dense Embeddings) and discuss how they affect RAG performance

Overview of Retrieval Methods

| Aspect | BM25 (Sparse/Lexical) | Dense Embeddings (Semantic) |
|---|---|---|
| Type | Statistical/keyword-based | Neural network-based |
| Representation | Sparse vectors (term frequencies) | Dense vectors (learned representations) |
| Matching | Exact/partial word matches | Semantic meaning similarity |
| Vocabulary | Depends on exact terms | Understands synonyms/paraphrases |
| Computation | Fast, lightweight | Requires an embedding model |
| Index Size | Smaller (inverted index) | Larger (all document vectors) |

BM25 (Best Matching 25)

Algorithm Overview: BM25 is a probabilistic retrieval function that ranks documents based on term frequency (TF) and inverse document frequency (IDF).

BM25 Score = Ξ£ IDF(qi) Γ— [f(qi,D) Γ— (k1 + 1)] / [f(qi,D) + k1 Γ— (1 - b + b Γ— |D|/avgdl)]

Where:
- qi = query terms
- f(qi,D) = term frequency in document
- |D| = document length
- avgdl = average document length
- k1, b = tuning parameters
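
As a hedged illustration of this formula, the following pure-Python sketch scores a single tokenized document against a query over a toy corpus (real systems use an inverted index rather than scanning every document):

import math
from collections import Counter

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query using the BM25 formula above."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N           # average document length
    tf = Counter(doc)                                 # term frequencies in this document
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)      # number of documents containing the term
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        f = tf[term]
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

# Toy corpus: each document is a list of tokens
corpus = [["rag", "retrieves", "documents"], ["llms", "generate", "text"]]
print(bm25_score(["rag", "documents"], corpus[0], corpus))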

Strengths:

  • βœ… Exact Match Precision: Excellent when exact terminology matters (legal documents, code, technical specs)
  • βœ… No Training Required: Works out-of-the-box without neural networks
  • βœ… Fast and Scalable: Inverted index enables sub-second queries over millions of documents
  • βœ… Interpretable: Easy to understand why documents were retrieved
  • βœ… Handles Rare Terms Well: Specific/rare terms get high IDF weight

Weaknesses:

  • ❌ Vocabulary Mismatch: β€œautomobile” won’t match β€œcar”
  • ❌ No Semantic Understanding: β€œbank” (financial) matches β€œriver bank”
  • ❌ Sensitive to Query Formulation: Requires users to guess document terminology
  • ❌ No Cross-Lingual Support: Only works within one language

Example:

from langchain_community.retrievers import BM25Retriever
 
bm25_retriever = BM25Retriever.from_documents(documents)
results = bm25_retriever.invoke("What is RAG?")

Dense Embeddings

Algorithm Overview: Dense embeddings use neural networks (e.g., Sentence Transformers) to encode text into fixed-dimensional vector spaces where semantic similarity corresponds to vector proximity.

                    Sentence Transformer
Query: "How does RAG work?"  ────────►  [0.23, -0.15, 0.87, ...]
                                              β”‚
                                              β–Ό Cosine Similarity
Document: "RAG retrieves..."  ────────►  [0.21, -0.18, 0.85, ...]

Strengths:

  • βœ… Semantic Understanding: Understands synonyms, paraphrases, related concepts
  • βœ… Query Flexibility: Users can phrase questions naturally
  • βœ… Cross-Lingual: Multilingual models can match across languages
  • βœ… Concept Matching: β€œmachine learning” matches β€œartificial intelligence”
  • βœ… Dense Representation: Captures nuanced meaning

Weaknesses:

  • ❌ Computational Cost: Embedding generation is slower and typically benefits from a GPU
  • ❌ Model Dependency: Quality depends on embedding model choice
  • ❌ Rare Terms: May miss highly specific or technical terms
  • ❌ Index Size: Dense vectors require more storage
  • ❌ False Positives: May retrieve semantically similar but irrelevant content

Example:

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
 
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(documents, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
results = retriever.invoke("How does RAG work?")

Impact on RAG Performance

| Scenario | Better Method | Reason |
|---|---|---|
| Technical documentation (exact terms) | BM25 | Precise terminology matching |
| Customer support (varied phrasing) | Dense Embeddings | Handles question variations |
| Legal/medical domains | Hybrid | Precision AND semantic understanding |
| Code search | BM25 | Exact syntax matching |
| Conceptual questions | Dense Embeddings | Captures semantic relationships |
| Low-resource deployment | BM25 | No GPU required |
| Multilingual applications | Dense Embeddings | Cross-lingual capabilities |

Hybrid Retrieval: Best of Both Worlds

Modern RAG systems often combine both methods:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    HYBRID RETRIEVAL                             β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  Query ──┬──► BM25 Retriever ──────► Top-K₁ Documents           β”‚
β”‚          β”‚                                  β”‚                   β”‚
β”‚          β”‚                                  β–Ό                   β”‚
β”‚          β”‚                           Reciprocal Rank            β”‚
β”‚          β”‚                              Fusion                  β”‚
β”‚          β”‚                                  β–²                   β”‚
β”‚          β”‚                                  β”‚                   β”‚
β”‚          └──► Dense Retriever ─────► Top-Kβ‚‚ Documents           β”‚
β”‚                                                                 β”‚
β”‚                      Final Ranked Documents                     β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Fusion Methods:

  1. Reciprocal Rank Fusion (RRF): Combines rankings from multiple retrievers
  2. Weighted Combination: Assign weights to each method’s scores
  3. Re-ranking: Use a cross-encoder to re-rank combined results

Example:

from langchain.retrievers import EnsembleRetriever
 
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.4, 0.6]  # 40% BM25, 60% Dense
)
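
Reciprocal Rank Fusion itself is simple enough to sketch directly; the version below assumes each retriever returns an ordered list of document IDs (the IDs shown are hypothetical):

from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked highly by multiple retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from BM25 and a dense retriever
bm25_ranking = ["doc3", "doc1", "doc7"]
dense_ranking = ["doc1", "doc4", "doc3"]
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))  # doc1 and doc3 come first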

Recommendations for RAG System Design

  1. Start with Dense Embeddings: Better default for most QA use cases
  2. Add BM25 for Precision: When exact terminology matters
  3. Use Hybrid for Production: Combines strengths of both
  4. Tune K Carefully: More documents = more context but also more noise
  5. Consider Re-ranking: Cross-encoders can significantly improve relevance
  6. Evaluate on Your Data: Performance varies by domain and query types

These answers provide a foundational understanding of RAG systems, as required for Assignment 2.