Part 1: Foundational Concepts - RAG Assignment Solutions
Question 1: Define RAG and describe how it improves generative model responses compared to an LLM without retrieval
Definition of RAG
Retrieval-Augmented Generation (RAG) is an AI architecture that enhances Large Language Models (LLMs) by combining them with an external knowledge retrieval system. Instead of relying solely on the knowledge encoded in the model's parameters during training, RAG dynamically retrieves relevant information from external sources (documents, databases, knowledge bases) and injects this context into the generation process.
Architecture Overview
    RAG Pipeline

    User Query --> Retriever --> Relevant Documents --> LLM --> Answer
                       |                  |              ^
                       v                  v              |
                Knowledge Base     Context Injection ----+
                (Vector Store)
How RAG Improves Responses Compared to Plain LLMs
| Aspect | Plain LLM | RAG-Enhanced LLM |
|---|---|---|
| Knowledge Source | Static, frozen at training time | Dynamic, can access up-to-date information |
| Factual Accuracy | May generate plausible but incorrect facts ("hallucinations") | Grounds responses in retrieved evidence |
| Domain Specificity | Limited to general training data | Can access specialized/proprietary knowledge bases |
| Verifiability | Cannot cite sources | Can reference specific documents |
| Update Mechanism | Requires expensive retraining | Simply update the knowledge base |
| Cost | Needs larger models for more knowledge | Knowledge scales with database, not model size |
Key Improvements
- Knowledge Grounding: RAG grounds the LLM's responses in actual retrieved documents, ensuring answers are based on real information rather than statistical patterns in training data.
- Reduced Hallucinations: By providing explicit context, the model is constrained to generate responses consistent with the retrieved information, significantly reducing the likelihood of fabricated facts.
- Up-to-Date Information: The knowledge base can be continuously updated without retraining the model, allowing the system to provide current information about recent events, new research, or changing data.
- Domain Expertise: Organizations can build RAG systems using their proprietary documents (manuals, research papers, internal wikis), enabling the LLM to answer questions about specialized domains it was never explicitly trained on.
- Transparency and Trust: RAG systems can provide citations and source documents, allowing users to verify the information and building trust in the system's responses.
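To make this flow concrete, here is a minimal end-to-end sketch using the same stack as the stage-by-stage examples later in these answers (HuggingFace sentence-transformer embeddings, Chroma, and a Groq-hosted Llama model). It is an illustration under those assumptions, not a prescribed implementation, and it assumes the packages are installed and a GROQ_API_KEY is available in the environment.
```python
# Minimal end-to-end RAG sketch (illustrative; model and package choices are assumptions).
from langchain_core.documents import Document
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_groq import ChatGroq

# 1. Index a tiny knowledge base (normally this would be many chunked documents).
docs = [Document(page_content="RAG retrieves external documents and passes them to an LLM as context.")]
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
retriever = Chroma.from_documents(docs, embeddings).as_retriever(search_kwargs={"k": 1})

# 2. Retrieve relevant chunks and build an augmented prompt.
question = "What does RAG do?"
context = "\n\n".join(d.page_content for d in retriever.invoke(question))
prompt = f"Answer using ONLY this context:\n{context}\n\nQuestion: {question}"

# 3. Generate a grounded answer.
llm = ChatGroq(model="llama-3.1-8b-instant", temperature=0)
print(llm.invoke(prompt).content)
```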
Question 2: List and briefly explain the core stages of a RAG pipeline (indexing, retrieval, augmentation, generation)
The Four Core Stages
    INDEXING (Offline)
      Documents -> Chunking -> Embedding -> Storage
            |
            v
      Vector Database (Chroma, Pinecone, FAISS)
            |
            v
    RETRIEVAL (Online)
      Query -> Embed -> Similarity Search -> Top-K
            |
            v
    AUGMENTATION (Online)
      Retrieved Docs + Query -> Prompt Construction
            |
            v
    GENERATION (Online)
      Augmented Prompt -> LLM -> Final Answer
Stage 1: Indexing
Purpose: Prepare documents for efficient semantic search.
Process:
- Document Collection: Gather raw text from various sources (PDFs, web pages, databases, documents)
- Text Chunking: Split documents into smaller, manageable pieces (typically 100-1000 tokens) because:
- LLMs have context length limits
- Smaller chunks enable more precise retrieval
- Overlapping chunks preserve context across boundaries
- Embedding Generation: Convert each chunk into a dense vector representation using an embedding model (e.g., sentence-transformers/all-MiniLM-L6-v2)
- Vector Storage: Store embeddings in a vector database (Chroma, FAISS, Pinecone) with metadata for efficient similarity search
Example:
# Imports for the indexing stage
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# Chunking
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = splitter.split_documents(documents)

# Embedding + Storage
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(chunks, embeddings)

Stage 2: Retrieval
Purpose: Find the most relevant document chunks for a given query.
Process:
- Query Embedding: Convert the user's question into a vector using the same embedding model
- Similarity Search: Compare the query vector against all stored document vectors
- Top-K Selection: Return the K most similar chunks (typically K=3-10)
- Ranking: Optionally re-rank results using more sophisticated methods
Similarity Metrics (each is computed in the short sketch after this list):
- Cosine Similarity: Measures angle between vectors (most common)
- Euclidean Distance: Measures direct distance between vectors
- Dot Product: Combines magnitude and direction
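To make the metrics concrete, the sketch below computes all three on one query/document pair, reusing the embedding model from the indexing example (the example sentences are placeholders):
```python
# Compare the three similarity metrics on a query and a document chunk.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
q, d = model.encode(["What is RAG?", "RAG retrieves documents before generating an answer."])

cosine = np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d))  # angle only, length-invariant
euclidean = np.linalg.norm(q - d)                                # straight-line distance (lower = closer)
dot = np.dot(q, d)                                               # direction and magnitude combined

print(f"cosine={cosine:.3f}  euclidean={euclidean:.3f}  dot={dot:.3f}")
```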
Example:
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
relevant_docs = retriever.invoke("What is RAG?")
Stage 3: Augmentation
Purpose: Construct a context-enriched prompt for the LLM.
Process:
- Context Formatting: Combine retrieved chunks into a coherent context block
- Prompt Construction: Create a structured prompt that includes:
- System instructions (role, constraints)
- Retrieved context (evidence to use)
- User question (what to answer)
- Context Ordering: Optionally order chunks by relevance or recency
Prompt Design Considerations:
- Clearly separate context from question
- Instruct model to use ONLY provided context
- Include fallback instructions ("If not in context, say 'I don't know'")
Example:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
prompt = f"""You are an assistant that answers questions using ONLY the provided context.
Context:
{format_docs(retrieved_docs)}
Question: {user_question}
Answer:"""
Stage 4: Generation
Purpose: Produce the final answer using the augmented prompt.
Process:
- LLM Invocation: Send the augmented prompt to the language model
- Response Generation: Model generates answer grounded in the provided context
- Output Parsing: Extract and format the response
- Optional Post-Processing: Validate, filter, or enhance the response
Key Parameters:
- Temperature: Controls randomness (0 for deterministic, >0 for creativity)
- Max Tokens: Limits response length
- Top-P / Top-K: Controls token sampling
Example:
from langchain_groq import ChatGroq

llm = ChatGroq(model="llama-3.1-8b-instant", temperature=0)
response = llm.invoke(augmented_prompt)
answer = response.content

Question 3: Explain why RAG can reduce "hallucinations" compared to plain generative models, and why it doesn't eliminate them entirely
What Are Hallucinations?
Hallucinations in LLMs refer to generated content that is:
- Factually incorrect
- Fabricated (doesnβt exist in reality)
- Inconsistent with reliable sources
- Plausible-sounding but false
Example: An LLM might confidently state that "Albert Einstein invented the telephone in 1895"; this sounds plausible but is completely false.
Why RAG Reduces Hallucinations
    HALLUCINATION REDUCTION

    Plain LLM:  Question -----------------------> Answer   (unanchored)

    RAG:        Question --> Context --> Answer            (grounded)
                                ^
                                |
                          Evidence Base
Mechanisms that Reduce Hallucinations:
- Explicit Evidence Grounding
  - The LLM is provided with specific text excerpts to reference
  - Answers are constrained by the retrieved content
  - The model has less "freedom" to fabricate
- Prompt Instructions
  - System prompts explicitly instruct: "Answer ONLY based on the provided context"
  - Fallback instructions: "If the information is not in the context, say 'I don't know'"
  - These constraints behaviorally limit hallucination
- Reduced Reliance on Parametric Knowledge
  - Plain LLMs rely entirely on knowledge encoded in weights
  - RAG shifts reliance to external, verifiable sources
  - External sources can be audited and updated
- Source Attribution (see the short sketch after this list)
  - RAG can cite which documents informed the answer
  - Users can verify claims against source material
  - Creates accountability and transparency
- Current Information
  - Training data has a cutoff date; RAG can access recent information
  - Reduces errors from outdated knowledge
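As a small illustration of the source-attribution mechanism, retrieved chunks carry metadata that can be surfaced alongside the answer. This sketch assumes the retriever built in the Question 2 examples; the "source" metadata key is the usual LangChain loader default and may differ depending on how the documents were loaded.
```python
# Show which documents informed the answer, so users can verify claims.
docs = retriever.invoke("What is RAG?")
for doc in docs:
    print(doc.metadata.get("source", "unknown"), "->", doc.page_content[:80])
```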
Why RAG Doesn't Eliminate Hallucinations Entirely
Despite its benefits, RAG is not a complete solution:
| Limitation | Explanation |
|---|---|
| Retrieval Failures | If relevant documents aren't retrieved, the model may still hallucinate to fill gaps |
| Context Window Limits | Only K documents can be included; important information may be excluded |
| Irrelevant Retrieval | Semantic similarity doesn't guarantee relevance; retrieved docs may be off-topic |
| Chunk Boundary Issues | Important information split across chunks may lose context |
| Model Behavior | LLMs may still override context with parametric knowledge, especially if confident |
| Context Ignoring | Models sometimes ignore or misinterpret provided context, especially in long prompts |
| Synthesis Errors | When combining information from multiple sources, the model may create incorrect syntheses |
| Outdated Knowledge Base | If the vector store contains outdated information, it will propagate to answers |
| Adversarial Content | Malicious or incorrect documents in the knowledge base will be retrieved and used |
Mitigation Strategies
- Improve Retrieval Quality
  - Use hybrid retrieval (BM25 + dense embeddings)
  - Implement re-ranking with cross-encoders
  - Increase K for broader context
- Better Prompting
  - Explicit uncertainty instructions
  - Request confidence levels
  - Chain-of-thought reasoning
- Advanced RAG Techniques
  - RRR (Rewrite-Retrieve-Read): Rewrite queries for better retrieval (see the sketch after this list)
  - Self-RAG: The model decides when to retrieve and critiques its own generations
  - CRAG (Corrective RAG): Assess the quality of retrieved documents and correct poor retrievals
- Knowledge Base Curation
  - Regular updates and validation
  - Source quality control
  - Deduplication and conflict resolution
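As a sketch of the query-rewriting idea listed above (Rewrite-Retrieve-Read), an LLM can reformulate a vague question before it reaches the retriever. The chain below assumes the same ChatGroq model and the `retriever` object from the Question 2 examples; it is illustrative rather than a prescribed pipeline.
```python
# Rewrite the user's question into a cleaner search query, then retrieve with it.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_groq import ChatGroq

llm = ChatGroq(model="llama-3.1-8b-instant", temperature=0)

rewrite_prompt = ChatPromptTemplate.from_template(
    "Rewrite the following question as a clear, self-contained search query. "
    "Return only the rewritten query.\n\nQuestion: {question}"
)
rewriter = rewrite_prompt | llm | StrOutputParser()

better_query = rewriter.invoke({"question": "how does that retrieval thing cut down on made-up answers?"})
relevant_docs = retriever.invoke(better_query)  # retrieval now runs on the rewritten query
```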
Question 4: Compare two retrieval methods (BM25 vs Dense Embeddings) and discuss how they affect RAG performance
Overview of Retrieval Methods
| Aspect | BM25 (Sparse/Lexical) | Dense Embeddings (Semantic) |
|---|---|---|
| Type | Statistical/Keyword-based | Neural network-based |
| Representation | Sparse vectors (term frequencies) | Dense vectors (learned representations) |
| Matching | Exact/partial word matches | Semantic meaning similarity |
| Vocabulary | Depends on exact terms | Understands synonyms/paraphrases |
| Computation | Fast, lightweight | Requires embedding model |
| Index Size | Smaller (inverted index) | Larger (all document vectors) |
BM25 (Best Matching 25)
Algorithm Overview: BM25 is a probabilistic retrieval function that ranks documents based on term frequency (TF) and inverse document frequency (IDF).
BM25(D, Q) = Σ_i IDF(q_i) × [ f(q_i, D) × (k1 + 1) ] / [ f(q_i, D) + k1 × (1 - b + b × |D| / avgdl) ]
Where:
- q_i = the i-th query term
- f(q_i, D) = frequency of q_i in document D
- |D| = length of document D (in terms)
- avgdl = average document length in the corpus
- k1, b = tuning parameters (term-frequency saturation and document-length normalization)
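To see the scoring function in action, here is a small sketch using the rank_bm25 package (an assumption chosen for illustration; LangChain's BM25Retriever, shown later in this answer, wraps the same algorithm). Tokenization is a naive lowercase whitespace split.
```python
# Score a query against a tiny corpus with BM25 (k1 and b use the library defaults).
from rank_bm25 import BM25Okapi

corpus = [
    "RAG combines retrieval with text generation",
    "BM25 ranks documents using term frequency and inverse document frequency",
    "Dense embeddings capture semantic similarity between sentences",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)
scores = bm25.get_scores("how does bm25 rank documents".split())  # one score per document
print(scores)  # the second document scores highest: it shares the query terms "bm25" and "documents"
```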
Strengths:
- Exact Match Precision: Excellent when exact terminology matters (legal documents, code, technical specs)
- No Training Required: Works out of the box without neural networks
- Fast and Scalable: Inverted index enables sub-second queries over millions of documents
- Interpretable: Easy to understand why documents were retrieved
- Handles Rare Terms Well: Specific/rare terms get high IDF weight
Weaknesses:
- Vocabulary Mismatch: "automobile" won't match "car"
- No Semantic Understanding: "bank" (financial) matches "river bank"
- Sensitive to Query Formulation: Requires users to guess document terminology
- No Cross-Lingual Support: Only works within one language
Example:
from langchain_community.retrievers import BM25Retriever
bm25_retriever = BM25Retriever.from_documents(documents)
results = bm25_retriever.invoke("What is RAG?")
Dense Embeddings
Algorithm Overview: Dense embeddings use neural networks (e.g., Sentence Transformers) to encode text into fixed-dimensional vector spaces where semantic similarity corresponds to vector proximity.
                                    Sentence Transformer
    Query:    "How does RAG work?" -------------------->  [0.23, -0.15, 0.87, ...]
                                                                  |
                                                                  | cosine similarity
                                                                  v
    Document: "RAG retrieves..."   -------------------->  [0.21, -0.18, 0.85, ...]
Strengths:
- Semantic Understanding: Understands synonyms, paraphrases, related concepts
- Query Flexibility: Users can phrase questions naturally
- Cross-Lingual: Multilingual models can match across languages
- Concept Matching: "machine learning" matches "artificial intelligence"
- Dense Representation: Captures nuanced meaning
Weaknesses:
- Computational Cost: Embedding generation is slow on CPU and benefits from a GPU
- Model Dependency: Quality depends on embedding model choice
- Rare Terms: May miss highly specific or technical terms
- Index Size: Dense vectors require more storage
- False Positives: May retrieve semantically similar but irrelevant content
Example:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(documents, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
results = retriever.invoke("How does RAG work?")
Impact on RAG Performance
| Scenario | Better Method | Reason |
|---|---|---|
| Technical documentation (exact terms) | BM25 | Precise terminology matching |
| Customer support (varied phrasing) | Dense Embeddings | Handles question variations |
| Legal/medical domains | Hybrid | Precision AND semantic understanding |
| Code search | BM25 | Exact syntax matching |
| Conceptual questions | Dense Embeddings | Semantic relationship capture |
| Low-resource deployment | BM25 | No GPU required |
| Multilingual applications | Dense Embeddings | Cross-lingual capabilities |
Hybrid Retrieval: Best of Both Worlds
Modern RAG systems often combine both methods:
    HYBRID RETRIEVAL

    Query --+--> BM25 Retriever  --> Top-K1 Documents --+
            |                                           +--> Reciprocal Rank Fusion --> Final Ranked Documents
            +--> Dense Retriever --> Top-K2 Documents --+
Fusion Methods:
- Reciprocal Rank Fusion (RRF): Combines rankings from multiple retrievers (sketched below)
- Weighted Combination: Assign weights to each methodβs scores
- Re-ranking: Use a cross-encoder to re-rank combined results
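For concreteness, here is a minimal, self-contained sketch of Reciprocal Rank Fusion. The helper function is illustrative, not a library API; k=60 is the constant used in the original RRF paper.
```python
# Fuse several rankings: each document scores the sum of 1 / (k + rank) across rankings.
def reciprocal_rank_fusion(rankings, k=60):
    fused = {}
    for ranking in rankings:  # each ranking lists document ids, best first
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# e.g. combine a BM25 ranking with a dense-retrieval ranking
print(reciprocal_rank_fusion([["d2", "d1", "d3"], ["d2", "d3", "d4"]]))  # ['d2', 'd3', 'd1', 'd4']
```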
Example (LangChain ensemble retriever):
from langchain.retrievers import EnsembleRetriever
ensemble_retriever = EnsembleRetriever(
retrievers=[bm25_retriever, dense_retriever],
weights=[0.4, 0.6] # 40% BM25, 60% Dense
)
Recommendations for RAG System Design
- Start with Dense Embeddings: Better default for most QA use cases
- Add BM25 for Precision: When exact terminology matters
- Use Hybrid for Production: Combines strengths of both
- Tune K Carefully: More documents = more context but also more noise
- Consider Re-ranking: Cross-encoders can significantly improve relevance (a sketch follows this list)
- Evaluate on Your Data: Performance varies by domain and query types
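To illustrate the re-ranking recommendation, here is a minimal sketch using the sentence-transformers CrossEncoder class. The model name is one common public choice rather than a requirement, and `ensemble_retriever` is the hybrid retriever built in the example above.
```python
# Re-rank retrieved candidates by scoring each (query, document) pair jointly.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How does RAG reduce hallucinations?"
candidates = [doc.page_content for doc in ensemble_retriever.invoke(query)]

scores = reranker.predict([(query, text) for text in candidates])
reranked = [text for _, text in sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)]
top_docs = reranked[:3]  # keep only the most relevant passages for the prompt
```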
These answers provide foundational understanding of RAG systems as required for Assignment 2.