Building a RAG Pipeline from Scratch: A Complete Tutorial
[!NOTE] This post walks through Assignment 2 from the Engineering GenAI course, implementing a complete RAG (Retrieval-Augmented Generation) pipeline with Python, LangChain, and Groq.
Introduction: Why RAG Matters
Large Language Models are impressive, but they have critical limitations:
| Problem | Description |
|---|---|
| Knowledge Cutoff | Training data ends at a specific date |
| Hallucinations | Models confidently generate false information |
| No Private Data | Can't access your documents, databases, APIs |
| Static Knowledge | Updating requires expensive retraining |
RAG solves these problems by retrieving relevant information from external sources at inference time and grounding the LLM's response in that retrieved context.
Part 1: Foundational Concepts
Before implementing, let's understand the theory.
Question 1: What is RAG?
Definition: Retrieval-Augmented Generation (RAG) is a technique that enhances LLMs by fetching relevant information from external data sources at query time.
How RAG Improves Responses:
| Aspect | Plain LLM | RAG-Enhanced LLM |
|---|---|---|
| Knowledge Source | Frozen at training cutoff | Dynamic, up-to-date |
| Factual Accuracy | May hallucinate | Sourced from documents |
| Domain Specificity | Generic training | Can be specialized |
| Verifiability | May make up citations | Cites real documents |
| Updates | Requires retraining | Just update documents |
Question 2: Core Stages of a RAG Pipeline
┌───────────────────────────────────────────────────────────────┐
│  STAGE 1: INGESTION (Offline)                                  │
│  Documents → Chunks → Embeddings → Vector Store                │
└───────────────────────────────────────────────────────────────┘
                                ↓
┌───────────────────────────────────────────────────────────────┐
│  STAGE 2: RETRIEVAL (Per Query)                                │
│  Query → Embed → Similarity Search → Top-K Chunks              │
└───────────────────────────────────────────────────────────────┘
                                ↓
┌───────────────────────────────────────────────────────────────┐
│  STAGE 3: AUGMENTATION                                         │
│  Format Chunks + System Instructions + Original Question       │
└───────────────────────────────────────────────────────────────┘
                                ↓
┌───────────────────────────────────────────────────────────────┐
│  STAGE 4: GENERATION                                           │
│  LLM generates answer grounded in retrieved context            │
└───────────────────────────────────────────────────────────────┘
Stage 1 - Ingestion:
- Gather documents (PDFs, web pages, databases)
- Split into smaller chunks
- Convert to embeddings using models like sentence-transformers
- Store in vector databases (Chroma, FAISS, Pinecone)
Stage 2 - Retrieval:
- Embed the query using the same model
- Perform similarity search in vector space
- Return top-K most relevant chunks
Stage 3 - Augmentation:
- Format retrieved chunks with clear separators
- Add instructions: "Answer ONLY based on context"
- Include fallback: "If not in context, say I don't know"
Stage 4 - Generation:
- LLM receives the augmented prompt
- Generates response grounded in retrieved content
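To make Stages 2 through 4 concrete before diving into code, here is a tiny sketch of how an augmented prompt can be assembled; retrieved_chunks is a hypothetical stand-in for the output of the similarity search, and the full pipeline follows in Part 2:
# Hypothetical Stage 2 output (in practice, the top-K chunks from the vector store)
retrieved_chunks = [
    "RAG retrieves relevant information from external sources at inference time.",
    "The indexing stage prepares documents for efficient retrieval.",
]

# Stage 3: format chunks with clear separators, add instructions and a fallback
context = "\n\n---\n\n".join(retrieved_chunks)
augmented_prompt = f"""Answer ONLY based on the context below.
If the answer is not in the context, say "I don't know."

CONTEXT:
{context}

QUESTION:
What happens during indexing?

ANSWER:"""

# Stage 4: send augmented_prompt to an LLM (see call_groq_llm in Part 2.1)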
Question 3: Why RAG Reduces (But Doesn't Eliminate) Hallucinations
Why RAG Reduces Hallucinations:
- Explicit Evidence → Model has a ground truth to reference
- System Instructions → "Answer ONLY based on provided context"
- External Knowledge → Injects new information not in training
- Source Accountability → Claims trace to specific documents
- Current Information → Knowledge base can be updated
Why Hallucinations Persist:
| Limitation | Explanation |
|---|---|
| Retrieval Failures | Relevant docs not found → model fills gaps |
| Context Window Limits | Information may be truncated |
| Irrelevant Retrieval | Semantic similarity ≠ actual relevance |
| Chunk Boundaries | Info split across chunks loses context |
| Model Override | LLM may use parametric knowledge anyway |
| Synthesis Errors | Combining sources creates inconsistencies |
Question 4: BM25 vs Dense Embeddings
BM25 (Sparse/Lexical):
- Uses term frequency + inverse document frequency
- Exact/partial word matching
- Fast and lightweight
- No neural network required
- Weakness: Can't match synonyms ("car" ≠ "automobile")
Dense Embeddings (Semantic):
- Neural networks encode text to vectors
- Semantic similarity in vector space
- Handles synonyms, paraphrases, concepts
- Weakness: Can miss exact keywords, IDs, and rare technical terms
When to Use Each:
| Scenario | Better Method |
|---|---|
| Technical documentation | BM25 |
| Customer support | Dense |
| Legal/medical domains | Hybrid |
| Code search | BM25 |
| Conceptual questions | Dense |
Sweet spot: Hybrid retrieval combining both approaches (see the sketch in the Key Takeaways section).
Part 2: Implementation
2.1 Setting Up the LLM Interface
import os
from groq import Groq
groq_client = Groq(api_key=os.environ["GROQ_API_KEY"])
def call_groq_llm(prompt, model="llama-3.1-8b-instant"):
response = groq_client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": prompt},
],
temperature=0.2, # Low for focused responses
)
    return response.choices[0].message.content

2.2 Document Preparation & Chunking
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Create knowledge base documents
docs = [
Document(
page_content="""Retrieval-Augmented Generation (RAG) combines
LLMs with information retrieval. Instead of relying solely on
knowledge encoded in parameters, RAG dynamically retrieves
relevant information from external sources at inference time.""",
metadata={"source": "rag_fundamentals", "topic": "definition"}
),
Document(
page_content="""The indexing stage is the offline preprocessing
phase where documents are prepared for efficient retrieval. This
involves document collection, text chunking, embedding generation,
and vector storage in databases like Chroma or FAISS.""",
metadata={"source": "rag_pipeline", "topic": "indexing"}
),
# ... more documents
]
# Split into chunks
splitter = RecursiveCharacterTextSplitter(
chunk_size=300, # Characters per chunk
chunk_overlap=50 # Overlap for context preservation
)
split_docs = splitter.split_documents(docs)
print(f"Original: {len(docs)} β Chunks: {len(split_docs)}")Why these chunking settings?
- 300 chars: Small enough for precise retrieval
- 50 char overlap: Preserves context across boundaries
- Recursive splitting: Uses hierarchy of separators (paragraphs → sentences → words)
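The separator hierarchy can be seen directly by passing the splitter's default separators explicitly. A small sketch with deliberately tiny demo values for chunk_size and chunk_overlap so the split is visible:
from langchain_text_splitters import RecursiveCharacterTextSplitter

demo_splitter = RecursiveCharacterTextSplitter(
    chunk_size=60,       # demo value, far smaller than production settings
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""],  # paragraphs, then lines, then words, then characters
)

sample = "First paragraph about RAG.\n\nSecond paragraph about chunking strategies."
for chunk in demo_splitter.split_text(sample):
    print(repr(chunk))
# The split lands on the paragraph break, not in the middle of a sentence.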
2.3 Embeddings & Vector Store
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
# Initialize embedding model
embeddings = HuggingFaceEmbeddings(
model_name="sentence-transformers/all-MiniLM-L6-v2"
)
# Create vector store
vectorstore = Chroma(
collection_name="rag_knowledge_base",
embedding_function=embeddings
)
# Add documents
vectorstore.add_documents(split_docs)
# Create retriever
retriever = vectorstore.as_retriever(
search_kwargs={"k": 5} # Return top 5 chunks
)

Why all-MiniLM-L6-v2?
- 384 dimensions (good balance)
- 6 layers, 22M parameters (fast inference)
- Trained on 1B+ sentence pairs
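A quick sanity check of the 384-dimension claim, using the embed_query method of the embeddings object defined above:
query_vector = embeddings.embed_query("What is retrieval-augmented generation?")
print(len(query_vector))    # 384 dimensions for all-MiniLM-L6-v2
print(query_vector[:5])     # first few components of the query embedding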
2.4 Basic RAG Implementation
def basic_rag_answer(question, k=5):
# 1. Retrieve relevant documents
retrieved_docs = vectorstore.similarity_search(question, k=k)
# 2. Build context from chunks
context = "\n\n".join(doc.page_content for doc in retrieved_docs)
# 3. Construct grounded prompt
prompt = f"""You are a knowledgeable assistant that answers
questions based on provided context.
INSTRUCTIONS:
- Answer using ONLY the information in the context below.
- If the context doesn't contain the answer, say:
"I don't have enough information to answer this."
- Be concise but thorough.
CONTEXT:
{context}
QUESTION:
{question}
ANSWER:"""
# 4. Generate answer
    return call_groq_llm(prompt)

Test:
print(basic_rag_answer("What is Self-RAG?"))
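As noted under Question 3, a common failure mode is retrieval failure: if nothing relevant is retrieved, the model may fill the gap from its parametric knowledge. A minimal sketch of one mitigation, using Chroma's similarity_search_with_score, which returns distance scores (lower means closer); the 0.8 threshold is illustrative and needs tuning for your embedding model and data:
def guarded_rag_answer(question, k=5, max_distance=0.8):
    # Retrieve chunks together with their distance scores
    results = vectorstore.similarity_search_with_score(question, k=k)

    # Keep only chunks that are close enough to the query
    close_docs = [doc for doc, distance in results if distance <= max_distance]
    if not close_docs:
        return "I don't have enough information to answer this."

    context = "\n\n".join(doc.page_content for doc in close_docs)
    prompt = f"""Answer ONLY based on the context below.
If the context doesn't contain the answer, say "I don't have enough information to answer this."

CONTEXT:
{context}

QUESTION:
{question}

ANSWER:"""
    return call_groq_llm(prompt)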
2.5 Web-Based RAG Pipeline

For real-world applications, ingest data from the web:
from langchain_community.document_loaders import WebBaseLoader
import bs4
# Load web pages
loader = WebBaseLoader(
web_paths=(
"https://lilianweng.github.io/posts/2023-06-23-agent/",
"https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/",
),
bs_kwargs=dict(
parse_only=bs4.SoupStrainer(
class_=("post-content", "post-title", "post-header")
)
),
)
web_docs = loader.load()
# Chunk with larger sizes for web content
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200
)
web_splits = text_splitter.split_documents(web_docs)
# Create vector store
web_vectorstore = Chroma.from_documents(
documents=web_splits,
embedding=embeddings
)

Part 3: Advanced RAG Techniques
3.1 RRR-RAG (Rewrite-Retrieve-Respond)
Problem: User queries are often ambiguous or poorly phrased for retrieval.
Solution: Rewrite the query before retrieval.
def rrr_rag_answer(question):
# Step 1: REWRITE the query
rewrite_prompt = f"""Transform this question into a search query
optimized for semantic search.
INSTRUCTIONS:
1. Expand abbreviations and acronyms
2. Add relevant keywords and synonyms
3. Remove conversational words (how, what, explain)
4. Focus on core concepts and technical terms
5. Output ONLY the rewritten query
Original Question: {question}
Rewritten Query:"""
rewritten_query = call_groq_llm(rewrite_prompt).strip()
print(f"[RRR] Original: {question}")
print(f"[RRR] Rewritten: {rewritten_query}")
# Step 2: RETRIEVE using rewritten query
docs = vectorstore.similarity_search(rewritten_query, k=5)
context = "\n\n---\n\n".join(doc.page_content for doc in docs)
# Step 3: RESPOND using original question
answer_prompt = f"""Answer based ONLY on this context:
{context}
Question: {question}
If the answer is not in the context, say "I don't have enough information."
Answer:"""
    return call_groq_llm(answer_prompt)

Key insight: Use rewritten query for retrieval, but original question for generation.
3.2 Self-RAG (Iterative Refinement)
Problem: Single-pass retrieval may miss important context.
Solution: Iterate: retrieve → generate → refine query → repeat.
def self_rag(question, iterations=2):
current_query = question
answer = ""
for i in range(iterations):
print(f"\n[Self-RAG] Iteration {i+1}/{iterations}")
print(f"[Self-RAG] Query: {current_query}")
# Retrieve
docs = vectorstore.similarity_search(current_query, k=5)
context = "\n\n---\n\n".join(doc.page_content for doc in docs)
# Generate answer
prompt = f"""Answer based ONLY on this context. Be comprehensive.
Context:
{context}
Question: {question}
If information is insufficient, say so.
Answer:"""
answer = call_groq_llm(prompt)
print(f"[Self-RAG] Answer preview: {answer[:100]}...")
# Refine query for next iteration (except last)
if i < iterations - 1:
refine_prompt = f"""Create a search query to find additional
relevant information.
Original question: {question}
Current answer: {answer}
Create a query focusing on gaps or related concepts not covered.
Output ONLY the refined query:"""
current_query = call_groq_llm(refine_prompt).strip()
print(f"[Self-RAG] Refined query: {current_query}")
    return answer

When to use: Complex questions requiring multiple aspects of knowledge.
Part 4: Comparing RAG Strategies
questions = [
"What is the idea of refinement loops in RAG?",
"How does rewriting improve retrieval?"
]
for q in questions:
print(f"\nQuestion: {q}")
print("\n[Basic RAG]")
print(basic_rag_answer(q))
print("\n[RRR RAG]")
print(rrr_rag_answer(q))
print("\n[Self-RAG]")
    print(self_rag(q))

Trade-offs:
| Strategy | LLM Calls | Quality | Use Case |
|---|---|---|---|
| Basic RAG | 1 | Good | Simple questions, cost-sensitive |
| RRR-RAG | 2 | Better | Ambiguous queries |
| Self-RAG | 2-4+ | Best | Complex, multi-faceted questions |
Key Takeaways
1. RAG = LLM + Retrieval + Grounding
The magic is in grounding generation in retrieved evidence, not just model parameters.
2. Chunking Strategy is Critical
- Too small: Loses context
- Too large: Dilutes relevance
- Sweet spot: 500-1000 chars with 10-20% overlap
3. Same Embedding Model for Query & Documents
Mixing models = broken retrieval. The query and documents must live in the same vector space.
4. Prompt Engineering Reduces Hallucinations
Explicit instructions like "Answer ONLY from context" and "Say 'I don't know' if unsure" are essential.
5. Advanced Techniques Improve Quality at a Cost
- RRR-RAG: 2x LLM calls for better retrieval
- Self-RAG: N iterations for progressive refinement
- Choose based on your quality/cost trade-off.
6. Hybrid Retrieval is Often Best
Combine BM25 (lexical) + Dense Embeddings (semantic) for both precision and recall.
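A minimal sketch of hybrid retrieval with LangChain's BM25Retriever and EnsembleRetriever, reusing the split_docs chunks and the Chroma vectorstore from Part 2; BM25Retriever needs the rank_bm25 package, and the 0.5/0.5 weights and k values are illustrative:
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Lexical retriever over the same chunks used for the dense index
bm25_retriever = BM25Retriever.from_documents(split_docs)
bm25_retriever.k = 5

# Dense retriever backed by the existing Chroma vector store
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Blend both rankings (EnsembleRetriever uses reciprocal rank fusion)
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.5, 0.5],
)

hybrid_docs = hybrid_retriever.invoke("vector store indexing")
for doc in hybrid_docs:
    print(doc.metadata.get("source"), "→", doc.page_content[:60])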
Complete RAG Pipeline Summary
# 1. INGEST
docs = load_documents()
chunks = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(chunks, embeddings)
# 2. RETRIEVE
query = "What is RAG?"
docs = vectorstore.similarity_search(query, k=5)
# 3. AUGMENT
context = "\n\n".join(d.page_content for d in docs)
prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
# 4. GENERATE
answer = call_groq_llm(prompt)

Resources
- LangChain Documentation
- Chroma Vector Database
- Sentence Transformers
- RAG Paper (Lewis et al.)
- Lilian Wengβs LLM-Powered Agents Blog
Happy building!