Building a RAG Pipeline from Scratch: A Complete Tutorial

[!NOTE] This post walks through Assignment 2 from the Engineering GenAI course, implementing a complete RAG (Retrieval-Augmented Generation) pipeline with Python, LangChain, and Groq.

Introduction: Why RAG Matters

Large Language Models are impressive, but they have critical limitations:

Problem          | Description
-----------------|-----------------------------------------------
Knowledge Cutoff | Training data ends at a specific date
Hallucinations   | Models confidently generate false information
No Private Data  | Can’t access your documents, databases, APIs
Static Knowledge | Updating requires expensive retraining

RAG solves these problems by retrieving relevant information from external sources at inference time and grounding the LLM’s response in that retrieved context.


Part 1: Foundational Concepts

Before implementing, let’s understand the theory.

Question 1: What is RAG?

Definition: Retrieval-Augmented Generation (RAG) is a technique that enhances LLMs by fetching relevant information from external data sources at query time.

How RAG Improves Responses:

Aspect             | Plain LLM             | RAG-Enhanced LLM
-------------------|-----------------------|-----------------------
Knowledge Source   | Cutoff in time        | Dynamic, up-to-date
Factual Accuracy   | May hallucinate       | Sourced from documents
Domain Specificity | Generic training      | Can be specialized
Verifiability      | May make up citations | Cites real documents
Updates            | Requires retraining   | Just update documents

Question 2: Core Stages of a RAG Pipeline

┌─────────────────────────────────────────────────────────────┐
│ STAGE 1: INGESTION (Offline)                                │
│   Documents → Chunks → Embeddings → Vector Store            │
└─────────────────────────────────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────────┐
│ STAGE 2: RETRIEVAL (Per Query)                              │
│   Query → Embed → Similarity Search → Top-K Chunks          │
└─────────────────────────────────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────────┐
│ STAGE 3: AUGMENTATION                                       │
│   Format Chunks + System Instructions + Original Question   │
└─────────────────────────────────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────────┐
│ STAGE 4: GENERATION                                         │
│   LLM generates answer grounded in retrieved context        │
└─────────────────────────────────────────────────────────────┘

Stage 1 - Ingestion:

  • Gather documents (PDFs, web pages, databases)
  • Split into smaller chunks
  • Convert to embeddings using models like sentence-transformers
  • Store in vector databases (Chroma, FAISS, Pinecone)

Stage 2 - Retrieval:

  • Embed the query using the same model
  • Perform similarity search in vector space
  • Return top-K most relevant chunks

Stage 3 - Augmentation:

  • Format retrieved chunks with clear separators
  • Add instructions: “Answer ONLY based on context”
  • Include fallback: “If not in context, say I don’t know”

Stage 4 - Generation:

  • LLM receives the augmented prompt
  • Generates response grounded in retrieved content
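
To make these four stages concrete before the full LangChain implementation in Part 2, here is a minimal, framework-free sketch using only sentence-transformers and numpy. The three-chunk “knowledge base” and the query are invented for illustration, and the final LLM call is omitted:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# STAGE 1: INGESTION - embed a tiny set of pre-chunked passages
chunks = [
    "RAG retrieves external context at inference time.",
    "Chunking splits documents into smaller passages.",
    "Vector stores index embeddings for fast similarity search.",
]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

# STAGE 2: RETRIEVAL - embed the query and rank chunks by cosine similarity
query = "How does RAG get up-to-date information?"
query_vec = model.encode([query], normalize_embeddings=True)[0]
scores = chunk_vecs @ query_vec              # cosine similarity (unit vectors)
top_k = np.argsort(scores)[::-1][:2]

# STAGE 3: AUGMENTATION - combine instructions, retrieved context, and question
context = "\n".join(chunks[i] for i in top_k)
prompt = f"Answer ONLY from this context:\n{context}\n\nQuestion: {query}"

# STAGE 4: GENERATION - pass `prompt` to any LLM (omitted here)
print(prompt)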

Question 3: Why RAG Reduces (But Doesn’t Eliminate) Hallucinations

Why RAG Reduces Hallucinations:

  1. Explicit Evidence: Model has a ground truth to reference
  2. System Instructions: “Answer ONLY based on provided context”
  3. External Knowledge: Injects new information not in training
  4. Source Accountability: Claims trace to specific documents
  5. Current Information: Knowledge base can be updated

Why Hallucinations Persist:

Limitation            | Explanation
----------------------|---------------------------------------------
Retrieval Failures    | Relevant docs not found → model fills gaps
Context Window Limits | Information may be truncated
Irrelevant Retrieval  | Semantic similarity ≠ actual relevance
Chunk Boundaries      | Info split across chunks loses context
Model Override        | LLM may use parametric knowledge anyway
Synthesis Errors      | Combining sources creates inconsistencies

Question 4: BM25 vs Dense Embeddings

BM25 (Sparse/Lexical):

  • Uses term frequency + inverse document frequency
  • Exact/partial word matching
  • Fast and lightweight
  • No neural network required
  • Weakness: Can’t match synonyms (“car” ≠ “automobile”)

Dense Embeddings (Semantic):

  • Neural networks encode text to vectors
  • Semantic similarity in vector space
  • Handles synonyms, paraphrases, concepts
  • Weakness: May miss exact technical terms, IDs, and rare keywords that lexical matching would catch

When to Use Each:

Scenario                | Better Method
------------------------|--------------
Technical documentation | BM25
Customer support        | Dense
Legal/medical domains   | Hybrid
Code search             | BM25
Conceptual questions    | Dense

Sweet spot: Hybrid retrieval combining both approaches.
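
The synonym weakness is easy to see with the rank_bm25 package and a two-sentence toy corpus (both the corpus and the query below are invented for illustration):

from rank_bm25 import BM25Okapi

corpus = [
    "The automobile was parked in the garage.",
    "The car broke down on the highway.",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query_tokens = "car trouble".lower().split()
print(bm25.get_scores(query_tokens))
# The first sentence scores 0.0: BM25 shares no terms with the query,
# even though "automobile" means the same thing as "car". A dense
# embedding model would place both sentences close to the query.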


Part 2: Implementation

2.1 Setting Up the LLM Interface

import os
from groq import Groq
 
groq_client = Groq(api_key=os.environ["GROQ_API_KEY"])
 
def call_groq_llm(prompt, model="llama-3.1-8b-instant"):
    response = groq_client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.2,  # Low for focused responses
    )
    return response.choices[0].message.content
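
A quick smoke test of the wrapper (this assumes GROQ_API_KEY is exported in your environment):

print(call_groq_llm("In one sentence, what does retrieval-augmented generation do?"))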

2.2 Document Preparation & Chunking

from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
 
# Create knowledge base documents
docs = [
    Document(
        page_content="""Retrieval-Augmented Generation (RAG) combines 
LLMs with information retrieval. Instead of relying solely on 
knowledge encoded in parameters, RAG dynamically retrieves 
relevant information from external sources at inference time.""",
        metadata={"source": "rag_fundamentals", "topic": "definition"}
    ),
    Document(
        page_content="""The indexing stage is the offline preprocessing 
phase where documents are prepared for efficient retrieval. This 
involves document collection, text chunking, embedding generation, 
and vector storage in databases like Chroma or FAISS.""",
        metadata={"source": "rag_pipeline", "topic": "indexing"}
    ),
    # ... more documents
]
 
# Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,      # Characters per chunk
    chunk_overlap=50     # Overlap for context preservation
)
split_docs = splitter.split_documents(docs)
 
print(f"Original: {len(docs)} β†’ Chunks: {len(split_docs)}")

Why these chunking settings?

  • 300 chars: Small enough for precise retrieval
  • 50 char overlap: Preserves context across boundaries
  • Recursive splitting: Tries a hierarchy of separators (paragraphs → lines → words → characters)
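
A quick way to sanity-check these settings is to print each chunk’s length and metadata and eyeball where the boundaries fall (a small diagnostic, not part of the pipeline):

for i, chunk in enumerate(split_docs):
    source = chunk.metadata.get("source")
    print(f"Chunk {i} ({len(chunk.page_content)} chars, source={source}):")
    print(f"  {chunk.page_content[:80]}...")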

2.3 Embeddings & Vector Store

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
 
# Initialize embedding model
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
 
# Create vector store
vectorstore = Chroma(
    collection_name="rag_knowledge_base",
    embedding_function=embeddings
)
 
# Add documents
vectorstore.add_documents(split_docs)
 
# Create retriever
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 5}  # Return top 5 chunks
)

Why all-MiniLM-L6-v2?

  • 384 dimensions (good balance)
  • 6 layers, 22M parameters (fast inference)
  • Trained on 1B+ sentence pairs
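
As a quick sanity check that queries land in the same 384-dimensional space as the documents:

vec = embeddings.embed_query("What is retrieval-augmented generation?")
print(len(vec))  # 384 for all-MiniLM-L6-v2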

2.4 Basic RAG Implementation

def basic_rag_answer(question, k=5):
    # 1. Retrieve relevant documents
    retrieved_docs = vectorstore.similarity_search(question, k=k)
    
    # 2. Build context from chunks
    context = "\n\n".join(doc.page_content for doc in retrieved_docs)
    
    # 3. Construct grounded prompt
    prompt = f"""You are a knowledgeable assistant that answers 
questions based on provided context.
 
INSTRUCTIONS:
- Answer using ONLY the information in the context below.
- If the context doesn't contain the answer, say: 
  "I don't have enough information to answer this."
- Be concise but thorough.
 
CONTEXT: 
{context}
 
QUESTION: 
{question}
 
ANSWER:"""
    
    # 4. Generate answer
    return call_groq_llm(prompt)

Test:

print(basic_rag_answer("What is Self-RAG?"))
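
When an answer looks off, it helps to inspect what was actually retrieved. Chroma also exposes a scored search; the score is a distance, so lower means more similar:

for doc, score in vectorstore.similarity_search_with_score("What is Self-RAG?", k=3):
    print(f"{score:.3f}  {doc.metadata.get('topic')}  {doc.page_content[:60]}...")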

2.5 Web-Based RAG Pipeline

For real-world applications, ingest data from the web:

from langchain_community.document_loaders import WebBaseLoader
import bs4
 
# Load web pages
loader = WebBaseLoader(
    web_paths=(
        "https://lilianweng.github.io/posts/2023-06-23-agent/",
        "https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/",
    ),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
web_docs = loader.load()
 
# Chunk with larger sizes for web content
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
web_splits = text_splitter.split_documents(web_docs)
 
# Create vector store
web_vectorstore = Chroma.from_documents(
    documents=web_splits,
    embedding=embeddings
)
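
The web store is queried exactly like the local one; for example (the question below is just an illustration against the agents post):

for doc in web_vectorstore.similarity_search("What is task decomposition?", k=3):
    print(doc.metadata.get("source"), "-", doc.page_content[:80], "...")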

Part 3: Advanced RAG Techniques

3.1 RRR-RAG (Rewrite-Retrieve-Respond)

Problem: User queries are often ambiguous or poorly phrased for retrieval.

Solution: Rewrite the query before retrieval.

def rrr_rag_answer(question):
    # Step 1: REWRITE the query
    rewrite_prompt = f"""Transform this question into a search query 
optimized for semantic search.
 
INSTRUCTIONS:
1. Expand abbreviations and acronyms
2. Add relevant keywords and synonyms
3. Remove conversational words (how, what, explain)
4. Focus on core concepts and technical terms
5. Output ONLY the rewritten query
 
Original Question: {question}
 
Rewritten Query:"""
    
    rewritten_query = call_groq_llm(rewrite_prompt).strip()
    print(f"[RRR] Original: {question}")
    print(f"[RRR] Rewritten: {rewritten_query}")
    
    # Step 2: RETRIEVE using rewritten query
    docs = vectorstore.similarity_search(rewritten_query, k=5)
    context = "\n\n---\n\n".join(doc.page_content for doc in docs)
    
    # Step 3: RESPOND using original question
    answer_prompt = f"""Answer based ONLY on this context:
 
{context}
 
Question: {question}
 
If the answer is not in the context, say "I don't have enough information."
 
Answer:"""
    
    return call_groq_llm(answer_prompt)

Key insight: Use rewritten query for retrieval, but original question for generation.


3.2 Self-RAG (Iterative Refinement)

Problem: Single-pass retrieval may miss important context.

Solution: Iterate retrieve → generate → refine query → repeat.

def self_rag(question, iterations=2):
    current_query = question
    answer = ""
    
    for i in range(iterations):
        print(f"\n[Self-RAG] Iteration {i+1}/{iterations}")
        print(f"[Self-RAG] Query: {current_query}")
        
        # Retrieve
        docs = vectorstore.similarity_search(current_query, k=5)
        context = "\n\n---\n\n".join(doc.page_content for doc in docs)
        
        # Generate answer
        prompt = f"""Answer based ONLY on this context. Be comprehensive.
 
Context:
{context}
 
Question: {question}
 
If information is insufficient, say so.
 
Answer:"""
        
        answer = call_groq_llm(prompt)
        print(f"[Self-RAG] Answer preview: {answer[:100]}...")
        
        # Refine query for next iteration (except last)
        if i < iterations - 1:
            refine_prompt = f"""Create a search query to find additional 
relevant information.
 
Original question: {question}
Current answer: {answer}
 
Create a query focusing on gaps or related concepts not covered.
Output ONLY the refined query:"""
            
            current_query = call_groq_llm(refine_prompt).strip()
            print(f"[Self-RAG] Refined query: {current_query}")
    
    return answer

When to use: Complex questions requiring multiple aspects of knowledge.


Part 4: Comparing RAG Strategies

questions = [
    "What is the idea of refinement loops in RAG?",
    "How does rewriting improve retrieval?"
]
 
for q in questions:
    print(f"\nQuestion: {q}")
    print("\n[Basic RAG]")
    print(basic_rag_answer(q))
    print("\n[RRR RAG]")
    print(rrr_rag_answer(q))
    print("\n[Self-RAG]")
    print(self_rag(q))

Trade-offs:

Strategy  | LLM Calls | Quality | Use Case
----------|-----------|---------|---------------------------------
Basic RAG | 1         | Good    | Simple questions, cost-sensitive
RRR-RAG   | 2         | Better  | Ambiguous queries
Self-RAG  | 2-4+      | Best    | Complex, multi-faceted questions

Key Takeaways

1. RAG = LLM + Retrieval + Grounding

The magic is in grounding generation in retrieved evidence, not just model parameters.

2. Chunking Strategy is Critical

  • Too small: Loses context
  • Too large: Dilutes relevance
  • Sweet spot: 500-1000 chars with 10-20% overlap
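
Expressed with the splitter used earlier, that recommendation looks roughly like this (the exact numbers are illustrative, not prescriptive):

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,      # inside the 500-1000 char sweet spot
    chunk_overlap=120,   # roughly 15% overlap
)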

3. Same Embedding Model for Query & Documents

Mixing models = broken retrieval. The query and documents must live in the same vector space.
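
One quick way to see why: different models produce vectors of different sizes in unrelated spaces, so mixing them either crashes on a dimension mismatch or silently returns meaningless scores (the model names below are just examples):

from langchain_community.embeddings import HuggingFaceEmbeddings

doc_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
query_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

print(len(doc_model.embed_query("RAG")))    # 384
print(len(query_model.embed_query("RAG")))  # 768 -> incompatible vector spaces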

4. Prompt Engineering Reduces Hallucinations

Explicit instructions like “Answer ONLY from context” and “Say ‘I don’t know’ if unsure” are essential.

5. Advanced Techniques Improve Quality at a Cost

  • RRR-RAG: 2x LLM calls for better retrieval
  • Self-RAG: N iterations for progressive refinement
  • Choose based on your quality/cost trade-off.

6. Hybrid Retrieval is Often Best

Combine BM25 (lexical) + Dense Embeddings (semantic) for both precision and recall.
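
A possible sketch with LangChain’s BM25Retriever and EnsembleRetriever, reusing split_docs and vectorstore from Part 2 (requires the rank_bm25 package; the weights are illustrative):

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Lexical retriever over the same chunks that back the vector store
bm25_retriever = BM25Retriever.from_documents(split_docs)
bm25_retriever.k = 5

# Dense retriever from the existing Chroma store
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Weighted fusion of both ranked lists
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.4, 0.6],
)

hybrid_docs = hybrid_retriever.invoke("What happens during the indexing stage?")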


Complete RAG Pipeline Summary

# 1. INGEST
docs = load_documents()                     # placeholder for any loader (web, PDF, database)
chunks = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(chunks, embeddings)
 
# 2. RETRIEVE
query = "What is RAG?"
docs = vectorstore.similarity_search(query, k=5)
 
# 3. AUGMENT
context = "\n\n".join(d.page_content for d in docs)
prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
 
# 4. GENERATE
answer = call_groq_llm(prompt)              # the Groq helper defined earlier, or any other LLM call



Happy building! 🚀