Building a RAG Pipeline from Scratch: A Complete Tutorial
[!NOTE] This post walks through Assignment 2 from the Engineering GenAI course, implementing a complete RAG (Retrieval-Augmented Generation) pipeline with Python, LangChain, and Groq.
Introduction: Why RAG Matters
Large Language Models are impressive, but they have critical limitations:
| Problem | Description |
|---|---|
| Knowledge Cutoff | Training data ends at a specific date |
| Hallucinations | Models confidently generate false information |
| No Private Data | Can't access your documents, databases, APIs |
| Static Knowledge | Updating requires expensive retraining |
RAG solves these problems by retrieving relevant information from external sources at inference time and grounding the LLM's response in that retrieved context.
Part 1: Foundational Concepts
Before implementing, let's understand the theory.
Question 1: What is RAG?
Definition: Retrieval-Augmented Generation (RAG) is a technique that enhances LLMs by fetching relevant information from external data sources at query time.
How RAG Improves Responses:
| Aspect | Plain LLM | RAG-Enhanced LLM |
|---|---|---|
| Knowledge Source | Frozen at training cutoff | Dynamic, up-to-date |
| Factual Accuracy | May hallucinate | Sourced from documents |
| Domain Specificity | Generic training | Can be specialized |
| Verifiability | May make up citations | Cites real documents |
| Updates | Requires retraining | Just update documents |
Question 2: Core Stages of a RAG Pipeline
┌───────────────────────────────────────────────────────────────┐
│  STAGE 1: INGESTION (Offline)                                  │
│  Documents → Chunks → Embeddings → Vector Store                │
└───────────────────────────────────────────────────────────────┘
                                ↓
┌───────────────────────────────────────────────────────────────┐
│  STAGE 2: RETRIEVAL (Per Query)                                │
│  Query → Embed → Similarity Search → Top-K Chunks              │
└───────────────────────────────────────────────────────────────┘
                                ↓
┌───────────────────────────────────────────────────────────────┐
│  STAGE 3: AUGMENTATION                                         │
│  Format Chunks + System Instructions + Original Question       │
└───────────────────────────────────────────────────────────────┘
                                ↓
┌───────────────────────────────────────────────────────────────┐
│  STAGE 4: GENERATION                                           │
│  LLM generates answer grounded in retrieved context            │
└───────────────────────────────────────────────────────────────┘
Stage 1 - Ingestion:
- Gather documents (PDFs, web pages, databases)
- Split into smaller chunks
- Convert to embeddings using models like sentence-transformers
- Store in vector databases (Chroma, FAISS, Pinecone)
Stage 2 - Retrieval:
- Embed the query using the same model
- Perform similarity search in vector space
- Return top-K most relevant chunks
Stage 3 - Augmentation:
- Format retrieved chunks with clear separators
- Add instructions: "Answer ONLY based on context"
- Include fallback: "If not in context, say I don't know"
Stage 4 - Generation:
- LLM receives the augmented prompt
- Generates response grounded in retrieved content
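To make Stages 2 through 4 concrete before diving into code, here is a tiny sketch of how an augmented prompt can be assembled; retrieved_chunks is a hypothetical stand-in for the output of the similarity search, and the full pipeline follows in Part 2:
# Hypothetical Stage 2 output (in practice, the top-K chunks from the vector store)
retrieved_chunks = [
    "RAG retrieves relevant information from external sources at inference time.",
    "The indexing stage prepares documents for efficient retrieval.",
]

# Stage 3: format chunks with clear separators, add instructions and a fallback
context = "\n\n---\n\n".join(retrieved_chunks)
augmented_prompt = f"""Answer ONLY based on the context below.
If the answer is not in the context, say "I don't know."

CONTEXT:
{context}

QUESTION:
What happens during indexing?

ANSWER:"""

# Stage 4: send augmented_prompt to an LLM (see call_groq_llm in Part 2.1)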
Question 3: Why RAG Reduces (But Doesn't Eliminate) Hallucinations
Why RAG Reduces Hallucinations:
- Explicit Evidence → Model has a ground truth to reference
- System Instructions → "Answer ONLY based on provided context"
- External Knowledge → Injects new information not in training
- Source Accountability → Claims trace to specific documents
- Current Information → Knowledge base can be updated
Why Hallucinations Persist:
| Limitation | Explanation |
|---|---|
| Retrieval Failures | Relevant docs not found → model fills gaps |
| Context Window Limits | Information may be truncated |
| Irrelevant Retrieval | Semantic similarity ≠ actual relevance |
| Chunk Boundaries | Info split across chunks loses context |
| Model Override | LLM may use parametric knowledge anyway |
| Synthesis Errors | Combining sources creates inconsistencies |
Question 4: BM25 vs Dense Embeddings
BM25 (Sparse/Lexical):
- Uses term frequency + inverse document frequency
- Exact/partial word matching
- Fast and lightweight
- No neural network required
- Weakness: Can't match synonyms ("car" ≠ "automobile")
Dense Embeddings (Semantic):
- Neural networks encode text to vectors
- Semantic similarity in vector space
- Handles synonyms, paraphrases, concepts
- Weakness: Can miss exact keywords, IDs, and rare technical terms
When to Use Each:
| Scenario | Better Method |
|---|---|
| Technical documentation | BM25 |
| Customer support | Dense |
| Legal/medical domains | Hybrid |
| Code search | BM25 |
| Conceptual questions | Dense |
Sweet spot: Hybrid retrieval combining both approaches (see the sketch in the Key Takeaways section).
Part 2: Implementation
2.1 Setting Up the LLM Interface
import os
from groq import Groq
groq_client = Groq(api_key=os.environ["GROQ_API_KEY"])
def call_groq_llm(prompt, model="llama-3.1-8b-instant"):
response = groq_client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": prompt},
],
temperature=0.2, # Low for focused responses
)
    return response.choices[0].message.content

2.2 Document Preparation & Chunking
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Create knowledge base documents
docs = [
Document(
page_content="""Retrieval-Augmented Generation (RAG) combines
LLMs with information retrieval. Instead of relying solely on
knowledge encoded in parameters, RAG dynamically retrieves
relevant information from external sources at inference time.""",
metadata={"source": "rag_fundamentals", "topic": "definition"}
),
Document(
page_content="""The indexing stage is the offline preprocessing
phase where documents are prepared for efficient retrieval. This
involves document collection, text chunking, embedding generation,
and vector storage in databases like Chroma or FAISS.""",
metadata={"source": "rag_pipeline", "topic": "indexing"}
),
# ... more documents
]
# Split into chunks
splitter = RecursiveCharacterTextSplitter(
chunk_size=300, # Characters per chunk
chunk_overlap=50 # Overlap for context preservation
)
split_docs = splitter.split_documents(docs)
print(f"Original: {len(docs)} β Chunks: {len(split_docs)}")Why these chunking settings?
- 300 chars: Small enough for precise retrieval
- 50 char overlap: Preserves context across boundaries
- Recursive splitting: Uses hierarchy of separators (paragraphs → sentences → words)
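The separator hierarchy can be seen directly by passing the splitter's default separators explicitly. A small sketch with deliberately tiny demo values for chunk_size and chunk_overlap so the split is visible:
from langchain_text_splitters import RecursiveCharacterTextSplitter

demo_splitter = RecursiveCharacterTextSplitter(
    chunk_size=60,       # demo value, far smaller than production settings
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""],  # paragraphs, then lines, then words, then characters
)

sample = "First paragraph about RAG.\n\nSecond paragraph about chunking strategies."
for chunk in demo_splitter.split_text(sample):
    print(repr(chunk))
# The split lands on the paragraph break, not in the middle of a sentence.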
2.3 Embeddings & Vector Store
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
# Initialize embedding model
embeddings = HuggingFaceEmbeddings(
model_name="sentence-transformers/all-MiniLM-L6-v2"
)
# Create vector store
vectorstore = Chroma(
collection_name="rag_knowledge_base",
embedding_function=embeddings
)
# Add documents
vectorstore.add_documents(split_docs)
# Create retriever
retriever = vectorstore.as_retriever(
search_kwargs={"k": 5} # Return top 5 chunks
)

Why all-MiniLM-L6-v2?
- 384 dimensions (good balance)
- 6 layers, 22M parameters (fast inference)
- Trained on 1B+ sentence pairs
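A quick sanity check of the 384-dimension claim, using the embed_query method of the embeddings object defined above:
query_vector = embeddings.embed_query("What is retrieval-augmented generation?")
print(len(query_vector))    # 384 dimensions for all-MiniLM-L6-v2
print(query_vector[:5])     # first few components of the query embedding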
2.4 Basic RAG Implementation
def basic_rag_answer(question, k=5):
# 1. Retrieve relevant documents
retrieved_docs = vectorstore.similarity_search(question, k=k)
# 2. Build context from chunks
context = "\n\n".join(doc.page_content for doc in retrieved_docs)
# 3. Construct grounded prompt
prompt = f"""You are a knowledgeable assistant that answers
questions based on provided context.
INSTRUCTIONS:
- Answer using ONLY the information in the context below.
- If the context doesn't contain the answer, say:
"I don't have enough information to answer this."
- Be concise but thorough.
CONTEXT:
{context}
QUESTION:
{question}
ANSWER:"""
# 4. Generate answer
    return call_groq_llm(prompt)

Test:
print(basic_rag_answer("What is Self-RAG?"))
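As noted under Question 3, a common failure mode is retrieval failure: if nothing relevant is retrieved, the model may fill the gap from its parametric knowledge. A minimal sketch of one mitigation, using Chroma's similarity_search_with_score, which returns distance scores (lower means closer); the 0.8 threshold is illustrative and needs tuning for your embedding model and data:
def guarded_rag_answer(question, k=5, max_distance=0.8):
    # Retrieve chunks together with their distance scores
    results = vectorstore.similarity_search_with_score(question, k=k)

    # Keep only chunks that are close enough to the query
    close_docs = [doc for doc, distance in results if distance <= max_distance]
    if not close_docs:
        return "I don't have enough information to answer this."

    context = "\n\n".join(doc.page_content for doc in close_docs)
    prompt = f"""Answer ONLY based on the context below.
If the context doesn't contain the answer, say "I don't have enough information to answer this."

CONTEXT:
{context}

QUESTION:
{question}

ANSWER:"""
    return call_groq_llm(prompt)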
2.5 Web-Based RAG Pipeline

For real-world applications, ingest data from the web:
from langchain_community.document_loaders import WebBaseLoader
import bs4
# Load web pages
loader = WebBaseLoader(
web_paths=(
"https://lilianweng.github.io/posts/2023-06-23-agent/",
"https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/",
),
bs_kwargs=dict(
parse_only=bs4.SoupStrainer(
class_=("post-content", "post-title", "post-header")
)
),
)
web_docs = loader.load()
# Chunk with larger sizes for web content
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200
)
web_splits = text_splitter.split_documents(web_docs)
# Create vector store
web_vectorstore = Chroma.from_documents(
documents=web_splits,
embedding=embeddings
)

Part 3: Advanced RAG Techniques
3.1 RRR-RAG (Rewrite-Retrieve-Respond)
Problem: User queries are often ambiguous or poorly phrased for retrieval.
Solution: Rewrite the query before retrieval.
def rrr_rag_answer(question):
# Step 1: REWRITE the query
rewrite_prompt = f"""Transform this question into a search query
optimized for semantic search.
INSTRUCTIONS:
1. Expand abbreviations and acronyms
2. Add relevant keywords and synonyms
3. Remove conversational words (how, what, explain)
4. Focus on core concepts and technical terms
5. Output ONLY the rewritten query
Original Question: {question}
Rewritten Query:"""
rewritten_query = call_groq_llm(rewrite_prompt).strip()
print(f"[RRR] Original: {question}")
print(f"[RRR] Rewritten: {rewritten_query}")
# Step 2: RETRIEVE using rewritten query
docs = vectorstore.similarity_search(rewritten_query, k=5)
context = "\n\n---\n\n".join(doc.page_content for doc in docs)
# Step 3: RESPOND using original question
answer_prompt = f"""Answer based ONLY on this context:
{context}
Question: {question}
If the answer is not in the context, say "I don't have enough information."
Answer:"""
    return call_groq_llm(answer_prompt)

Key insight: Use rewritten query for retrieval, but original question for generation.
3.2 Self-RAG (Iterative Refinement)
Problem: Single-pass retrieval may miss important context.
Solution: Iterate: retrieve → generate → refine query → repeat.
def self_rag(question, iterations=2):
current_query = question
answer = ""
for i in range(iterations):
print(f"\n[Self-RAG] Iteration {i+1}/{iterations}")
print(f"[Self-RAG] Query: {current_query}")
# Retrieve
docs = vectorstore.similarity_search(current_query, k=5)
context = "\n\n---\n\n".join(doc.page_content for doc in docs)
# Generate answer
prompt = f"""Answer based ONLY on this context. Be comprehensive.
Context:
{context}
Question: {question}
If information is insufficient, say so.
Answer:"""
answer = call_groq_llm(prompt)
print(f"[Self-RAG] Answer preview: {answer[:100]}...")
# Refine query for next iteration (except last)
if i < iterations - 1:
refine_prompt = f"""Create a search query to find additional
relevant information.
Original question: {question}
Current answer: {answer}
Create a query focusing on gaps or related concepts not covered.
Output ONLY the refined query:"""
current_query = call_groq_llm(refine_prompt).strip()
print(f"[Self-RAG] Refined query: {current_query}")
    return answer

When to use: Complex questions requiring multiple aspects of knowledge.
Part 4: Comparing RAG Strategies
questions = [
"What is the idea of refinement loops in RAG?",
"How does rewriting improve retrieval?"
]
for q in questions:
print(f"\nQuestion: {q}")
print("\n[Basic RAG]")
print(basic_rag_answer(q))
print("\n[RRR RAG]")
print(rrr_rag_answer(q))
print("\n[Self-RAG]")
    print(self_rag(q))

Trade-offs:
| Strategy | LLM Calls | Quality | Use Case |
|---|---|---|---|
| Basic RAG | 1 | Good | Simple questions, cost-sensitive |
| RRR-RAG | 2 | Better | Ambiguous queries |
| Self-RAG | 2-4+ | Best | Complex, multi-faceted questions |
Key Takeaways
1. RAG = LLM + Retrieval + Grounding
The magic is in grounding generation in retrieved evidence, not just model parameters.
2. Chunking Strategy is Critical
- Too small: Loses context
- Too large: Dilutes relevance
- Sweet spot: 500-1000 chars with 10-20% overlap
3. Same Embedding Model for Query & Documents
Mixing models = broken retrieval. The query and documents must live in the same vector space.
4. Prompt Engineering Reduces Hallucinations
Explicit instructions like "Answer ONLY from context" and "Say 'I don't know' if unsure" are essential.
5. Advanced Techniques Improve Quality at a Cost
- RRR-RAG: 2x LLM calls for better retrieval
- Self-RAG: N iterations for progressive refinement
- Choose based on your quality/cost trade-off.
6. Hybrid Retrieval is Often Best
Combine BM25 (lexical) + Dense Embeddings (semantic) for both precision and recall.
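A minimal sketch of hybrid retrieval with LangChain's BM25Retriever and EnsembleRetriever, reusing the split_docs chunks and the Chroma vectorstore from Part 2; BM25Retriever needs the rank_bm25 package, and the 0.5/0.5 weights and k values are illustrative:
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Lexical retriever over the same chunks used for the dense index
bm25_retriever = BM25Retriever.from_documents(split_docs)
bm25_retriever.k = 5

# Dense retriever backed by the existing Chroma vector store
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Blend both rankings (EnsembleRetriever uses reciprocal rank fusion)
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.5, 0.5],
)

hybrid_docs = hybrid_retriever.invoke("vector store indexing")
for doc in hybrid_docs:
    print(doc.metadata.get("source"), "→", doc.page_content[:60])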
Complete RAG Pipeline Summary
# 1. INGEST
docs = load_documents()
chunks = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(chunks, embeddings)
# 2. RETRIEVE
query = "What is RAG?"
docs = vectorstore.similarity_search(query, k=5)
# 3. AUGMENT
context = "\n\n".join(d.page_content for d in docs)
prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
# 4. GENERATE
answer = call_groq_llm(prompt)

Resources
- LangChain Documentation
- Chroma Vector Database
- Sentence Transformers
- RAG Paper (Lewis et al.)
- Lilian Wengβs LLM-Powered Agents Blog
Happy building!