Building a ReAct Agent from Scratch: MockLLM vs Real LLM
NOTE
This post walks through Assignment 3 from the Engineering GenAI course, where we build a ReAct Agent that acts as a Dungeon Master. We'll compare two implementations: MockLLM (for testing) and Real LLM (Groq, for production).
Introduction: What is an AI Agent?
Traditional chatbots simply respond to inputs. AI Agents are different. They can:
- Reason about what to do
- Act by using external tools
- Observe the results and incorporate them into their reasoning
The ReAct pattern (Reason + Act) formalizes this into a structured loop:
```text
Thought: <reasoning about what to do>
Action: <tool to call>
Observation: <result from tool>
... (repeat as needed)
Final Answer: <response to user>
```

In this tutorial, we'll build a Dungeon Master agent that uses a dice-rolling tool to determine the success or failure of player actions. We'll implement it two ways to understand the difference between testing and production approaches.
Part 1: The Foundation (Same for Both Approaches)
Both MockLLM and RealLLM share the same foundation: tools, system prompt, parser, and agent loop structure.
The Tool: roll_dice
Every agent needs tools: functions it can call to interact with the world.
```python
import random

def roll_dice(sides=20):
    """
    Simulates rolling a die.

    Args:
        sides (int): The number of sides on the die (default 20).

    Returns:
        int: A random number between 1 and `sides`, inclusive.
    """
    return random.randint(1, sides)
```

Why it matters: when roll_dice() returns 17, that's a real result computed outside the model, not something the LLM imagined.
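The assignment doesn't require it, but if you want reproducible rolls for unit tests, one option is to seed Python's RNG:

```python
import random

random.seed(42)       # pin the RNG so test runs are repeatable
print(roll_dice())    # same value on every run with this seed
print(roll_dice(6))   # works for any die size
```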
The System Prompt
The LLM needs explicit instructions on how to behave:
```python
SYSTEM_PROMPT = """
You are a text adventure Dungeon Master. Your goal is to narrate an
exciting, immersive story for the player.

## Your Capabilities
You have access to the following tool:
- **roll_dice**: Rolls a 20-sided die and returns a number between 1 and 20.

## When to Use Tools
When the player attempts an action with an UNCERTAIN outcome
(attacking, jumping, persuading, etc.), you MUST:
1. First, explain your reasoning using "Thought:"
2. Then, call the tool using "Action: roll_dice"

Example:
Thought: The player is trying to attack the goblin. This requires a roll to determine success.
Action: roll_dice

## How to Interpret Results
After receiving the Observation (dice result), narrate the outcome:
- **1-5 (Critical Fail):** Action fails spectacularly
- **6-10 (Fail):** Action fails, but not catastrophically
- **11-15 (Success):** Action succeeds
- **16-20 (Critical Hit):** Action succeeds spectacularly

## When NOT to Use Tools
For simple conversation, exploration with no risk, or describing the
environment, respond directly without using tools.

Stay in character as a dramatic, engaging Dungeon Master!
"""
```
The Parser
Detects if the LLM is requesting a tool:
```python
def parse_response(llm_output):
    """Returns True if the LLM requested roll_dice, False otherwise."""
    return "action: roll_dice" in llm_output.lower()
```

The Agent Loop
The generic structure that works with any LLM:
```python
def run_turn(user_input, history=None):
    """Executes a single turn of the agent."""
    if history is None:  # avoid the mutable-default-argument pitfall
        history = []

    # Build context
    context = "\n".join(history) + f"\nUser: {user_input}\n"

    # First LLM call
    response = llm.generate(context)
    print(f"🤖 Agent: {response}")

    # Check for tool use
    if parse_response(response):
        # Execute tool
        dice_result = roll_dice()
        print(f"🎲 [Tool Executed] roll_dice() returned: {dice_result}")

        # Add observation
        observation = f"Observation: {dice_result}"
        context = context + f"\nAgent: {response}\n{observation}\n"

        # Second LLM call with result
        response = llm.generate(context)
        print(f"🤖 Agent (Final): {response}")

    return response
```

Key point: the loop doesn't care whether it's talking to MockLLM or RealLLM; it just calls llm.generate().
Part 2: The Two Approaches - MockLLM vs RealLLM
Now here's where things diverge. Let's compare how each approach implements the llm object and why they behave differently.
Approach 1: MockLLM (Testing Implementation)
What It Is
A fake LLM that uses pattern matching to simulate responses. No API required, no costs, purely for testing your agent loop logic.
Complete Code
```python
class MockLLM:
    """A fake LLM that responds predictably for testing logic."""

    def generate(self, prompt):
        # If the prompt shows we just rolled a die (Observation), narrate
        if "Observation:" in prompt:
            return "The die rolls... a critical hit! The goblin falls."
        # If the user says "attack", the LLM decides to roll
        if "attack" in prompt.lower():
            return "Thought: The player is attacking. I need to check if they hit. Action: roll_dice"
        # Default conversation
        return "What do you want to do next?"

# Initialize
llm = MockLLM()
```

How It Works
MockLLM uses simple pattern matching:
- Checks if "attack" is in the input → returns action request
- Checks if "Observation:" is in the prompt → returns hardcoded narrative
- Otherwise → returns generic response
No actual reasoning, just if/else logic.
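A quick sanity check exercises all three branches; the outputs are the hardcoded strings from the class above:

```python
mock = MockLLM()

print(mock.generate("User: I attack the goblin!"))
# Thought: The player is attacking. I need to check if they hit. Action: roll_dice

print(mock.generate("Agent: ...\nObservation: 3"))
# The die rolls... a critical hit! The goblin falls.

print(mock.generate("User: Hello there"))
# What do you want to do next?
```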
Execution Flow: "I attack the goblin!"
```text
┌──────────────────────────────────┐
│ User: "I attack the goblin!"     │
└───────────────┬──────────────────┘
                │
                ▼
┌────────────────────────────────────────────┐
│ llm.generate("User: I attack...")          │
│ Checks: "attack" in prompt.lower()?        │
│ → YES                                      │
│ Returns: "Thought:... Action: roll_dice"   │
└───────────────┬────────────────────────────┘
                │
                ▼
┌──────────────────────────────────┐
│ parse_response() → True          │
│ Execute: dice_result = 3         │ ← Let's say we roll a 3
└───────────────┬──────────────────┘
                │
                ▼
┌─────────────────────────────────────────────┐
│ llm.generate("...Observation: 3")           │
│ Checks: "Observation:" in prompt?           │
│ → YES                                       │
│ Returns: "...critical hit! Goblin falls."   │ ← Hardcoded! Ignores the 3!
└─────────────────────────────────────────────┘
```

The Problem
MockLLM doesn't read the dice value. Whether the observation is 3 (critical fail) or 20 (critical hit), it always returns:

```python
"The die rolls... a critical hit! The goblin falls."
```

It just checks for the keyword "Observation:" and returns a fixed string.
Approach 2: Real LLM (Production Implementation with Groq)
What It Is
A real neural network (Llama 3.3, 70 billion parameters) that:
- Actually reads and understands the conversation
- Follows the system prompt rules dynamically
- Generates contextual responses based on dice results
Complete Code
```python
from groq import Groq
import os

class RealLLM:
    """A real LLM using the Groq API for fast inference."""

    def __init__(self, api_key, model="llama-3.3-70b-versatile"):
        self.client = Groq(api_key=api_key)
        self.model = model
        # Initialize conversation with system prompt
        self.messages = [{"role": "system", "content": SYSTEM_PROMPT}]

    def generate(self, prompt):
        """Generate a response from Groq."""
        # Add user message to conversation history
        self.messages.append({"role": "user", "content": prompt})
        # Get response from Groq
        response = self.client.chat.completions.create(
            model=self.model,
            messages=self.messages
        )
        assistant_message = response.choices[0].message.content
        # Add assistant response to conversation history
        self.messages.append({"role": "assistant", "content": assistant_message})
        return assistant_message

# Initialize
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
llm = RealLLM(api_key=GROQ_API_KEY)
```

How It Works
RealLLM uses API calls to a neural network:
- Maintains full conversation history in self.messages
- Sends the entire conversation to Groq servers
- Llama 3.3 reads and understands the context
- Generates a response following the system prompt rules
- Stores the response in conversation history
Actual reasoning, not pattern matching.
Execution Flow: "I attack the goblin!" (with dice = 3)
```text
┌──────────────────────────────────┐
│ User: "I attack the goblin!"     │
└───────────────┬──────────────────┘
                │
                ▼
┌─────────────────────────────────────────────────────┐
│ llm.generate("User: I attack...")                   │
│ Sends to Groq:                                      │
│ [{"role": "system", "content": SYSTEM_PROMPT},      │
│  {"role": "user", "content": "User: I attack..."}]  │
│                                                     │
│ Llama 3.3 reads:                                    │
│ - System: "For uncertain outcomes, roll dice"       │
│ - User: "I attack the goblin"                       │
│ - Reasoning: "Attack = uncertain, need roll"        │
│                                                     │
│ Returns: "Thought:... Action: roll_dice"            │
└───────────────┬─────────────────────────────────────┘
                │
                ▼
┌──────────────────────────────────┐
│ parse_response() → True          │
│ Execute: dice_result = 3         │ ← Let's say we roll a 3
└───────────────┬──────────────────┘
                │
                ▼
┌──────────────────────────────────────────────────────┐
│ llm.generate("...Observation: 3")                    │
│ Sends to Groq:                                       │
│ [...previous messages...,                            │
│  {"role": "user", "content": "Observation: 3"}]      │
│                                                      │
│ Llama 3.3 reads:                                     │
│ - Previous: "I need to roll"                         │
│ - Observation: 3                                     │ ← Actually reads this!
│ - System rules: "1-5 = Critical Fail"                │
│ - Reasoning: "3 is in 1-5 range, spectacular fail"   │
│                                                      │
│ Returns: "You swing wildly and miss! Your            │
│ blade clatters harmlessly against the cave           │
│ wall. The goblin cackles with glee!"                 │ ← Contextual!
└──────────────────────────────────────────────────────┘
```

Why It's Different
RealLLM reads the dice value (3) and applies the system prompt rule (1-5 = Critical Fail), generating a narrative that matches the roll.
Part 3: Side-by-Side Comparison
Code Differences
| Aspect | MockLLM | RealLLM |
|---|---|---|
| Import | None | from groq import Groq |
| Initialization | llm = MockLLM() | llm = RealLLM(api_key=...) |
| Storage | No state (stateless) | self.messages (conversation history) |
| Logic | Pattern matching (if "attack" in prompt) | Neural network inference |
| API Call | None | self.client.chat.completions.create(...) |
| Cost | Free | Pay-per-token |
| Speed | Instant | ~1-2 seconds (Groq is fast!) |
Behavior Differences
Same scenario: "I attack the goblin!" with dice roll = 3
| Step | MockLLM | RealLLM |
|---|---|---|
| First Response | "Thought:... Action: roll_dice" | "Thought:... Action: roll_dice" |
| Dice Roll | 3 | 3 |
| Reads Value? | ❌ No (ignores it) | ✅ Yes (actually reads "3") |
| Applies Rules? | ❌ No | ✅ Yes (sees 3 → Critical Fail) |
| Final Narrative | "...critical hit! Goblin falls." ❌ WRONG | "You swing wildly and miss..." ✅ CORRECT |
When to Use Which
Use MockLLM when:
- 📚 Learning agent architecture
- 🧪 Testing agent loop logic
- 🔧 Debugging prompt parsing
- 💰 Avoiding API costs during development
- ⚡ You need instant responses for unit tests
Use RealLLM when:
- 🚀 Building production applications
- 🎮 Demonstrating dynamic AI behavior
- 📈 Showcasing true agent capabilities
- 🌐 Creating actual user experiences
- 🎯 You need contextual, varied responses
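Because both classes expose the same generate() method, the switch can live in one place. A common wiring pattern (my own sketch, not part of the assignment) is an environment flag:

```python
import os

# Hypothetical toggle: mock for offline tests, Groq for production
if os.getenv("USE_REAL_LLM") == "1":
    llm = RealLLM(api_key=os.environ["GROQ_API_KEY"])
else:
    llm = MockLLM()
```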
Part 4: How RealLLM Actually Works (Deep Dive)
Let's trace through a complete interaction to see what happens under the hood.
Scenario: "I attack the goblin!" (dice rolls 17)
Step 1: Initialization
```python
llm = RealLLM(api_key=GROQ_API_KEY)
```

What happens:
- Creates the Groq client connection
- Initializes self.messages with the system prompt

State:

```python
self.messages = [
    {"role": "system", "content": "You are a Dungeon Master..."}
]
```

Step 2: First LLM Call
```python
response = llm.generate("User: I attack the goblin!")
```

Inside generate():

1. Add user message:

```python
self.messages.append({
    "role": "user",
    "content": "User: I attack the goblin!"
})
```

2. Call Groq API:

```python
response = self.client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=self.messages  # [system, user]
)
```

3. Groq processes:
- Llama 3.3 reads system prompt + user input
- Recognizes "attack" needs a dice roll
- Generates: "Thought: The player is attacking... Action: roll_dice"

4. Store response:

```python
self.messages.append({
    "role": "assistant",
    "content": "Thought: The player is attacking... Action: roll_dice"
})
```
State:

```python
self.messages = [
    {"role": "system", "content": "You are a DM..."},
    {"role": "user", "content": "User: I attack the goblin!"},
    {"role": "assistant", "content": "Thought:... Action: roll_dice"}
]
```

Step 3: Tool Execution

```python
dice_result = roll_dice()  # Returns 17
observation = "Observation: 17"
```

Step 4: Second LLM Call

```python
response = llm.generate("...\nObservation: 17")
```

Inside generate() (second time):
1. Add observation:

```python
self.messages.append({
    "role": "user",
    "content": "Observation: 17"
})
```

2. Call Groq API:

```python
response = self.client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=self.messages  # [system, user, assistant, observation]
)
```

3. Groq processes:
- Llama 3.3 reads the full conversation
- Sees Observation: 17
- Applies the rule: "16-20 = Critical Hit"
- Generates a contextual narrative: "Your blade arcs through the air with devastating precision..."

4. Store response:

```python
self.messages.append({
    "role": "assistant",
    "content": "Your blade arcs through the air..."
})
```
Final State:
```python
self.messages = [
    {"role": "system", "content": "You are a DM..."},
    {"role": "user", "content": "User: I attack the goblin!"},
    {"role": "assistant", "content": "Thought:... Action: roll_dice"},
    {"role": "user", "content": "Observation: 17"},
    {"role": "assistant", "content": "Your blade arcs through the air..."}
]
```

This conversation history persists: if the player makes another action, the entire history is sent to Groq, giving the LLM full context!
Why Conversation History Matters
Without history (MockLLM):
```python
# Each call is independent, no memory
generate("User: I attack")   # Returns action request
generate("Observation: 17")  # Doesn't know about previous attack!
```

With history (RealLLM):

```text
# Each call builds on previous
self.messages:
1. System: "You're a DM"
2. User: "I attack"
3. Assistant: "Action: roll_dice"
4. User: "Observation: 17"   ← LLM knows this relates to the attack!
5. Assistant: "Critical hit! ..."
```

The LLM can reason: "They attacked, rolled 17, that's a crit hit, I should narrate accordingly."
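You can watch the history grow turn by turn. A small check (assuming the classes above and a working API key):

```python
llm = RealLLM(api_key=GROQ_API_KEY)
print(len(llm.messages))  # 1: just the system prompt

run_turn("I attack the goblin!")
print(len(llm.messages))  # 5 if a die was rolled (two LLM calls), 3 otherwise
```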
Part 5: Practical Examples
MockLLM: Always the Same
No matter what we roll, MockLLM gives the same output:
```python
llm = MockLLM()

# Roll 1 (critical fail)
run_turn("I attack the goblin!")
# 🎲 [Tool Executed] roll_dice() returned: 2
# 🤖 Agent (Final): The die rolls... a critical hit! The goblin falls.

# Roll 2 (critical fail again)
run_turn("I attack the goblin!")
# 🎲 [Tool Executed] roll_dice() returned: 4
# 🤖 Agent (Final): The die rolls... a critical hit! The goblin falls.

# Roll 3 (actual critical hit)
run_turn("I attack the goblin!")
# 🎲 [Tool Executed] roll_dice() returned: 20
# 🤖 Agent (Final): The die rolls... a critical hit! The goblin falls.
```

Same narrative regardless of dice value! ❌
RealLLM: Dynamic Responses
With Groq, narratives change based on the actual roll:
```python
llm = RealLLM(api_key=GROQ_API_KEY)

# Critical Fail (dice = 3)
run_turn("I attack the dragon!")
# 🎲 [Tool Executed] roll_dice() returned: 3
# 🤖 Agent (Final): You swing your sword with confidence, but your foot
# catches on a stone! You stumble forward, your blade clattering harmlessly
# against the cave wall. The dragon roars with amusement!

# Fail (dice = 8)
run_turn("I attack the dragon!")
# 🎲 [Tool Executed] roll_dice() returned: 8
# 🤖 Agent (Final): Your strike is solid, but the dragon is quicker than
# expected. It dodges, and your sword whooshes through empty air.

# Success (dice = 13)
run_turn("I attack the dragon!")
# 🎲 [Tool Executed] roll_dice() returned: 13
# 🤖 Agent (Final): Your blade finds its mark! You slice across the dragon's
# scales, drawing green blood. It screeches in pain and stumbles backward.

# Critical Hit (dice = 20)
run_turn("I attack the dragon!")
# 🎲 [Tool Executed] roll_dice() returned: 20
# 🤖 Agent (Final): With perfect precision, your sword pierces through the
# dragon's armor! It crumples to the ground with a final gasp. Victory is yours!
```

Different narratives for each dice value! ✅
Conclusion
What We Learned
1. MockLLM vs RealLLM is about purpose:
- MockLLM: Testing your agent architecture without API costs
- RealLLM: Actual production behavior with dynamic reasoning
2. The agent loop stays the same:
```python
response = llm.generate(context)  # Works with both!
```

Good abstraction means switching from Mock to Real is just changing one line.
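If you want that contract to be explicit and checkable by a static type checker, one option beyond the assignment is a typing.Protocol:

```python
from typing import Protocol

class LLMBackend(Protocol):
    """Anything with generate(prompt) -> str can drive the agent loop."""
    def generate(self, prompt: str) -> str: ...

def make_llm(use_real: bool) -> LLMBackend:
    # MockLLM and RealLLM both satisfy the protocol via duck typing
    return RealLLM(api_key=GROQ_API_KEY) if use_real else MockLLM()
```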
3. Conversation history is crucial:
- MockLLM: Stateless (no memory)
- RealLLM: Stateful (self.messages tracks everything)
4. Real LLMs actually read and apply rules:
- They see Observation: 17
- Apply 16-20 = Critical Hit
- Generate a contextual narrative
Try It Yourself
See the full Groq implementation: EngGenAI_assignment_3_agents_grok.ipynb
The notebook includes:
- ✅ Complete setup instructions
- ✅ API key configuration
- ✅ Interactive demos
- ✅ Forced dice scenarios (test all outcome ranges; see the sketch below)
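The notebook's exact mechanism may differ, but one simple way to force a scenario is to temporarily swap roll_dice for a stub:

```python
def force_roll(value):
    """Build a stand-in die that always lands on `value` (testing only)."""
    def fixed_roll(sides=20):
        return value
    return fixed_roll

original_roll = roll_dice
roll_dice = force_roll(1)         # guarantee a critical fail
run_turn("I attack the goblin!")  # run_turn picks up the stubbed global
roll_dice = original_roll         # restore the real die
```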
Next Steps
To extend this agent:
- Multiple tools - Add check_inventory, cast_spell, search_room (see the sketch after this list)
- Memory persistence - Save conversation history to a database
- Streaming responses - Show text as it generates
- Error handling - What if Groq API fails?
- Multi-turn context - Remember what happened 3 turns ago
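As a starting point for the first item, here is a minimal sketch of a tool registry; check_inventory and cast_spell are hypothetical stubs invented for illustration:

```python
import re

def check_inventory():  # hypothetical stub
    return "You carry a sword, a torch, and 3 gold coins."

def cast_spell(name="firebolt"):  # hypothetical stub
    return f"A {name} streaks from your fingertips!"

TOOLS = {
    "roll_dice": roll_dice,
    "check_inventory": check_inventory,
    "cast_spell": cast_spell,
}

def parse_action(llm_output):
    """Extract the tool name after 'Action:', if it names a known tool."""
    match = re.search(r"Action:\s*(\w+)", llm_output)
    return match.group(1) if match and match.group(1) in TOOLS else None

tool = parse_action("Thought: I should check the roll. Action: roll_dice")
if tool:
    print(f"Observation: {TOOLS[tool]()}")
```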
The architecture you learned here powers real production systems like ChatGPT's Code Interpreter, Claude's tool use, and autonomous research agents.
Happy adventuring! 🎲