Today: lecture + demo (45 min) then hands-on lab (115 min)
"An LLM API works like a calculator — except the input is language and the output is generated text. You send a message in, you get a message back. The intelligence lives somewhere else."
The system prompt is how you program the calculator itself.
Persistent instructions that shape every response. Defines persona, capabilities, and guardrails.
The user never sees it. But it determines personality, boundaries, and behavior of your entire app.
The full conversation history — every user message and assistant response, sent fresh each call.
Roles: user, assistant. The model reads the entire array to understand context.
Key parameters: model (capability vs cost), max_tokens (response length), temperature (randomness)
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=SYSTEM_PROMPT,   # persistent instructions
    messages=[              # conversation history
        {"role": "user", "content": user_input}
    ],
    temperature=0.3,        # lower = more deterministic
)
Same pattern every time: system prompt + messages + parameters → response. The API key lives in .env, never in source code.
The model does not remember your previous conversation.
Your app maintains memory — the model does not.
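Because the model is stateless, multi-turn chat means your app appends each exchange and resends the full history on every call. A minimal sketch (the `chat` helper is illustrative, not part of the demo app; `client` is the Anthropic client from the earlier example):

```python
# App-side memory: the model is stateless, so the app resends
# the entire conversation history with every API call.
history = []

def chat(client, system_prompt, user_input):
    history.append({"role": "user", "content": user_input})
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=system_prompt,
        messages=history,  # full conversation, fresh each call
    )
    reply = response.content[0].text
    history.append({"role": "assistant", "content": reply})
    return reply
```

Each call grows `history` by two entries: the user turn and the assistant turn. Drop or summarize old turns and the model forgets them.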
| Model | Input ($/1M tokens) | Output ($/1M tokens) |
|---|---|---|
| Sonnet | $3.00 | $15.00 |
| Haiku | $0.80 | $4.00 |
100 users × 5 queries/day × 30 days = 15,000 queries/month
The API returns exact token counts: response.usage.input_tokens and output_tokens. Divide by 1M, multiply by rate.
Output tokens cost 5× more than input for Sonnet — generating text costs more than reading it.
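Putting the rates and the usage scenario together, a rough monthly estimate (the per-query token counts are assumptions for illustration):

```python
# Worked cost estimate for the 100 users x 5 queries/day x 30 days scenario.
# Tokens per query are assumed values, not measured ones.
queries = 100 * 5 * 30        # 15,000 queries/month
in_tok, out_tok = 500, 300    # assumed tokens per query

sonnet = (queries * in_tok / 1e6) * 3.00 + (queries * out_tok / 1e6) * 15.00
haiku  = (queries * in_tok / 1e6) * 0.80 + (queries * out_tok / 1e6) * 4.00

print(f"Sonnet: ${sonnet:.2f}/month")  # $90.00 (7.5M in + 4.5M out tokens)
print(f"Haiku:  ${haiku:.2f}/month")   # $24.00
```

Swap in your real `response.usage` numbers to replace the assumptions; the arithmetic stays the same.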
Token — the basic unit LLMs use to process text. Roughly ¾ of a word in English. Both your input (prompt) and the model's output consume tokens, and you pay for both.
"What are the makerspace hours?" → Guesses from general knowledge
"What are the makerspace hours?" → Calls lookup_lab_policy, returns real data
Plan → Act → Observe → repeat until done. Every agentic system follows this pattern.
You → Plan → Act → Observe → You
Agent helps you write code.
Tools: file edit, terminal, web search.
Stakes: your codebase.
User → LLM Plans → Tool Call → LLM Observes → User
Agent inside your product.
Tools: DB queries, APIs, docs search.
Stakes: user data and trust.
Identical control flow. Different user, different tools, different stakes.
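That shared control flow fits in a few lines. This is a conceptual sketch only: `run_llm` and `run_tool` are stand-ins, and the plan dictionary shape is invented for illustration, not a real API.

```python
# Plan -> Act -> Observe, stripped to its skeleton.
# run_llm and run_tool are stand-ins for real implementations.
def agent_loop(run_llm, run_tool, user_message):
    messages = [{"role": "user", "content": user_message}]
    while True:
        plan = run_llm(messages)                        # Plan
        if plan["type"] == "final_answer":
            return plan["text"]                         # done: answer the user
        result = run_tool(plan["tool"], plan["input"])  # Act
        messages.append({"role": "assistant", "content": str(plan)})
        messages.append({"role": "user", "content": f"Tool result: {result}"})  # Observe
```

Coding agent or product agent, only the contents of `run_tool` and the identity of the user change.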
"Think of the LLM as a general contractor. The contractor doesn't swing hammers personally. They look at the blueprint, decide 'we need electrical work,' call the electrician, inspect the result, and decide what happens next. The tools are the subcontractors. Your code is the job site."
Same model as a chatbot. Same weights, same training. The difference is entirely in how we structure the conversation.
"I want to do some laser cutting this week. Is the laser cutter available, and what training do I need first?"
check_equipment_status
Checks if the laser cutter is available, in use, or under maintenance
lookup_lab_policy
Retrieves the training requirements for the laser cutter
Nobody told it to call two tools. The model read the question, determined availability and training are two different things, and planned accordingly.
"What can you help me with?"
The model describes its capabilities from the system prompt. No external data needed.
"What are the makerspace hours?"
Factual question the model should not guess at. Calls lookup_lab_policy with topic "hours."
The model is reasoning about what information it has versus what information it needs. Not pattern-matching keywords to tools.
EQUIPMENT_STATUS_TOOL = {
    "name": "check_equipment_status",
    "description": (
        "Check whether a specific piece of makerspace equipment is "
        "available, in use, or under maintenance. Use this tool when "
        "the user asks if equipment is available."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "equipment_name": {
                "type": "string",
                "description": "Name of the equipment to check"
            }
        },
        "required": ["equipment_name"]
    }
}
The description is instructions for the LLM, not documentation for humans. Vague description = model uses the tool at the wrong time. This is the 40% planning phase.
Interface contract: This input_schema is the same pattern as CSV specs (W3), CTOC (W4), and Zod schemas (W6). Tools are RPC-style interfaces: name, input schema, output.
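Wired into the Messages API, the loop becomes a tool-use round trip. A sketch under stated assumptions: `run_tool` is a dispatcher you write over the tool functions in tools.py, and `tools` is the list of definitions like EQUIPMENT_STATUS_TOOL.

```python
# One agent run: send, execute any requested tools, feed results back,
# repeat until the model stops asking for tools.
def run_agent(client, system_prompt, tools, run_tool, user_input):
    messages = [{"role": "user", "content": user_input}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            system=system_prompt,
            tools=tools,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            # Final answer: return the text block
            return next(b.text for b in response.content if b.type == "text")
        # Act: run each requested tool. Observe: return results to the model.
        messages.append({"role": "assistant", "content": response.content})
        results = [
            {"type": "tool_result", "tool_use_id": b.id,
             "content": run_tool(b.name, b.input)}
            for b in response.content
            if b.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
```

Note the contract: the model never executes anything. It emits tool_use blocks; your code runs the functions and reports back via tool_result.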
"80% of problems need a single LLM call. You only move right on the spectrum when you have a specific, concrete reason — not because it sounds more impressive."
More complexity = more latency, higher cost, harder debugging. Start at the left.
HITL (human-in-the-loop) is a design decision, not a patch. Place approval gates before irreversible actions. Full orchestration patterns in Week 9.
RAG is not a separate system. It is a function called search_documents() that your agent can call, just like check_equipment_status.
LLMs don't know your internal documents. Pasting everything into the prompt is expensive and noisy.
Search first, include only what's relevant. Embeddings find meaning, not keywords. 3 paragraphs instead of 200 pages.
To add RAG to our Makerspace assistant, add one more tool to tools.py: search_makerspace_docs. The agentic loop handles the rest.
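A hypothetical definition for that tool, following the same schema pattern as EQUIPMENT_STATUS_TOOL (the name comes from the slide; the description text is illustrative):

```python
# RAG-as-a-tool: same interface contract as every other tool.
SEARCH_DOCS_TOOL = {
    "name": "search_makerspace_docs",
    "description": (
        "Search the makerspace documentation for passages relevant to "
        "the user's question. Use this tool for policy, safety, or "
        "how-to questions not covered by the other tools."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Natural-language search query"
            }
        },
        "required": ["query"]
    }
}
```

The embedding search lives inside the function body; from the model's point of view this is just one more tool it may choose to call.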
A bare-bones system prompt works, but it wastes tokens on unnecessary tool calls and lacks consistency.
A refined system prompt means fewer unnecessary tool calls, a consistent voice, and testable requirements.
CHAIN-OF-THOUGHT: "Think step by step before answering." Forces reasoning, not just retrieval. Use for: math, multi-step logic, decisions.
FEW-SHOT EXAMPLES: "Here are 3 examples of the format I want: Input: X → Output: Y." Use for: format compliance, classification.
NEGATIVE-SPACE: "Do NOT call a tool unless..." "NEVER approve checkout without staff." Use for: guardrails, cost control, safety.
STRUCTURAL PROMPTING: Use ## headers, numbered lists, and bold to create section boundaries in long prompts. Use for: complex system prompts.
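The four techniques combine naturally in one prompt. The example below is illustrative, not the demo app's actual prompt:

```python
# Example system prompt combining chain-of-thought, few-shot,
# negative-space, and structural prompting. Content is invented.
EXAMPLE_SYSTEM_PROMPT = """\
## Role
You are the Makerspace Assistant. Answer questions about equipment,
training, and policies.

## Reasoning
Think step by step before answering questions that involve multiple
requirements (e.g. availability AND training).

## Tool rules
- Do NOT call a tool unless the question requires makerspace data.
- NEVER confirm a booking without staff approval.

## Format examples
Input: "Is the 3D printer free?" -> Output: one-sentence status plus source.
Input: "What's the weather?" -> Output: polite refusal (off-topic).
"""
```

The ## headers give the model section boundaries to anchor on; the NOT/NEVER lines are the cost and safety guardrails.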
"A student asks 'Is the laser cutter available?' The bot checks the tool, gets 'available,' and says yes. The student walks to the makerspace. It's closed. The data was stale. The bot didn't hallucinate — it faithfully reported wrong data."
The problem: Users trust AI-generated text with the same confidence they trust printed text. No vocal hesitation, no "I think" or "let me double-check." Responses arrive with typographic authority.
Solution: Show tool call evidence in the UI. Let users verify the data source behind the answer. Design for graceful skepticism — give users the evidence, not just the conclusion.
Design question for your team: If the bot can't complete a request, what should it say? "Something went wrong" is lazy. Acknowledge, set expectations, provide alternatives.
| Prompt | Expected Tool | Expected Behavior | Pass? |
|---|---|---|---|
| "What safety training for laser cutter?" | search_docs | Cites safety doc, lists requirements | ✓ |
| "Book the 3D printer for tomorrow 2pm" | book_equipment | Confirms date, asks for duration | ✓ |
| "What's the weather in Seattle?" | none | Politely declines (off-topic) | ✓ |
| "Delete all bookings for user 42" | none | Refuses (unauthorized action) | ✓ |
| "Is the wood shop open on weekends?" | search_docs | Returns hours from FAQ | ✓ |
Three metrics: % correct tool called • % citing tool output • median response latency. Include at least 2 refusal cases.
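A minimal harness for the tool-choice metric might look like this; `ask` is a stand-in you implement that runs one prompt through your agent and returns the tool it called (or None) plus the response text:

```python
# Score the eval matrix: fraction of prompts where the agent
# called the expected tool (None means "no tool / refusal").
def run_evals(ask, cases):
    correct = 0
    for prompt, expected_tool in cases:
        tool_called, _response_text = ask(prompt)
        if tool_called == expected_tool:
            correct += 1
    return correct / len(cases)

cases = [
    ("What safety training for laser cutter?", "search_docs"),
    ("What's the weather in Seattle?", None),  # refusal case: no tool
]
```

Extend `cases` with the rest of the matrix, and track the citation and latency metrics the same way: one expected value per prompt, one pass/fail per run.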
"The app you saw in the demo is the app you will build. The tools, the loop, the prompts — all yours to modify."