TECHIN 510 — Spring 2026

From Single Calls
to Agentic AI

Week 7: Integrating AI & LLM Features
University of Washington • Global Innovation Exchange
01 / 34
What You Will Learn

Learning Objectives

1. Explain the Messages API structure: system prompt, messages array, model parameters
2. Describe the plan–act–observe agentic loop: trace the loop through real code in our starter kit
3. Implement a tool definition: name, description, and input schema that an LLM can call reliably
4. Place a design on the orchestration spectrum: justify choosing the simplest level that solves the problem
02 / 34
What You Will Learn

Learning Objectives (continued)

5. Compare two system prompt versions: identify what each change improves in behavior and cost
6. Write evaluation cases for an AI feature: distinguish correct from incorrect behavior with pass/fail criteria
7. Identify failure modes and design mitigations: hallucination, stale data, graceful degradation strategies

Today: lecture + demo (45 min) then hands-on lab (115 min)

03 / 34

LLM APIs

Calling Intelligence from Your Application
04 / 34
Mental Model

"An LLM API works like a calculator — except the input is language and the output is generated text. You send a message in, you get a message back. The intelligence lives somewhere else."

The system prompt is how you program the calculator itself.

05 / 34
API Structure

Messages API Anatomy

System Prompt

Persistent instructions that shape every response. Defines persona, capabilities, and guardrails.

The user never sees it. But it determines personality, boundaries, and behavior of your entire app.

Messages Array

The full conversation history — every user message and assistant response, sent fresh each call.

Roles: user, assistant. The model reads the entire array to understand context.

Key parameters: model (capability vs cost), max_tokens (response length), temperature (randomness)

06 / 34
Code Example

Making an API Call

import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from env

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=SYSTEM_PROMPT,        # persistent instructions
    messages=[                   # conversation history
        {"role": "user", "content": user_input}
    ],
    temperature=0.3,            # lower = more deterministic
)

Same pattern every time: system prompt + messages + parameters → response. The API key lives in .env, never in source code.

07 / 34
Key Concept

LLMs Are Stateless

The model does not remember your previous conversation.
Your app maintains memory — the model does not.

What This Means
  • Every API call re-sends the entire conversation
  • st.session_state.messages builds up the history
  • Longer conversations = more tokens per call
Design Implication
  • Cost grows with conversation length
  • Eventually hits the context window limit
  • Session memory vanishes when the tab closes
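The points above can be sketched in a few lines. This is a minimal illustration, not starter-kit code — the helper name `build_request_messages` is ours. Because the model is stateless, every request must carry the full history; in the Streamlit app, `history` is what lives in `st.session_state.messages`.

```python
def build_request_messages(history, user_input):
    """Build the messages array for the next API call.

    The model is stateless, so each request must contain the FULL
    conversation so far plus the new user turn.
    """
    return history + [{"role": "user", "content": user_input}]

# Simulate three turns; each request re-sends everything accumulated so far.
history = []
for turn in ["Hi", "What are the hours?", "Thanks"]:
    request = build_request_messages(history, turn)
    # In the real app, client.messages.create(messages=request) goes here;
    # the assistant's reply is appended to history as well.
    history = request + [{"role": "assistant", "content": f"(reply to {turn!r})"}]

print(len(history))  # 6 entries — tokens per call grow with every turn
```

Notice that the third request already contains four prior messages: that growth is exactly why cost rises with conversation length.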
08 / 34
Economics

Cost Tracking

Pricing (per 1M tokens)
Model    Input    Output
Sonnet   $3.00    $15.00
Haiku    $0.80    $4.00
Scale Projection

100 users × 5 queries/day × 30 days

The API returns exact token counts: response.usage.input_tokens and output_tokens. Divide by 1M, multiply by rate.

Output tokens cost 5× as much as input tokens for Sonnet — generating text costs more than reading it.

Token — the basic unit LLMs use to process text. Roughly ¾ of a word in English. Both your input (prompt) and the model's output consume tokens, and you pay for both.
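The arithmetic can be scripted directly from the usage counts the API returns. A small sketch using the prices in the table above; the 800-in / 300-out token counts are illustrative assumptions, not measurements:

```python
# Prices per 1M tokens, from the table above: (input, output)
PRICES = {"sonnet": (3.00, 15.00), "haiku": (0.80, 4.00)}

def call_cost(model, input_tokens, output_tokens):
    """Dollar cost of one call — use response.usage for exact counts."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# Assumed typical query: 800 tokens in, 300 tokens out, on Sonnet
per_query = call_cost("sonnet", 800, 300)

# Scale projection from the slide: 100 users × 5 queries/day × 30 days
monthly = per_query * 100 * 5 * 30
print(f"${per_query:.4f}/query → ${monthly:.2f}/month")
```

Under these assumptions a single query costs about $0.0069 and the projected month lands near $103 — a useful order-of-magnitude check before you ship.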

09 / 34

Agentic AI

Tool Use & Function Calling
10 / 34
The Key Distinction

Chatbot vs. Agent

Chatbot
  • Question in, answer out
  • Generates text from training data
  • No access to real-time information
  • Cannot take actions in the world

"What are the makerspace hours?" → Guesses from general knowledge

Agent
  • Goal-directed behavior
  • Can call tools to get real data
  • Plans, acts, observes, decides
  • Same model — different control structure

"What are the makerspace hours?" → Calls lookup_lab_policy, returns real data

11 / 34
Core Pattern

The Agentic Loop

1. User sends a message
2. LLM PLANS: "I need to check equipment"
3. APP ACTS: your code calls the tool
4. LLM OBSERVES: sees the result, decides what's next
5. Done? Yes → DONE (text response). No → loop again.

Plan → Act → Observe → repeat until done. Every agentic system follows this pattern.
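The loop is small enough to write out. In this sketch the model and the tool runner are injected as plain functions so it runs without an API key; the shapes (`stop_reason`, `tool_use` / `tool_result` blocks) mirror the Anthropic Messages API, but here they are plain dicts and the stub responses are ours:

```python
def agent_loop(call_model, run_tool, messages):
    """Plan → Act → Observe, until the model stops asking for tools."""
    while True:
        stop_reason, content = call_model(messages)            # PLAN
        if stop_reason != "tool_use":
            return content                                     # DONE
        messages.append({"role": "assistant", "content": content})
        results = []
        for block in content:
            if block["type"] == "tool_use":
                out = run_tool(block["name"], block["input"])  # ACT
                results.append({"type": "tool_result",
                                "tool_use_id": block["id"],
                                "content": out})
        messages.append({"role": "user", "content": results})  # OBSERVE

# Stub model: asks for one tool call, then answers.
calls = iter([
    ("tool_use", [{"type": "tool_use", "id": "t1",
                   "name": "check_equipment_status",
                   "input": {"equipment_name": "laser cutter"}}]),
    ("end_turn", "The laser cutter is available."),
])
answer = agent_loop(lambda m: next(calls),
                    lambda name, args: "available",
                    [{"role": "user", "content": "Is the laser cutter free?"}])
print(answer)  # The laser cutter is available.
```

Swapping the two lambdas for a real API call and a real tool dispatcher is all the starter kit adds — the control flow is identical.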

12 / 34
Systems Thinking & Agentic AI

Same Loop, New Plane

Plane 1 (Week 2)

You → Plan → Act → Observe → You

Agent helps you write code.
Tools: file edit, terminal, web search.
Stakes: your codebase.

Plane 2 (Week 7)

User → LLM Plans → Tool Call → LLM Observes → User

Agent inside your product.
Tools: DB queries, APIs, docs search.
Stakes: user data and trust.

Identical control flow. Different user, different tools, different stakes.

13 / 34
The Contractor Analogy

"Think of the LLM as a general contractor. The contractor doesn't swing hammers personally. They look at the blueprint, decide 'we need electrical work,' call the electrician, inspect the result, and decide what happens next. The tools are the subcontractors. Your code is the job site."

Same model as a chatbot. Same weights, same training. The difference is entirely in how we structure the conversation.

14 / 34
Live Demo

Single Tool Use

🎯 Predict before we run: Look at the user's request and the available tools. Which tool will the agent call first? Why?
1. User asks: "Is the laser cutter available right now?" A factual question the model cannot answer from training data alone.
2. LLM decides to call check_equipment_status. The model selects the tool and fills in the argument: equipment_name: "laser cutter".
3. App executes the tool and returns the result to the model. The tool result becomes part of the conversation — the model "observes".
4. LLM composes a grounded answer. stop_reason == "end_turn" — the model is done.
15 / 34
Live Demo

Multi-Step Reasoning

"I want to do some laser cutting this week. Is the laser cutter available, and what training do I need first?"

Tool Call 1

check_equipment_status

Checks if the laser cutter is available, in use, or under maintenance

Tool Call 2

lookup_lab_policy

Retrieves the training requirements for the laser cutter

Nobody told it to call two tools. The model read the question, determined availability and training are two different things, and planned accordingly.

16 / 34
Key Insight

When the Model Chooses Not to Use a Tool

No Tool Call

"What can you help me with?"

The model describes its capabilities from the system prompt. No external data needed.

Tool Call

"What are the makerspace hours?"

Factual question the model should not guess at. Calls lookup_lab_policy with topic "hours."

The model is reasoning about what information it has versus what information it needs. Not pattern-matching keywords to tools.

17 / 34
Code Example

Tool Definitions as Interface Design

EQUIPMENT_STATUS_TOOL = {
    "name": "check_equipment_status",
    "description": ("Check whether a specific piece of makerspace "
                    "equipment is available, in use, or under maintenance. "
                    "Use this tool when the user asks if equipment is "
                    "available."),
    "input_schema": {
        "type": "object",
        "properties": {
            "equipment_name": {
                "type": "string",
                "description": "Name of the equipment to check"
            }
        },
        "required": ["equipment_name"]
    }
}

The description is instructions for the LLM, not documentation for humans. Vague description = model uses the tool at the wrong time. This is the 40% planning phase.

Interface contract: This input_schema is the same pattern as CSV specs (W3), CTOC (W4), and Zod schemas (W6). Tools are RPC-style interfaces: name, input schema, output.

18 / 34
Architecture Decision

The Orchestration Spectrum

SIMPLICITY ─────────────────────────────────────────────────► CAPABILITY

Level                Example                Latency      Cost per query
Single LLM call      "Summarize this"       ~1 sec       $0.001
Single-tool agent    "Is the laser free?"   ~2–3 sec     $0.003
Multi-tool agent     "Free AND hours?"      ~3–5 sec     $0.005      ← Today's lab
Sequential chain     "Check, then book"     ~5–15 sec    $0.01–0.05
Multi-agent system   "Research + write"     ~15–60 sec   $0.05–0.50
19 / 34
Design Principle

"80% of problems need a single LLM call. You only move right on the spectrum when you have a specific, concrete reason — not because it sounds more impressive."

More complexity = more latency, higher cost, harder debugging. Start at the left.

20 / 34
Agentic AI

Human-in-the-Loop: The Safety Gate

USER PROMPT: "Book the laser cutter"
  → LLM PLANS: call book_equipment
  → APPROVAL GATE: "Confirm booking for Laser Cutter, 3pm?" [Approve] [Deny]
  → EXECUTE: tool runs safely
  → NEXT TURN

HITL is a design decision, not a patch. Place approval gates before irreversible actions. Full orchestration patterns in Week 9.
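A minimal sketch of such a gate. The `DANGEROUS_TOOLS` set and the `ask_user` callback are our assumptions, not starter-kit names — in the Streamlit app, `ask_user` would be the [Approve] / [Deny] button pair:

```python
DANGEROUS_TOOLS = {"book_equipment", "cancel_booking"}   # irreversible actions

def execute_with_gate(name, args, run_tool, ask_user):
    """Run a tool, pausing for human approval before irreversible ones."""
    if name in DANGEROUS_TOOLS and not ask_user(f"Confirm {name} with {args}?"):
        return "Action cancelled by user."
    return run_tool(name, args)                          # safe, or approved

# Deny path: the booking tool never executes.
result = execute_with_gate("book_equipment", {"slot": "3pm"},
                           run_tool=lambda n, a: "booked",
                           ask_user=lambda msg: False)
print(result)  # Action cancelled by user.
```

The design choice is where the gate lives: in your dispatcher, not in the prompt. A prompt-only guardrail can be talked around; a code-level gate cannot.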

21 / 34

RAG as a Tool

Retrieval-Augmented Generation
22 / 34
Architecture

RAG Pipeline

USER QUERY: "Safety requirements?"
  → EMBED: query → vector
  → VECTOR DATABASE: rank chunks by distance
        "Safety training is required..."   [close]
        "All users must complete..."       [close]
        "Laser maintenance schedule..."    [far]
        "3D printer filament types..."     [far]
      Return the top 2 closest chunks
  → INJECT INTO PROMPT: system prompt + retrieved chunks + user query
      (only the relevant paragraphs, not 200 pages)
  → Grounded answer (not guesswork)
23 / 34
Key Insight

RAG Is Just a Tool

RAG is not a separate system. It is a function called
search_documents() that your agent can call,
just like check_equipment_status.

The Problem

LLMs don't know your internal documents. Pasting everything into the prompt is expensive and noisy.

The Solution

Search first, include only what's relevant. Embeddings find meaning, not keywords. 3 paragraphs instead of 200 pages.

To add RAG to our Makerspace assistant, add one more tool to tools.py: search_makerspace_docs. The agentic loop handles the rest.
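A toy sketch of the retrieval step: the 3-dimensional "embeddings" below are made-up stand-ins for a real embedding model's output, but the ranking logic (cosine similarity, top-k) is the real pattern.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors — higher means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def search_makerspace_docs(query_vec, index, k=2):
    """Return the k chunks whose embeddings are closest to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item["vec"]),
                    reverse=True)
    return [item["text"] for item in ranked[:k]]

# Toy index — real chunks would carry vectors from an embedding model.
index = [
    {"text": "Safety training is required...", "vec": [0.9, 0.1, 0.0]},
    {"text": "Laser maintenance schedule...",  "vec": [0.1, 0.9, 0.0]},
    {"text": "3D printer filament types...",   "vec": [0.0, 0.1, 0.9]},
    {"text": "All users must complete...",     "vec": [0.8, 0.2, 0.1]},
]
chunks = search_makerspace_docs([1.0, 0.0, 0.0], index)
print(chunks)  # the two safety-related chunks
```

Only those top chunks get injected into the prompt — which is why the answer is grounded and the token bill stays small.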

24 / 34

Prompt Engineering
for Features

System prompts as product specifications
25 / 34
Prompt Iteration

V1 vs. V2 System Prompt

V1 — First Draft
  • Wall of text with bold labels
  • No guidance on when NOT to call tools
  • Vague tone: "be helpful and friendly"
  • Guardrails mixed into general text

Works — but wastes tokens on unnecessary tool calls and lacks consistency.

V2 — Tightened
  • Markdown headers: ## Tools, ## Guardrails
  • Explicit: "Do NOT call a tool unless clearly required"
  • Tone section: no emoji, no filler, concise
  • Guardrails in their own (Strict) section

Fewer unnecessary tool calls, consistent voice, testable requirements.
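A hypothetical V2-style prompt illustrating that structure — this is our sketch, not the course's actual prompt:

```python
# Illustrative V2 system prompt: markdown headers as section boundaries,
# negative-space rules, and a testable tone spec.
SYSTEM_PROMPT_V2 = """\
You are the GIX Makerspace assistant.

## Tools
Do NOT call a tool unless the user's message clearly requires it.
Use check_equipment_status only for availability questions.

## Tone
Concise and direct. No emoji, no filler.

## Guardrails (Strict)
- Never confirm a booking without explicit user approval.
- If a question is outside the makerspace, politely decline.
"""
```

Each section is now individually checkable: you can diff a V3 against V2 and know exactly which behavior you intended to change.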

26 / 34
What Changed

Three Key Differences

1. Structural Prompting: Markdown headers create cognitive boundaries. The model parses ## Tools and ## Guardrails as distinct sections, reducing misinterpretation.
2. Negative-Space Prompting (Cost Control): "Do NOT call a tool unless the user's message clearly requires it." Telling the model what NOT to do often outperforms only saying what to do. This is prompt engineering as cost engineering.
3. Explicit Tone Guidance: "No emoji, no filler" is testable. You can check any response and verify compliance. A brand voice spec for your AI feature.
27 / 34
Vocabulary

Prompt Engineering Techniques

CHAIN-OF-THOUGHT         "Think step by step before answering."
                         Forces reasoning, not just retrieval.
                         Use for: math, multi-step logic, decisions.

FEW-SHOT EXAMPLES        "Here are 3 examples of the format I want:
                          Input: X → Output: Y"
                         Use for: format compliance, classification.

NEGATIVE-SPACE           "Do NOT call a tool unless..."
                         "NEVER approve checkout without staff."
                         Use for: guardrails, cost control, safety.

STRUCTURAL PROMPTING     Use ## headers, numbered lists, bold
                         to create section boundaries in long prompts.
                         Use for: complex system prompts.
28 / 34
Responsible AI

Hallucination in Production

"A student asks 'Is the laser cutter available?' The bot checks the tool, gets 'available,' and says yes. The student walks to the makerspace. It's closed. The data was stale. The bot didn't hallucinate — it faithfully reported wrong data."

The problem: Users trust AI-generated text with the same confidence they trust printed text. No vocal hesitation, no "I think" or "let me double-check." Responses arrive with typographic authority.

Solution: Show tool call evidence in the UI. Let users verify the data source behind the answer. Design for graceful skepticism — give users the evidence, not just the conclusion.

29 / 34
Defensive Engineering

Failure Mode Design

1. Retry with Backoff: retry_api_call() — 3 retries, exponential backoff (1s, 2s, 4s). Standard resilience pattern for any external API dependency.
2. Graceful Degradation: specific error messages for auth errors, rate limits, and generic failures. The user never sees a raw stack trace. They see: "Rate limit reached. Please wait and try again."
3. Trust Features: expandable tool call evidence in the UI. "Used tool: check_equipment_status" with input and result visible. Progressive disclosure — answer first, evidence one click deeper.

Design question for your team: If the bot can't complete a request, what should it say? "Something went wrong" is lazy. Acknowledge, set expectations, provide alternatives.
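A plausible shape for the retry helper in item 1, assuming only what the slide states (3 retries, 1s/2s/4s backoff) — the flaky stub exists purely to demonstrate the behavior:

```python
import time

def retry_api_call(call, retries=3, base_delay=1.0):
    """Retry a flaky call with exponential backoff: 1s, 2s, 4s by default."""
    for attempt in range(retries):
        try:
            return call()
        except Exception:
            if attempt == retries - 1:
                raise                              # out of retries: surface it
            time.sleep(base_delay * 2 ** attempt)  # 1s, then 2s, then 4s

# Demo stub that fails twice, then succeeds (tiny delay to keep the demo fast).
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("transient network error")
    return "ok"

result = retry_api_call(flaky, base_delay=0.01)
print(result, "after", len(attempts), "attempts")  # ok after 3 attempts
```

Note the final `raise`: after the last attempt the error surfaces so the graceful-degradation layer (item 2) can translate it into a user-facing message.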

30 / 34
Verification

Testing Your Agent

1. Mock your tools: replace real API calls with predictable fake responses. Test the agent's logic, not the API.
2. Verify tool selection: given "What's the weather?", does the agent pick get_weather — not search_inventory?
3. Test multi-step reasoning: does the agent chain tools correctly? (e.g., search → filter → respond)
This is PRIMM for agents: Predict which tool the agent will call → Run the test → Investigate if the wrong tool was selected → Modify the system prompt → Make it robust.
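Step 1 in code: a fake tool registry (our naming, not the starter kit's) lets you exercise the dispatch path without touching a real API:

```python
# Predictable fakes stand in for the real tool implementations.
def fake_check_equipment_status(equipment_name):
    return {"laser cutter": "available", "3d printer": "in use"}.get(
        equipment_name, "unknown")

TOOLS = {"check_equipment_status": fake_check_equipment_status}

def run_tool(name, args):
    """Dispatch a model-requested tool call; fail loudly on unknown names."""
    if name not in TOOLS:
        raise KeyError(f"Model requested unknown tool: {name}")
    return TOOLS[name](**args)

assert run_tool("check_equipment_status",
                {"equipment_name": "laser cutter"}) == "available"
assert run_tool("check_equipment_status",
                {"equipment_name": "band saw"}) == "unknown"
```

The loud failure on unknown tool names matters: a model hallucinating a tool should crash your test, not silently no-op.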
31 / 34
Testing & Validation

Mini Eval Set

Prompt                                       Expected Tool    Expected Behavior                       Pass?
"What safety training for laser cutter?"     search_docs      Cites safety doc, lists requirements
"Book the 3D printer for tomorrow 2pm"       book_equipment   Confirms date, asks for duration
"What's the weather in Seattle?"             none             Politely declines (off-topic)
"Delete all bookings for user 42"            none             Refuses (unauthorized action)
"Is the wood shop open on weekends?"         search_docs      Returns hours from FAQ

Three metrics: % correct tool called • % citing tool output • median response latency. Include at least 2 refusal cases.
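A minimal harness for scoring the "correct tool" metric. The keyword-routing stub below is NOT how the real model decides — it exists only to exercise the harness; a real run would plug in your actual agent:

```python
# Each case names the expected tool, or None for refusal/off-topic cases.
EVAL_CASES = [
    {"prompt": "What safety training for laser cutter?", "expected_tool": "search_docs"},
    {"prompt": "Is the wood shop open on weekends?",     "expected_tool": "search_docs"},
    {"prompt": "What's the weather in Seattle?",         "expected_tool": None},
]

def run_evals(cases, get_tool_called):
    """Fraction of cases where the agent picked the expected tool (or none)."""
    passed = sum(1 for c in cases
                 if get_tool_called(c["prompt"]) == c["expected_tool"])
    return passed / len(cases)

# Stand-in agent: keyword routing, purely to drive the harness.
def stub_agent(prompt):
    keywords = ("laser", "wood shop", "3d printer")
    return "search_docs" if any(k in prompt.lower() for k in keywords) else None

score = run_evals(EVAL_CASES, stub_agent)
print(f"{score:.0%} correct tool calls")  # 100% correct tool calls
```

When a case fails, the fix goes into the system prompt or tool descriptions — not into the eval case.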

32 / 34
Remember These

Key Takeaways

1. LLM APIs are stateless: your app maintains memory; the model does not. Every call re-sends the full conversation.
2. The agentic loop = plan – act – observe: same model as a chatbot, different control structure. Tools are what make it an agent.
3. Start simple on the orchestration spectrum: move right only when you have a specific user need. 80% of problems need a single LLM call.
4. A system prompt is a product specification: version it, test it, deploy it like code. Structural and negative-space prompting are your tools.
5. Test behavior, not code output: eval cases define expected behavior with pass/fail. When a test fails, fix the prompt — not the test.
33 / 34
Hands-On Time

Build the GIX
Makerspace Assistant

115 minutes — 4 levels from setup to eval

"The app you saw in the demo is the app you will build. The tools, the loop, the prompts — all yours to modify."

34 / 34