Week 2: Claude Code, MCP, Context Engineering, Superpowers & Evaluation
University of Washington • Global Innovation Exchange
01 / 48
What You Will Learn
Learning Objectives
1
Name the four components of an agentic coding system
LLM, context window, tool calls, and agentic loop — and how they interact
2
Apply the 40/20/40 principle
Allocate time to planning, coding, and testing in the correct proportion
3
Use Claude Code’s Plan Mode
Review and approve an agent’s step-by-step plan before any code is written
4
Write or generate a CLAUDE.md file
Project-specific context that measurably improves agent output quality
02 / 48
What You Will Learn
Learning Objectives (continued)
5
Read a git diff from an autonomous agent
Explain every changed line to a teammate
6
Explain MCP (Model Context Protocol)
Give one concrete example of how it extends an agent beyond the local file system
7
Apply the AI-generated code evaluation checklist
Verify that agent output matches intent before shipping
8
Explain how Claude Code’s extension ecosystem works
Skills, agents, and rules that enforce professional workflow standards
03 / 48
Systems Thinking
Artifact Bridge: Week 1 → Week 2
Week 1 Artifacts
.cursor/rules — first system prompt
I-P-O diagram — first architecture view
Smoke test — 3 pass/fail checks
Week 2 Evolutions
CLAUDE.md — richer context file
Agentic loop — Plan → Act → Observe
TDD & eval checklist — structured verification
Nothing resets between weeks. Each artifact evolves into a more capable version.
04 / 48
Anatomy of Coding Agents
How agentic coding systems actually work
05 / 48
Agentic System Architecture
Four Components
1
LLM
The language model — generates plans, writes code, reasons about problems
2
Context Window
Everything the agent can “see” right now: your prompt, files, tool outputs, conversation history
3
Tool Calls
Structured actions the LLM can request: read a file, run a command, search the web, write code
4
Agentic Loop
The cycle that keeps running until the task is done: plan → act → observe → repeat
06 / 48
The Core Cycle
The Agentic Loop
The agent never writes your entire app in one shot. It reads, decides, writes, runs, observes — and loops.
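The loop can be sketched in a few lines of Python. This is a toy model, not Claude Code’s actual implementation: `decide_next_action` stands in for the LLM and `run_tool` stands in for real tool calls.

```python
def decide_next_action(history: list[str]) -> str:
    # Stand-in for the LLM: finish after a few observations.
    return "finish" if len(history) >= 4 else "work"

def run_tool(action: str) -> str:
    # Stand-in for a real tool call (read a file, run a command, ...).
    return "done" if action == "finish" else f"observed result of {action}"

def agentic_loop(task: str, max_steps: int = 10) -> list[str]:
    """Plan -> act -> observe, repeated until the task is done."""
    history = [f"task: {task}"]
    for _ in range(max_steps):
        action = decide_next_action(history)  # plan: pick the next action
        observation = run_tool(action)        # act: execute it
        history.append(observation)           # observe: feed the result back
        if observation == "done":
            break
    return history
```

The `max_steps` cap is the safety valve every real agent has: without it, a confused model loops forever.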
07 / 48
Agentic Engineering Workflow
The 40 / 20 / 40 Principle
“If you just type ‘build me an app’ and hit enter, the agent skips planning. The code might
run — but you won’t understand it. That’s the vibe coding hangover.”
40%
Planning & Research
20%
Coding
40%
Testing & Verification
The 40% planning phase is your insurance policy against the hangover.
08 / 48
Human-Agent Engineering Workflow
Research → Plan → Implement → Test
Phase
Your Job
Agent’s Job
Research (40%)
Define the problem, gather context
Read docs, search, analyze
Plan
Review and approve the plan
Propose step-by-step approach
Implement (20%)
Watch, redirect, review
Write and run code
Test (40%)
Define acceptance criteria
Run tests, fix failures
You are the supervisor. The agent is the junior engineer. Set direction, review output, catch mistakes.
09 / 48
Claude Code Deep Dive
A terminal-native agentic coding tool
10 / 48
Philosophy
Not a Chatbot
“Claude Code is not a chatbot that happens to write code. Think of it as a junior engineer who lives in
your terminal, has read every file in your project, and never gets tired.”
The key word is junior — it does exactly what you ask. Your
job is to be the supervisor: set direction, review
output, catch mistakes.
When Claude Code starts, it reads the current directory and looks for a CLAUDE.md file — the onboarding document you’d hand a new hire on day
one.
11 / 48
Claude Code Feature
Plan Mode
What It Does
The agent proposes a complete step-by-step plan and stops — it waits for your approval
before writing a single line of code.
# after Claude Code starts:
/plan            # slash command
shift+tab twice  # keyboard shortcut
When to Use It
Task touches more than one file
Involves user data or security
You’re not sure what the right approach is
Plan Mode is the most important feature for avoiding the vibe coding hangover.
12 / 48
Live Demo Prompt
Input Validation Task
Add input validation to all user-facing fields
in this weather app.
Validate that the city name field is non-empty
and contains only letters and spaces.
Show a clear error message inline if validation
fails.
Notice how this prompt is specific: it names the field, defines the
rule, and specifies how errors should appear. Compare this to “add validation” — the agent
would have to guess everything.
13 / 48
Tool Comparison
Claude Code vs. Cursor
CLAUDE CODE
Terminal-native agent (also has a desktop app)
Whole-project awareness
Autonomous multi-step execution
Plan Mode for complex tasks
Best for: large changes, refactoring, multi-file work
CURSOR
IDE with AI built in (also has a CLI version)
Inline editing & autocomplete with whole-project awareness
Fast file-level feedback
Visual diff review
Comes with more model options
Best for: focused edits, learning code, quick fixes
They’re different tools for different jobs. Professionals use both. You’ll graduate from this course
having used both.
14 / 48
Live Demo
Plan Mode → Autonomous Execution
15 / 48
Demo Task
The Prompt
Add a 5-day forecast section to this weather app.
Use the Open-Meteo API's daily forecast endpoint.
Claude Code will enter Plan Mode, propose its approach, and wait. We read the plan before approving anything.
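For orientation while reading the plan, this is roughly the request shape involved. The parameter names follow Open-Meteo’s public docs, but verify them against the agent’s plan before approving:

```python
from urllib.parse import urlencode

def forecast_url(lat: float, lon: float, days: int = 5) -> str:
    """Build an Open-Meteo daily-forecast request URL."""
    params = {
        "latitude": lat,
        "longitude": lon,
        "daily": "temperature_2m_max,temperature_2m_min",
        "forecast_days": days,
    }
    return "https://api.open-meteo.com/v1/forecast?" + urlencode(params)
```

Note there is no API key anywhere — Open-Meteo is free, which matters for the plan review on the next slide.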
16 / 48
Plan Review
Three Things to Check
1
Does it understand what I asked?
Five-day forecast, daily endpoint — correct?
2
Will it break what already works?
Adding below the existing section, not touching the current-weather call?
3
Any red flags?
Files you didn’t expect? API keys that shouldn’t be needed? (Open-Meteo is free.)
Approve by typing yes or pressing the approval key.
17 / 48
After Approval
Autonomous Execution
1
Stand back
Let Claude Code run. Watch the terminal output. Don’t touch the keyboard.
2
Run the app
streamlit run app.py — verify the 5-day forecast renders
3
Review the diff
git diff app.py — your audit trail. Know exactly what changed.
This diff adds input validation — the agent handled an
edge case you might have missed.
19 / 48
Watch Out
Two Rules
1
Don’t approve a plan you didn’t read
Plan Mode exists so you can catch mistakes before they happen. Skimming and hitting approve because it “looked long enough” is not engineering.
2
Don’t run the result before you review the diff
The diff is your audit trail. It’s what saves you at 11pm before a demo, because you’ll know exactly what changed.
20 / 48
Context Engineering
CLAUDE.md, commands, hooks, and the context stack
21 / 48
The Key Insight
Context Quality = Output Quality
“The single most important variable in the quality of your output is not which model you pick, or how
fast your laptop is. It’s context quality.”
Better Context → Better Output; Garbage In → Garbage Out
22 / 48
Context Engineering
CLAUDE.md — Bad vs. Good
Bad CLAUDE.md
# My App
Write good code. Use Python.
Make it work.
Don't break things.
Vague. No project info. Wastes a dedicated memory slot the agent reads every session.
Good CLAUDE.md
# Project: GIX Staff Portal
## Stack
- Python 3.11, Streamlit, SQLite
## Coding Standards
- PEP 8, 4-space indent
- Type annotations on all functions
## Constraints
- No external APIs (offline-first)
- User-friendly error messages
23 / 48
Decision Framework · REFERENCE
The CLAUDE.md Rule
“Would a competent senior engineer need to know this to work on my project
specifically?”
YES
Put it in CLAUDE.md
NO
Don’t bother
24 / 48
Context Window · REFERENCE
Context Compaction
Managing context
You are not stuck with one growing thread—use product features and workflow habits.
Clear past context — reset or clear chat history when old turns add noise or contradict the current task.
Compact past context — let the tool summarize older turns (same idea as the diagram); expect detail loss versus full transcripts.
Spawn subagents — delegate a subtask to a separate agent run so the main thread stays smaller and focused.
New session — start fresh to separate unrelated concerns (different features, research vs. implementation).
Anything that must always apply goes in CLAUDE.md, not in an early
chat message. CLAUDE.md is read fresh every session.
Compaction failure mode: Safety instructions can be dropped
when context is compressed.
Design “sticky” rules that survive compaction — put critical constraints in CLAUDE.md,
not just in chat history.
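A naive sketch of what compaction does to a conversation. Real tools summarize older turns with the model itself; here the summary is just a placeholder string, which is enough to show the failure mode:

```python
def compact(history: list[str], keep_last: int = 4) -> list[str]:
    """Replace older turns with a one-line summary; keep recent turns verbatim."""
    if len(history) <= keep_last:
        return history
    dropped = len(history) - keep_last
    # Detail in the dropped turns is lost — which is why sticky rules
    # belong in CLAUDE.md, not in early chat messages.
    return [f"[summary of {dropped} earlier turns]"] + history[-keep_last:]
```

Any constraint stated only in turn 2 of a 50-turn session lives inside that summary string after compaction — effectively gone.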
Custom Slash Commands
# .claude/commands/spec.md
# /spec — Write a specification before any code
When this command is invoked:
1. Ask the user what feature they want to build
2. Write a plain-English specification covering:
- What the feature does (user-facing behavior)
- What it does NOT do (explicit scope limits)
- Edge cases to handle
- Data inputs and outputs
3. Ask the user to confirm or revise the spec
4. Only begin implementation after explicit approval
The spec command lives in the repo — everyone on the team gets the same workflow.
26 / 48
Automated Quality Gates · REFERENCE
Hooks
Hooks handle automated consequences — things that happen
automatically after the agent edits a file.
Hook: PostToolUse (after any .py file is edited)
└── black app.py → auto-format code style
└── ruff check app.py → lint for common mistakes
└── mypy app.py → run type checking
The agent edits a file → it’s automatically formatted, linted, and type-checked. If there’s an
error, the agent sees it and fixes it in the same turn.
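A sketch of the dispatch this hook implies. The event name matches the slide, but the data structure below is illustrative — it is not Claude Code’s actual hook configuration format:

```python
import shlex

# Hook commands per event; {path} is filled in with the edited file.
HOOKS = {
    "PostToolUse": [
        "black {path}",       # auto-format code style
        "ruff check {path}",  # lint for common mistakes
        "mypy {path}",        # run type checking
    ],
}

def commands_for(event: str, path: str) -> list[list[str]]:
    """Expand the configured hook commands for one edited file."""
    return [shlex.split(t.format(path=path)) for t in HOOKS.get(event, [])]
```

The key property: the agent never chooses whether these run. Any non-zero exit feeds the error back into the same turn.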
27 / 48
Extending Claude Code
Skills, agents, rules & the superpowers ecosystem
28 / 48
From Building Blocks to Systems
What Are Superpowers?
VANILLA CLAUDE CODE
You write your own CLAUDE.md
You create commands one at a time
You configure hooks manually
Your setup stays in one project
WITH SUPERPOWERS / ECC
156 pre-built skills auto-loaded
13 specialized agents ready to invoke
Hierarchical rules across all projects
10+ hooks enforcing quality silently
33+ slash-command workflows
Context Engineering gives you the building blocks. Superpowers is what happens when someone packages hundreds
of them into a curated, opinionated system.
29 / 48
Skills
How Skills Work
A skill is a markdown file with YAML frontmatter. It lives in ~/.claude/skills/
and is injected into the agent’s context when its trigger matches.
---
name: brainstorming
trigger: auto
description: Design-first workflow
---
When the user asks to build a feature:
1. STOP. Do not write any code.
2. Ask clarifying questions.
3. Propose 2–3 design approaches.
4. Wait for explicit approval.
5. Only then begin implementation.
# Hard Gate
If the user says “just build it,”
remind them of the 40/20/40 principle
and refuse to proceed without a design.
KEY CONCEPTS
trigger: auto — the skill activates without being invoked
The “hard gate” — the skill literally refuses to write code until design
is approved
Skills can be domain-specific: python-patterns, scientific-visualization, metabolomics
Or process-oriented: brainstorming, TDD enforcement, debugging workflows
156 skills means the agent has domain expertise loaded before you type a single word. This is the difference
between a junior engineer and a junior engineer who read the company wiki.
Agents & Rules
Example: “Always use type annotations” in python/ rules
Example: “Prefer interfaces over types” in typescript/ rules
~/.claude/rules/
├── common/ # Always active
├── python/
│ └── standards.md # Active in .py
└── typescript/
└── standards.md # Active in .ts
Agents give the system specialized personas. Rules give it persistent standards. Together: every coding session
starts with expertise and discipline baked in.
31 / 48
Live Demo
The Brainstorming Gate
Watch what happens when the Superpowers brainstorming skill is active and we ask Claude Code to build a feature.
1
Prompt: “Add a dark mode toggle to this app”
A normal feature request. Without the skill, Claude would start coding immediately.
2
Skill activates — hard gate triggers
Claude STOPS. Asks clarifying questions. Proposes 2–3 design approaches. Refuses to write code until you approve a design.
3
User reviews and approves design
Only after explicit approval does implementation begin. The 40% planning phase is enforced by the system, not just by willpower.
4
Compare: disable the skill, run the same prompt
Without the skill, Claude immediately writes CSS and JS. No design. No clarifying questions. Classic vibe coding.
This is 40/20/40 enforced by architecture, not discipline.
32 / 48
Professional Development
Why This Matters
1
Vanilla tools are just the starting point
Every professional team customizes their tooling. Superpowers is one example of how senior engineers build leverage — not just features.
2
Architecture enforces process
Instead of relying on memory or discipline, you encode your team’s standards into the system. Design review happens because the tool demands it.
3
Your course workflow uses this
The brainstorming skill, TDD enforcement hooks, and /plan command are active in your lab environments. Now you know what’s running under the hood.
By the end of this course, you will have configured your own skills, hooks, and commands.
33 / 48
Model Context Protocol (MCP)
Connecting agents to the world
34 / 48
Model Context Protocol · REFERENCE
MCP Architecture
One consistent interface. The agent calls them all the same way.
Each MCP server is an independent service. The protocol is the boundary.
Same pattern returns in Week 7 when you build in-app agents with their own tool definitions.
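Under the hood, MCP messages are JSON-RPC. A sketch of the shape of a tool call — the tool name and arguments below are illustrative, not from a real server:

```python
import json

# The same "tools/call" shape works against any MCP server —
# that is the "one consistent interface" on this slide.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "browser_navigate",                # hypothetical tool name
        "arguments": {"url": "https://example.com"},
    },
}
print(json.dumps(request, indent=2))
```

The agent discovers which tool names exist by first asking the server to list its tools, then calls them all through this one method.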
35 / 48
Live Demos
MCP in Action
Playwright MCP
Give the agent a headless browser. It can navigate pages, click buttons, extract content, and test web apps
just like a human.
Figma MCP
Connect your agent to design files. It can read component structures, extract styles, and turn designs into
code directly.
These aren't just APIs—they are standardized tools the agent can discover and use autonomously.
36 / 48
Multi-Agent Systems
Orchestrating specialized workflows
37 / 48
Multi-Agent Systems
Orchestrator + Subagents
Each subagent is independent. The orchestrator carries the shared understanding.
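A minimal orchestrator sketch — the worker function is a stand-in for a real subagent run, but the fan-out/merge shape is the same:

```python
from concurrent.futures import ThreadPoolExecutor

def run_subagent(task: str) -> str:
    # Stand-in for spawning a real subagent on one independent piece.
    return f"result for {task}"

def orchestrate(tasks: list[str]) -> dict[str, str]:
    """Fan independent subtasks out in parallel, then merge the results."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(run_subagent, tasks)  # preserves task order
    return dict(zip(tasks, results))
```

The orchestrator is the only place where the results meet — exactly the “shared understanding” role named above.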
38 / 48
Multi-Agent Decision
When to Spawn Subagents
1
The task is decomposable
There are genuinely independent pieces that can be worked on separately
2
The pieces are large enough
Doing them sequentially would waste significant time
3
Coordination cost < parallelism gain
Every agent you add is overhead. If the task is very short, spawning agents just slows you down. Same calculation a manager makes when deciding whether to delegate.
39 / 48
Evaluation & Testing
Verifying agent work and ensuring quality
40 / 48
Testing Methodology
Intro to Test-Driven Development (TDD)
What is TDD?
Test-Driven Development is a software practice where you write failing tests before writing the code
that makes them pass.
Red → Green → Refactor
Red: Write a test that fails (no code exists).
Green: Write just enough code to pass.
Refactor: Clean up the code.
Why it works for AI
Clearly defines the success condition.
Prevents AI from testing only what it built.
Forces you to plan before execution.
41 / 48
Test-Driven Development
Write the Test First
# Tell Claude: "make this pass"
def test_forecast_returns_five_days():
    result = get_forecast("Seattle")
    assert len(result) == 5
    assert "high" in result[0]
    assert "low" in result[0]
When you hand Claude this test and say “make this pass,” you’ve defined the contract. You know exactly what you’re checking.
When Claude generates tests after the fact, it tends to test what it built — not what you needed.
42 / 48
Test-Driven Development
Green & Refactor
1. Green (Pass the test)
def get_forecast(city: str):
    # Hardcoded to pass the assertion
    # Just enough code to be "green"
    return [
        {"high": 70, "low": 50},
        {"high": 72, "low": 51},
        {"high": 68, "low": 49},
        {"high": 65, "low": 48},
        {"high": 75, "low": 55},
    ]
2. Refactor (Make it right)
def get_forecast(city: str):
    # Now integrate the real API
    # The test ensures we don't break the shape
    data = fetch_weather_api(city)
    return [
        {"high": day.max_temp, "low": day.min_temp}
        for day in data.days[:5]
    ]
With the test in place, the AI can safely refactor from a mock implementation to real API integration without
losing the required output structure.
43 / 48
Test-Driven Development
Why TDD?
1
Catches bugs before they ship
A failing test written first is a bug prevented, not a bug fixed.
# Without TDD — this edge case slips through
def validate_email(email: str) -> bool:
    return "@" in email  # validate_email("") returns False... but what about " "?

# With TDD — you catch it before writing the function
def test_rejects_blank_email():
    assert validate_email("") == False
    assert validate_email(" ") == False  # this test forces you to handle whitespace
2
Acts as living documentation
Tests describe what the code should do. When requirements change, update the test first. The code
follows.
3
Enables fearless refactoring
With tests in place, you can restructure code knowing immediately if you broke something. Especially important
when AI rewrites your code.
TDD is insurance. The premium is small; the payout when things go wrong is enormous.
44 / 48
Test-Driven Development
When to Use TDD
Use TDD When
Business logic with defined inputs/outputs e.g. calculate_discount(price, tier)
Data transformations Parsing CSV, cleaning API responses
Bug fixes — write a test that reproduces the bug first
AI-generated code — define the contract before the agent writes
Example: Bug Fix with TDD
# Step 1: Write a test that reproduces the bug
def test_handles_negative_price():
    result = calculate_discount(-10, "gold")
    assert result == 0  # Negative price should return 0

# Step 2: Run it — it FAILS (bug confirmed!)

# Step 3: Fix the function
def calculate_discount(price: float, tier: str) -> float:
    if price <= 0:
        return 0
    discount = TIER_RATES.get(tier, 0)  # TIER_RATES maps tier name -> rate, defined elsewhere
    return round(price * (1 - discount), 2)
If you can describe the expected behavior in one sentence, you can write the test first.
45 / 48
Test-Driven Development
When TDD Is Not Ideal
Skip TDD When
Exploratory prototyping — you don’t know what you’re building yet
UI layout and visual design — hard to assert pixel positions
One-off scripts or throwaway code
External API integration where the response shape is unknown
Instead, Do This
Prototype first, then add tests once the design stabilizes
Use manual testing or screenshot testing for UI
For API integration: write integration tests after you understand the response shape
Remember the 40/20/40 rule — testing always happens,
TDD is just one approach
TDD is a tool, not a religion. The goal is confidence in your code — pick the approach that gets you there.
46 / 48
Verification
AI-Generated Code Checklist
Does it do what I actually asked?
Did I read every file it changed?
Are there hardcoded values I need to replace?
Does it handle edge cases — empty input, API errors, null values?
Could I explain this code to a teammate?
“If you cannot explain it simply, you do not understand it. And if you do not understand it, you do not own it.”
Discussion
Turn to your neighbor. Apply this checklist to the TDD demo code from earlier. How many items pass? Which ones
fail? Be ready to share one finding.
47 / 48
Up Next in Lab
Preview: Interview with Jason
In Lab 2 you will start with a staff interview. Our guest is Jason Evans, Academic Student Counselor (ASC).
Jason handles course petition syllabus reviews.
What to capture: decision points and branches in the interviewee’s workflow
(your if-then flowchart starts here); exact phrases the interviewee uses; and emotional journey moments
— frustration peaks, delight, and “it depends” zones — from the lab guide.
Afterward: your own notes, one problem-statement sentence (“When [people] need to [task]…”),
and a color-coded If-Then flowchart.