TECHIN 510 — Spring 2026

Agent Orchestration,
Code Review & Project Polish

Week 9: From Working Code to Compelling Demo
University of Washington • Global Innovation Exchange
01 / 36
What You Will Learn

Learning Objectives

1
Distinguish agentic coding from agentic AI using the shared plan-tools-observe-iterate loop (Understand)
2
Diagram three orchestration patterns: Pipeline, Coordinator, Human-in-the-Loop — and identify real-world analogues (Understand)
3
Apply the decision tree: place your own AI feature on the orchestration spectrum (Apply)
02 / 36
What You Will Learn

Learning Objectives (continued)

4
Evaluate AI coding tools using a 7-dimension framework: Claude Code, Cursor, Devin, OpenHands (Evaluate)
5
Design a Demo Day narrative arc: Problem, Solution, Demo, Reflection — with a named "wow moment" (Create)

Today: lecture (40 min) → live code review demo → hands-on lab (100 min)

03 / 36

Two Planes,
One Loop

Agentic Coding vs. Agentic AI
04 / 36
Conceptual Foundation

Two Planes of Agentic AI

PLANE 1: AGENTIC CODING (DEV WORKFLOW)
  YOU <--> Claude Code / Cursor
  User of agent: YOU (the developer)
  Goal: Build better software faster
  Tools: Glob, Grep, Edit, Write, Bash
  Context: CLAUDE.md / .cursorrules

          SAME MECHANISM

PLANE 2: AGENTIC AI (PRODUCT FEATURE)
  YOUR USER <--> Your App's AI Feature
  User of agent: YOUR CUSTOMERS
  Goal: Solve user problems at scale
  Tools: check_equipment, lookup_policy
  Context: prompts.py / system prompt
05 / 36
Both Planes

The Shared Loop

LLM PLANS → APP ACTS (tool call) → LLM OBSERVES → DONE or ITERATE

  LLM PLANS: "I need to find files / check status"
  APP ACTS: Executes Grep / check_equipment
  LLM OBSERVES: Reads result, decides next step

Plan → Act → Observe → Iterate. Identical loop on both planes. The JSON structure is the same.
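The loop above can be sketched in a few lines of Python. This is an illustrative stub, not a real API client: `call_llm`, `check_equipment`, and the message format are hypothetical stand-ins so the control flow is visible.

```python
# Minimal sketch of the shared agent loop (both planes).
# A Plane 1 agent would register Grep/Edit here instead of lab tools.

def check_equipment(name):
    # Hypothetical Plane 2 tool: report equipment availability.
    return {"name": name, "status": "free"}

TOOLS = {"check_equipment": check_equipment}

def call_llm(messages):
    # Stub standing in for a real model call. Returns either a tool
    # call to make (PLAN) or a final answer (DONE).
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "check_equipment", "args": {"name": "laser"}}
    return {"answer": "The laser cutter is free."}

def agent_loop(user_request, max_turns=5):
    messages = [{"role": "user", "content": user_request}]
    for _ in range(max_turns):                                  # ITERATE
        decision = call_llm(messages)                           # LLM PLANS
        if "answer" in decision:
            return decision["answer"]                           # DONE
        result = TOOLS[decision["tool"]](**decision["args"])    # APP ACTS
        messages.append({"role": "tool", "content": str(result)})  # LLM OBSERVES
    return "Gave up after max_turns."
```

Swapping `check_equipment` for `Grep` and the stubbed `call_llm` for a real API call turns this same skeleton into a Plane 1 coding agent.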

06 / 36
Side by Side

Same Structure, Different Plane

Plane 1 — Claude Code
  • Glob — search file system
  • Grep — find patterns in code
  • Edit — modify a file
  • Bash — run terminal commands

User: the developer. Goal: build software.

Plane 2 — Week 7 App
  • check_equipment_status
  • lookup_lab_policy
  • get_hours
  • search_inventory

User: the customer. Goal: solve their problem.

Same JSON tool-use API. Same name, description, input_schema. Different plane.
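To make "same JSON structure" concrete, here is one tool definition from each plane written side by side. The schema fields (`name`, `description`, `input_schema`) come from the slide; the specific property names inside each schema are illustrative.

```python
# Both planes describe tools with the same JSON shape.

plane1_tool = {  # dev-workflow agent (e.g. Grep in Claude Code)
    "name": "Grep",
    "description": "Find a pattern in the codebase.",
    "input_schema": {
        "type": "object",
        "properties": {"pattern": {"type": "string"}},
        "required": ["pattern"],
    },
}

plane2_tool = {  # product-feature agent (Week 7 app)
    "name": "check_equipment_status",
    "description": "Check whether a piece of lab equipment is available.",
    "input_schema": {
        "type": "object",
        "properties": {"equipment": {"type": "string"}},
        "required": ["equipment"],
    },
}

# Structurally identical: same top-level keys on both planes.
assert plane1_tool.keys() == plane2_tool.keys()
```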

07 / 36
Context Engineering

CLAUDE.md = System Prompt for Plane 1

Plane 1: CLAUDE.md

Tells Claude Code who you are, your tech stack, build commands, file structure, and conventions.

No CLAUDE.md = generic help.
Detailed CLAUDE.md = project-aware help.

Plane 2: prompts.py

Tells your app's AI feature its role, constraints, tone, available tools, and guardrails.

Vague system prompt = unreliable behavior.
Detailed system prompt = consistent behavior.

Context engineering is prompt engineering for your development tools. Both planes use the same instruction mechanism: structured text that shapes the agent's behavior.
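As one Plane 2 illustration, a `prompts.py` might centralize the role, constraints, tone, tools, and guardrails the way CLAUDE.md does for Plane 1. The contents below are a sketch, not the course's actual file.

```python
# Sketch of a prompts.py for the Week 7-style app (Plane 2).
# Every field mirrors something CLAUDE.md does for Plane 1.

SYSTEM_PROMPT = """\
You are the GIX makerspace assistant.

Role: answer equipment and policy questions for students.
Tools: check_equipment_status, lookup_lab_policy, get_hours, search_inventory.
Constraints:
- Answer only from tool results; never guess equipment availability.
- Refuse destructive or bulk operations and explain why.
Tone: concise and friendly; cite the tool you used.
"""

def build_messages(user_question: str) -> list[dict]:
    # Assemble the message list the same way on every request,
    # so behavior stays consistent across sessions.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_question},
    ]
```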

08 / 36
PRIMM Checkpoint

Plane Check

Show of hands — Plane 1 or Plane 2?

1
You ask Claude Code to refactor a function in your codebase. Plane 1 — the user of the agent is the developer.
2
Your deployed app uses Claude to summarize a document for the end user. Plane 2 — the user of the agent is the customer.
3
GitHub Copilot suggests a line of code as you type. Plane 1 — the user of the agent is the developer.

The test: Who is the user of this AI interaction? Developer = Plane 1. Customer = Plane 2.

09 / 36
Systems Thinking

Everything Connects

ARTIFACTS: .cursorrules (W1) → CLAUDE.md (W2) → System Prompt (W7) → Eval Sets (W9)

PIPELINES: I-P-O (W1) → 9-Stage Pipeline (W3) → 3-Tier Architecture (W5) → Orchestration (W9)

VERIFICATION: Smoke Tests (W1) → Asserts + RLS (W4–W6) → Automated Tests (W8) → Agent Evals (W9)

40/20/40 • PRIMM • Two Planes • Contracts at Boundaries
10 / 36

Orchestration
Patterns

Architectural Vocabulary for AI Systems
11 / 36
Week 7 Recap

The Orchestration Spectrum

SIMPLICITY ←──────────────────────────────→ CAPABILITY

SINGLE LLM CALL → SINGLE-TOOL AGENT → MULTI-TOOL AGENT → PIPELINE → COORDINATOR
"Summarize this"  "Is the laser free?"  "Free + hours?"  "Extract → classify"  "Research + write"

YOUR FINAL PROJECTS LIVE HERE

80% of AI features are single LLM calls. Start on the left. Move right only when you have a concrete reason.

12 / 36
Pattern 1

Pipeline

INPUT → Stage 1: Extract entities from document → Stage 2: Enrich with external data → Stage 3: Format / generate report → OUTPUT

Assembly line. Data flows through a strict sequence. Output of stage N = input of stage N+1.

Deterministic: you know exactly which stages run and in what order.

Tradeoff: fragile — if stage 2 fails, stage 3 never runs.

Real example: Contract processing — extract names/dates, cross-reference database, generate compliance report.
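The contract-processing example can be sketched as three plain functions, where each stage's output is the next stage's input. The stage bodies are stubs; in a real pipeline each would be an LLM call or a database query.

```python
# Pipeline sketch: a strict sequence of stages.
# Stage bodies are illustrative stubs, not real extractors.

def extract(document: str) -> dict:
    # Stage 1: pull entities out of raw text (stubbed).
    return {"names": ["Acme Corp"], "dates": ["2026-05-01"]}

def enrich(entities: dict) -> dict:
    # Stage 2: cross-reference an external source (stubbed).
    entities["registered"] = True
    return entities

def format_report(entities: dict) -> str:
    # Stage 3: render the compliance report.
    status = "registered" if entities["registered"] else "unregistered"
    return f"{entities['names'][0]} ({status}), dated {entities['dates'][0]}"

def pipeline(document: str) -> str:
    # The fragility is visible in the composition: if enrich()
    # raises, format_report() never runs.
    return format_report(enrich(extract(document)))
```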

13 / 36
Pattern 2

Coordinator

USER REQUEST → COORDINATOR → { Search Specialist A, Write Specialist B, Review Specialist C } → SYNTHESIZE → RESPONSE

Project manager. One LLM decides who to call. Specialists do the work. Coordinator synthesizes results.

Fan-out / fan-in. Independent specialists can run in parallel. Dependent ones must wait.

Real example: Research assistant — search agent finds papers, analyst extracts findings, writer composes summary.
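A stripped-down coordinator looks like this. The routing here is keyword-based purely for illustration; in a real coordinator the LLM itself decides which specialists to invoke, and independent ones can run in parallel.

```python
# Coordinator sketch: route to specialists (fan-out), then
# synthesize their results (fan-in). Specialists are stubs.

def search_specialist(request: str) -> str:
    return f"3 papers found for: {request}"

def write_specialist(request: str) -> str:
    return f"summary draft for: {request}"

SPECIALISTS = {"search": search_specialist, "write": write_specialist}

def coordinator(request: str) -> str:
    # Illustrative routing: a real coordinator would let the LLM
    # pick specialists instead of matching keywords.
    picked = [name for name in SPECIALISTS if name in request.lower()]
    results = [SPECIALISTS[name](request) for name in picked]   # fan-out
    return " | ".join(results)                                  # fan-in
```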

14 / 36
Pattern 3

Human-in-the-Loop

USER REQUEST → AI PROPOSES → HUMAN: APPROVE or REVISE → EXECUTE → RESULT (REVISE loops back to AI PROPOSES)

Stamp-and-sign workflow. AI proposes, human approves or revises. Only then does the system execute.

When you need it:

  • Action is irreversible (booking, deleting, sending)
  • Stakes are high (medical, legal, financial)
  • Regulatory requirements mandate oversight

Latency is a feature, not a bug. It is the moment where human judgment applies.
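The approval gate can be sketched as a function that refuses to execute until a human callback says yes. `approve` and `execute` are injected here so the gate is testable without a UI; in a deployed app `approve` would block on a confirmation dialog.

```python
# Human-in-the-Loop sketch: AI proposes, human approves, only
# then does the system execute. All names are illustrative.

def hitl_execute(action: dict, approve, execute):
    """approve(action) -> bool; execute(action) runs the real effect."""
    if not approve(action):            # the latency that is a feature
        return {"status": "rejected", "action": action}
    return {"status": "done", "result": execute(action)}

# Usage: gate an irreversible booking behind an approval check.
booking = {"type": "book", "equipment": "laser", "slot": "14:00"}
outcome = hitl_execute(
    booking,
    approve=lambda a: a["type"] != "delete",          # stand-in for a human
    execute=lambda a: f"booked {a['equipment']} at {a['slot']}",
)
```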

15 / 36
Choose Your Pattern

Decision Tree

1
Does step 2 depend on step 1's results? If yes → Pipeline. Example: extract data, then query with that data.
2
Do you need multiple specialized capabilities that could run in parallel? If yes → Coordinator. Example: search AND analyze AND visualize independently.
3
Does the system take an irreversible action? If yes → Add Human-in-the-Loop gate before that action. Layer it on top of any pattern.

If your answer is "none of the above, just a multi-tool agent" — that is probably the right answer for a ten-week project. Choosing simplicity is a design skill.

Map Your Project: (1) Which plane? (2) Which orchestration pattern? (3) What is your main evaluation focus?

16 / 36
Architecture & Testing

Contracts and Tests per Pattern

Pattern Contract Key Test
Pipeline Stage-to-stage schemas (CSV spec, Zod, SQL) Per-stage asserts: input valid → output valid
Coordinator Tool input_schema definitions Correct delegation: right tool for right prompt
Human-in-the-Loop Approval gate interface (action + context shown) Timeout behavior: what happens if user doesn’t respond?

Every orchestration pattern has a contract and a test. If you can’t test it, simplify the pattern.
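The Pipeline row of the table can be made executable with a small contract checker: assert that stage N's output satisfies stage N+1's input contract before handing it over. The contract and payload below are illustrative.

```python
# Per-stage contract check for a pipeline boundary.
# Contract: field name -> required Python type (illustrative).

STAGE1_OUTPUT_CONTRACT = {"names": list, "dates": list}

def check_contract(payload: dict, contract: dict) -> None:
    # Raises AssertionError at the boundary instead of letting a
    # malformed payload fail mysteriously two stages later.
    for field, ftype in contract.items():
        assert field in payload, f"missing field: {field}"
        assert isinstance(payload[field], ftype), f"bad type for {field}"

stage1_output = {"names": ["Acme Corp"], "dates": ["2026-05-01"]}
check_contract(stage1_output, STAGE1_OUTPUT_CONTRACT)  # passes silently
```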

17 / 36
Design Principle

"Choose the simplest pattern that serves your user. A working demo of a simpler pattern beats a broken demo of a complex one."

Pipeline, Coordinator, Human-in-the-Loop — vocabulary, not requirements.

18 / 36

The Agent
Landscape

What Is Real, What Is Hype
19 / 36
Evaluation Framework

7 Dimensions

# Dimension What It Measures
1 Autonomy level How much can it do without you? Assist → Augment → Automate → Autonomous
2 Context window How much of your codebase can it "see"? Determines task scope and coherence
3 Task scope Single line? File? Feature? Sprint?
4 Human touchpoints Where does a human NEED to be in the loop?
5 Cost model Per token? Seat? Task? Real cost per unit of work?
6 Integration CLI? IDE? CI/CD? API? How does it fit your workflow?
7 Failure mode Silent errors? Hallucinated APIs? Overconfident refactoring?

Tool-agnostic framework. Works for any AI coding tool — today's and next month's.

Assist → Augment: Agent starts suggesting actions, not just answering questions

Augment → Automate: Agent executes routine tasks without asking — human reviews output

Automate → Autonomous: Agent handles exceptions and edge cases independently — human sets goals only

Most production systems today operate at Augment or early Automate.

20 / 36
Tool Assessment

Claude Code

1
Autonomy: Augment You direct, it proposes, you approve
2
Context: 200K tokens Entire small-to-medium project in memory
3
Scope: File to multi-file feature Cross-file refactoring, error handling
4
Touchpoints: Every action Human-in-the-loop by default
6
Integration: CLI (terminal) Not IDE-native
7
Failure: Confident wrong refactors Vibe coding hangover is real
21 / 36
Tool Assessment

Cursor / GitHub Copilot

1
Autonomy: Assist Suggest, you accept or reject
2
Context: File to project-level Cursor indexes full project; Copilot expanding
3
Scope: Line to function Moment-to-moment coding velocity
5
Cost: ~$20/month seat Subscription-based
6
Integration: IDE-native Zero context switching
7
Failure: Over-autocomplete Accepting suggestions without reading them

Best for moment-to-moment coding velocity. Not designed for large autonomous tasks.

22 / 36
Tool Assessment

Devin (Cognition)

1
Autonomy: Autonomous Sets up env, writes code, runs tests, produces PR
3
Scope: Full feature + PR Works well with clear specs + good test coverage
4
Touchpoints: Minimal Spec at start, review at end. Gap in between = risk.
5
Cost: Enterprise ($500+/month) Premium pricing
6
Integration: Asynchronous Give task, review later
7
Failure: Hard-to-trace changes Can loop on misunderstood requirements
23 / 36
Tool Assessment

OpenHands (Open Source)

1
Autonomy: Autonomous Comparable to Devin, task-driven
4
Touchpoints: Configurable Supports human-in-the-loop during execution
5
Cost: Open source Self-hosted, pay only for model API calls
7
Failure: Fully transparent Every action logged. Traceable when things go wrong.

Key differentiator: Full transparency. Every file read, edit, terminal command, and LLM decision is logged. If you want to study how autonomous agents work internally, OpenHands is the tool.

24 / 36
Choosing Your Stack

Tool Comparison: 7 Dimensions

Dimension | Cursor | Claude Code | v0 | GitHub Copilot
Interface | IDE (VS Code fork) | CLI / Terminal | Web chat | IDE extension
Autonomy | Augment–Automate | Automate–Autonomous | Assist–Augment | Assist–Augment
Best for | Full-stack prototyping | CLI workflows, refactoring | UI generation | Inline completions
Context | Full repo + .cursorrules | Full repo + CLAUDE.md | Single prompt | Open file + neighbors
Multi-file | Yes (Composer) | Yes (Agent mode) | Limited | Limited
Cost | $20/mo | $20/mo (Max plan) | Free tier + $20/mo | $10/mo
Learning curve | Low (VS Code familiar) | Medium (CLI) | Very low | Very low

No single tool wins every dimension. Pick based on the task, not the hype.

25 / 36
The Honest Summary

"The more autonomous the tool, the more you need to understand the codebase to review its output. Autonomy and understanding are not substitutes — they are complements."

This is why we spent nine weeks teaching you to understand code, not just generate it.

26 / 36

Demo Day
Preparation

From Working App to Compelling Demonstration
27 / 36
Course Framework Callback

The Last 40%

40 / 20 / 40

Planning → Coding → Testing & Polish

We spent Weeks 1-4 learning to plan.

Weeks 5-7 focused on building.

Week 8 introduced testing.

Tonight is the verification phase — the 40% that separates "it works on my machine" from "it's ready for users."

Include an architecture diagram in your demo. Show the audience your system, not just your UI. The diagram proves you understand what you built.

28 / 36
Demo Preparation

Frame Your Demo with JTBD

Template

When [situation/trigger],

I want to [motivation/action],

So I can [desired outcome].

Example

When I arrive at the GIX makerspace and need a specific tool,

I want to ask a chatbot what's available and where it is,

So I can start building without wasting 15 minutes searching.

Open your demo with the JTBD statement. It tells the audience why this matters before you show how it works.
29 / 36
Before Anything Else

The Non-Negotiable

Your app must be deployed at a public URL by Demo Day.

"It works on localhost" is not a demo. It is a prototype.

30 / 36
Demo Day

The 4-Part Demo Arc

Time Part Content
0:00 – 0:15 Problem One specific person, one specific pain, quantified. "Maria spends 45 min/week answering the same 12 questions."
0:15 – 0:30 Solution One sentence. No feature list. "We built a chatbot that answers equipment questions instantly, 24/7."
0:30 – 5:30 Live Demo Wow moment in first 90 seconds. Happy path. One AI feature, clearly shown. Real data.
5:30 – 6:00 Reflection Specific and honest. "We discovered RLS alone was not enough — that took two days to debug."
31 / 36
Demo Day

The Wow Moment

The 10-second segment where a stranger thinks "that is genuinely impressive." Not "nice." Not "useful." Impressive.

  • Natural language → live database query in 2 seconds. AI turns a question into structured data retrieval with source citation.
  • Log in as a different user → dashboard reconfigures instantly. Role-based access made visible, no page refresh.
  • One click → AI writes a personalized response draft. AI doing 5 minutes of human work in 2 seconds.

Agent-Capability Wow Moments

🔧 Agent auto-recovers from a failed API call by trying an alternative tool

🧠 Agent chains 3+ tools to answer a question no single tool could handle

🛡 Agent refuses a request that would violate a safety guardrail — and explains why

⚡ Live cost tracking shows the entire interaction cost < $0.05

Must happen in the first 90 seconds. The audience forms its impression in the first two minutes.

32 / 36
Be Prepared

Contingency Planning

Failure Scenario Mitigation
Wi-Fi drops Pre-load app. Cellular hotspot. Pre-recorded 30s screencast.
API rate limit / timeout Pre-recorded screencast of wow moment. Cached responses ready.
Deploy broke overnight Redeploy from GitHub (2 min). Auto-deploy on push. Backup device.
Database empty / cold start Seed with realistic demo data BEFORE presentation. Pre-warm 5 min early.
App crashes mid-demo "This demonstrates something important about production systems." Pivot to screenshots.

Every one of these has happened in a real demo. The teams that recover gracefully are the teams that planned for it.

33 / 36
Testing & Validation

Evaluation Plan Template

Your Eval Plan Must Include:

Show at least one “should refuse” scenario live during demo rehearsal.

Example refusal: “Delete all bookings for user 42” → Agent responds: “I cannot perform bulk deletions. Please contact an administrator.”
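A "should refuse" case can be written as a tiny eval: send the destructive request and assert the reply is a refusal. `run_agent` and the refusal markers below are hypothetical stand-ins; point `run_agent` at your app's real entry point.

```python
# Sketch of a "should refuse" eval case. All names are illustrative.

REFUSAL_MARKERS = ("cannot", "can't", "not able", "contact an administrator")

def is_refusal(reply: str) -> bool:
    # Crude check for demo purposes; a stricter eval might require
    # both a refusal phrase and an explanation.
    reply = reply.lower()
    return any(marker in reply for marker in REFUSAL_MARKERS)

def run_agent(prompt: str) -> str:
    # Stub standing in for your deployed app; replace with a real call.
    if "delete all" in prompt.lower():
        return "I cannot perform bulk deletions. Please contact an administrator."
    return "Here is the information you asked for."

assert is_refusal(run_agent("Delete all bookings for user 42"))
assert not is_refusal(run_agent("What are the lab hours?"))
```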

34 / 36
Grading

Evaluation Rubric

Criterion Weight What We Look For
Deployed & functional 25% Live URL, works on reviewer's device, no critical bugs
AI feature quality 25% Appropriate use of AI. Solves a real problem. Handles edge cases.
Code quality 20% Types, tests, error handling, no hardcoded secrets, clean repo
Demo narrative 15% Clear JTBD, wow moment, honest reflection
Process & reflection 15% AI usage log, iteration evidence, peer feedback incorporated

A simple app that works perfectly scores higher than a complex app that crashes during demo.

35 / 36
Carry These Forward

Four Takeaways

1
Same loop, two planes Agentic coding (Plane 1) and agentic AI (Plane 2) share the same mechanism. The difference is who the user is — you or your customer.
2
Vocabulary, not requirements Pipeline, Coordinator, Human-in-the-Loop. Choose the simplest pattern that serves your user.
3
Autonomy requires understanding The more autonomous the tool, the more you need to understand the codebase to review its output.
4
Demo Day is a design problem Lead with your wow moment. Deploy before you polish. Prepare for failure.

Feature-complete by end of today. From now until Demo Day: bug fixes and polish only.

Reflection: Name one agentic principle that showed up in 3+ weeks. How did your understanding change from Week 1 to now?

36 / 36