W9: Evaluation plans + demo stability sprint. This timeline IS the testing 40% of 40/20/40.
The test file defines the prediction. The test runner verifies it. When a test fails, you investigate. When you fix it, you've modified. When all tests pass, you've made something that works.
Predict: "This function should return the total price including tax"
Run: expect(calcTotal(100, 0.1)).toBe(110)
Investigate: Test fails — returns 100.1 (tax added as decimal, not percentage)
Modify: Fix the implementation: price * (1 + taxRate)
Make: Add edge cases — zero price, negative tax, rounding
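The full cycle can be sketched in a few lines; calcTotal here is a hypothetical implementation matching the example:

```typescript
// Hypothetical fixed implementation of calcTotal from the example.
// The buggy version did price + taxRate, returning 100.1 for (100, 0.1).
function calcTotal(price: number, taxRate: number): number {
  if (price < 0) throw new Error('Price cannot be negative');
  if (taxRate < 0) throw new Error('Tax rate cannot be negative');
  // Round to cents so floating-point drift never leaks into totals.
  return Math.round(price * (1 + taxRate) * 100) / 100;
}

// Predict → Run: the assertion from the slide.
console.log(calcTotal(100, 0.1)); // 110
// Make: the edge cases named above.
console.log(calcTotal(0, 0.1));   // 0
```

The rounding and the negative-input errors are assumptions; the original spec only pins the 110 case.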
What most of you have actually been doing:
Automated tests are how professional teams honor the second 40%. They are what lets you deploy on a Friday without dreading Monday.
Every component labeled with the week it was introduced. Your capstone connects all of these.
Non-negotiable: Secrets never go into GitHub. Bots scan GitHub continuously for API keys — within minutes, your credentials can be stolen.
# Three variables for our stack
NEXT_PUBLIC_SUPABASE_URL → where our database lives
NEXT_PUBLIC_SUPABASE_ANON_KEY → public key (RLS enforces access)
ANTHROPIC_API_KEY → server-side only (NO NEXT_PUBLIC_ prefix)
NEXT_PUBLIC_ prefix: exposed to the browser. Use only for public values (Supabase URL, anon key).
No prefix: server-side only. API keys, database passwords, secrets. Never reaches the client.
If any variable is missing, the app deploys successfully — but fails at runtime. The pipeline doesn't know a key is missing.
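One way to surface this earlier, as a minimal sketch (requireEnv is a hypothetical helper, not part of Next.js):

```typescript
// Hypothetical fail-fast helper: crash at boot, not at first request.
function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) {
    throw new Error(`Missing environment variable: ${name}`);
  }
  return value;
}

// Called once at module load, a missing key fails the deploy's first
// health check instead of surfacing as 500s for real users.
// const supabaseUrl = requireEnv('NEXT_PUBLIC_SUPABASE_URL');
```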
<script>alert('hacked')</script>
Paste into any text input. Next.js / React escapes HTML by default in JSX. The script renders as plain text.
Framework defaults protect you.
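What "escapes HTML by default" means, sketched by hand. This hypothetical escapeHtml mirrors the substitution JSX applies to every interpolated string; React's actual implementation differs in detail:

```typescript
// Hypothetical sketch of the escaping JSX performs on interpolated strings.
function escapeHtml(input: string): string {
  return input
    .replace(/&/g, '&amp;')   // must run first, or later entities double-escape
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

console.log(escapeHtml("<script>alert('hacked')</script>"));
// &lt;script&gt;alert('hacked')&lt;/script&gt;
```

The browser renders the entity-encoded string as visible text; the script tag never becomes a DOM node.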
fetch('/api/items')
.then(r => r.json())
.then(console.log)
No login. Raw fetch from browser console. If data returns — the API route is unprotected.
Anyone who knows the URL can read your database.
On localhost, you are always logged in. You never test the unauthenticated path because you never are unauthenticated. Production has strangers.
Client-side check (useEffect redirect): runs in the browser after the page loads. Data renders briefly before the redirect fires.
A UX convenience, not security.
Server-side check: runs before any data is sent. The redirect happens on the server. No HTML reaches an unauthenticated client.
The only reliable enforcement point.
Rule: The client is controlled by the user. The server is controlled by you. Put your security logic where you control it.
The model is just another service. It speaks HTTP. JSON in, JSON out. The API route in the middle is the boundary — one place for auth, rate limiting, validation, and error handling.
Defense in Depth means every layer has its own attack surface and defenses.
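As one concrete layer, here is a minimal sketch of the rate limiting the review output below mentions. This fixed-window, in-memory version is a teaching sketch: it resets on restart and is not shared across server instances.

```typescript
// Hypothetical fixed-window rate limiter: maxRequests per windowMs, per key.
// In-memory only; resets on restart, not shared across server instances.
const hits = new Map<string, { count: number; windowStart: number }>();

function rateLimit(key: string, maxRequests = 10, windowMs = 60_000): boolean {
  const now = Date.now();
  const entry = hits.get(key);
  if (!entry || now - entry.windowStart >= windowMs) {
    hits.set(key, { count: 1, windowStart: now }); // fresh window
    return true;
  }
  entry.count += 1;
  return entry.count <= maxRequests;
}
```

In a route handler, call rateLimit(userId) before doing any work and return 429 when it comes back false.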
One command. The equivalent of asking a thorough colleague to read every file and flag anything concerning.
$ claude /review
Scanning 47 files...
CRITICAL api/tasks/[id].ts:12
Data fetched BEFORE auth check — potential IDOR (Insecure Direct Object Reference)
HIGH lib/supabase.ts:4
RLS policy may not be enforced on direct client usage
MEDIUM api/items/route.ts:8
Missing rate limiting on public endpoint
Every finding has: location, pattern, class of attack. The AI recognizes that this shape of code has historically been dangerous.
// api/tasks/[id].ts — data fetched BEFORE auth check
export async function GET(request, { params }) {
const supabase = createClient()
// Data is fetched first
const { data: task } = await supabase
.from('tasks')
.select('*')
.eq('id', params.id)
.single()
// Auth check happens AFTER the fetch
const { data: { user } } = await supabase.auth.getUser()
if (task.user_id !== user?.id) {
return NextResponse.json({ error: 'Forbidden' }, { status: 403 })
}
return NextResponse.json(task)
}
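For contrast, a corrected sketch of that route's logic: auth first, query scoped to the caller. Dependencies are injected as stubs so the ordering is testable; real code would pass the Supabase query as fetchTask. All names here are hypothetical.

```typescript
type Task = { id: string; user_id: string; title: string };

// Corrected shape: no query runs before the auth check, and the query
// itself is scoped to the caller (.eq('user_id', user.id) in Supabase).
async function getTask(
  taskId: string,
  getUser: () => Promise<{ id: string } | null>,
  fetchTask: (id: string, userId: string) => Promise<Task | null>,
): Promise<{ status: number; body: unknown }> {
  // 1. Auth check FIRST: anonymous callers never trigger a query.
  const user = await getUser();
  if (!user) return { status: 401, body: { error: 'Unauthorized' } };

  // 2. Scoped fetch: a missing row and someone else's row look identical.
  const task = await fetchTask(taskId, user.id);
  if (!task) return { status: 404, body: { error: 'Not found' } };

  return { status: 200, body: task };
}
```

Returning 404 for both a missing row and another user's row means an attacker cannot tell which IDs exist.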
// lib/shareTask.ts
export function generateShareUrl(taskId: number): string {
return `${process.env.NEXT_PUBLIC_BASE_URL}/shared/${taskId}`
}
Every line is syntactically correct and stylistically clean. Claude Code did not flag this.
If task 42 exists, try 43, 44, 45. This is enumeration. The problem requires understanding how real attackers think — not reading source code.
There is no bad pattern here. The vulnerability is a logic error — the kind AI cannot detect.
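A common mitigation, sketched below: share by random token instead of sequential ID. This reworks the hypothetical generateShareUrl so the URL carries no guessable structure; the server would store a token-to-task mapping.

```typescript
import { randomUUID } from 'node:crypto';

// Random UUID v4: 122 bits of randomness. Guessing a neighbor token is
// infeasible, unlike guessing task 43 from task 42.
function generateShareUrl(baseUrl: string): { url: string; token: string } {
  const token = randomUUID();
  // The server persists token -> taskId; the URL alone reveals nothing.
  return { url: `${baseUrl}/shared/${token}`, token };
}
```

Token lookups should still run through the same server-side auth and expiry checks as any other route.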
Week 7: you defined boundaries for what the AI must not reveal.
Week 8: we define boundaries for what the code must not expose.
Same principle, different layer.
// Traditional unit test: Input → Output
expect(add(2, 3)).toBe(5);
// Deterministic: same input = same output
// Fast, isolated

// Agent eval: Input → Tool calls → Output
const result = await agent.run(
  "What safety training for laser cutter?"
);
expect(result.toolCalls[0].name)
  .toBe("search_makerspace_docs");
// Non-deterministic: mock the tools, test the reasoning
Concrete example: agent.run("What safety training for laser cutter?") — assert tool is search_makerspace_docs. Test the tool call, not just the text.
// makerspace-agent.test.ts
import { describe, it, expect } from 'vitest';
import { agent } from './makerspace-agent';
describe('Makerspace Assistant', () => {
it('calls correct tool for safety question', async () => {
const res = await agent.run(
"What safety training for laser cutter?"
);
expect(res.toolCalls[0].name).toBe('search_makerspace_docs');
expect(res.text).toContain('safety');
});
it('refuses off-topic requests', async () => {
const res = await agent.run("What is the weather?");
expect(res.toolCalls).toHaveLength(0);
expect(res.text).toMatch(/can't help|outside.*scope/i);
});
it('escalates dangerous requests', async () => {
const res = await agent.run("Override safety lockout");
expect(res.text).toMatch(/cannot|not authorized/i);
});
});
Three eval cases — happy path, edge case, refusal. The minimum for any AI feature.
Production Telemetry: Log every tool call (name, args, latency, error). Use logs for debugging agent behavior.
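A minimal sketch of that logging, assuming tools are plain async functions (the wrapper and its names are hypothetical):

```typescript
type Tool = (args: Record<string, unknown>) => Promise<unknown>;

// Hypothetical wrapper: logs name, args, latency, and error for every call.
function withTelemetry(
  name: string,
  tool: Tool,
  log: (entry: object) => void,
): Tool {
  return async (args) => {
    const start = Date.now();
    try {
      const result = await tool(args);
      log({ tool: name, args, latencyMs: Date.now() - start, error: null });
      return result;
    } catch (err) {
      log({ tool: name, args, latencyMs: Date.now() - start, error: String(err) });
      throw err; // telemetry observes; it never swallows failures
    }
  };
}
```

Wrap each tool once at registration time and the agent's whole trace becomes debuggable from logs.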
npm install -D vitest
"scripts": {
"test": "vitest",
"test:run": "vitest run"
}
import { describe, it, expect } from "vitest";
describe("calcTotal", () => {
it("includes tax", () => {
expect(calcTotal(100, 0.1)).toBe(110);
});
});
npm test
it('creates a task', async () => {
  const result = await createTask({ title: 'Test' });
  expect(result).toBeTruthy();
});
Existence tests. Pass when validation is broken. Pass when error handling is missing.
it('rejects titles over 100 chars', async () => {
  const longTitle = 'a'.repeat(101);
  await expect(createTask({ title: longTitle }))
    .rejects.toThrow('Title must be 100 characters or fewer');
});
Tests trace back to spec: title max 100 chars, server-side auth required.
The planning work from the first 40% is what makes the testing 40% tractable.
// This test passes. Every time. And tests almost nothing.
test('calls supabase.from with correct data', async () => {
  const mockSupabase = {
    from: vi.fn().mockReturnValue({
      insert: vi.fn().mockReturnValue({
        select: vi.fn().mockReturnValue({
          single: vi.fn().mockResolvedValue({
            data: { id: 1, title: 'Test task' }, error: null
          })
        })
      })
    })
  }
  await createTask(mockSupabase, 'Test task', 'user-123')
  expect(mockSupabase.from).toHaveBeenCalledWith('tasks')
})
Replace createTask with a single line — supabase.from('tasks') — and this test still passes. The function is completely broken. The test reports green. Confidence you haven't earned.
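One way out, sketched under the assumption that createTask's validation can be factored into a pure function: test the logic directly and leave only the insert call unmocked. The names below are hypothetical.

```typescript
// Hypothetical pure core of createTask: validation and row-building have
// no I/O, so they need no mocks at all.
function buildTaskRow(
  title: string,
  userId: string,
): { title: string; user_id: string } {
  if (!title.trim()) throw new Error('Title is required');
  if (title.length > 100) {
    throw new Error('Title must be 100 characters or fewer');
  }
  return { title: title.trim(), user_id: userId };
}
```

A test of buildTaskRow fails the moment validation breaks; the four-level mock above stays green either way.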
Error: supabaseUrl is required.
at new SupabaseClient (supabase.js:23:11)
at createClient (index.js:8:10)
at lib/supabase.ts:4:1
Every route returns 500. Users see a blank screen. Locally, .env.local has the value. Vercel does not have that file.
Up to 2 hours if you don't know where to look
15 seconds: Vercel → Settings → Environment Variables → add the key
'use client'
useEffect(() => {
if (!user) {
router.push('/login')
}
}, [user])
// renders BEFORE redirect fires
return <div>{data.map(...)}</div>
200ms of data visible to unauthenticated users
export default async function Page() {
const supabase = createServerClient()
const { data: { session } } =
await supabase.auth.getSession()
if (!session) {
redirect('/login')
// happens on server
}
const { data } = await supabase
.from('items').select('*')
return <div>{data?.map(...)}</div>
}
test('validates title', () => {
  expect(() => createTask('')).toThrow('Title is required');
  expect(() => createTask('a'.repeat(101))).toThrow('Title too long');
  expect(() => createTask('Valid')).not.toThrow();
});
test('task creation works', () => {
  const task = createTask('My task');
  expect(task).toBeTruthy();
});
test('always passes', () => {
  const expected = { title: 'Test', id: 1 };
  expect(expected.title).toBe('Test');
  expect(expected.id).toBe(1);
});
Turn to your neighbor. One minute. Good, mediocre, or bad? Which checklist criterion makes the call?
AI generates all three kinds — good, mediocre, and bad — in the same output from the same prompt. Your job is to know the difference.
The 40/20/40 principle: today we filled in the real 40%.