A spec-driven code generation sandbox.
MuggsOfCode is a spec-driven code-generation sandbox built for a high-school CS course — but the thing it actually teaches is older and bigger than the course.
The skill is precision of intent. There are three ways people write code with AI: vibe coding (“just make me a thing”), AI-assisted coding (“finish the line I’m typing”), and spec-driven development (“here is exactly what I want: build to it”). The third is the oldest by decades and the most transferable, because what it trains isn’t “use this particular AI.” It’s “express what you want clearly enough that anyone (a person, a model, or future-you) can produce the right thing from it.” That skill outlives any single model, company, or year.
Vibe coding isn’t the enemy of that — it’s the on-ramp. The tool opens by asking you to choose: start in vibecode (describe it, watch it appear) or in SDD (write the blueprint first). Vibecode is where the instinct lives; SDD is the discipline it grows into. The prompt-craft you build talking loosely to a model is the raw material of a real specification. Neither path is framed as the lesser one.
The AI is a critic, never a co-author. It asks Socratic questions about your spec, flags vague language, and checks whether your own comments match the code — but it will not pick a design choice, write a comment, or fix your code. An AI that did the work for you would quietly delete the lesson. The guidance is advisory, never coercive, and the interface stays honest about what it does and doesn’t keep.
Smallness is a feature. The model runs on a single GPU in my living room, not a frontier API. A small local model has less slack, so a vague spec produces visibly weaker output — and that gap, between what you meant and what you got, is the whole lesson. It’s also nearly free to run, which is the other half of the idea: do the cheap, verbose scaffolding here, then carry a compact package to a bigger model’s free tier for the hard part. A kid with no budget can start something genuinely ambitious by spending borrowed tokens only where they actually matter.
Nothing is saved between sessions. Close the window and the chat is gone. What you carry out — the spec, the artifact — is the only thing that lasts. That isn’t a limitation; it’s the thesis, made literal.
Chats are disposable. Specs are durable.
A spec-driven code-generation sandbox for a high-school CS course. You write a specification in markdown; a small language model running on a local GPU synthesizes a single-file HTML page from it; you read the generated code, comment what you understand, and iterate. You enter through a choice of two paths — Vibecode (describe it, watch it build) or SDD (write the blueprint first) — and the SDD path opens into a four-tab workspace.
There are three primary ways AI assists in the writing of code: vibe coding (“just make me a thing”), AI-assisted coding (“complete what I’m typing”), and spec-driven development (“here’s a precise specification — build to it”). This tool trains the third.
Spec-driven development is the oldest of the three by decades and the most transferable to real engineering. Working software has been designed from specifications since the 1960s; what changed in 2022 is that the spec→code translation got cheap enough that the spec can be a living document instead of an archived one.
The skill you’re building isn’t “use AI to write code.” It’s “express your intent precisely enough that anyone (a person, a model, or future-you) can produce the right thing from it.” That skill transfers to any AI tool, any year, any model size.
You pick a path. Vibecode is the fast, freeform first pass — describe it, watch it appear, refine by talking. SDD is the discipline that instinct grows into: write the blueprint first. Neither is framed as the lesser one.
Four panels (Project, Structure, Style, Behavior) that force you to decompose intent into the categories a webpage actually has. A START WITH dropdown loads starter templates with {{slots}} to fill in.
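To make the {{slots}} idea concrete, here is a minimal sketch of how a spec could be scanned for placeholders that are still unfilled. The slot grammar (double braces around a name) and the function name `find_unfilled_slots` are assumptions for illustration, not the tool's actual code.

```python
import re
from typing import List

# Assumed slot grammar: a name wrapped in double braces, e.g. {{topic}}.
SLOT_PATTERN = re.compile(r"\{\{\s*([^{}]+?)\s*\}\}")


def find_unfilled_slots(spec_text: str) -> List[str]:
    """Return the names of slots still present in the spec, in order, without repeats."""
    names: List[str] = []
    for match in SLOT_PATTERN.finditer(spec_text):
        name = match.group(1)
        if name not in names:
            names.append(name)
    return names


# Hypothetical starter template with two slots; one gets filled in.
template = "## Project\nA page about {{topic}} for {{audience}}.\n"
filled = template.replace("{{topic}}", "tide pools")
print(find_unfilled_slots(filled))
```

A check like this is cheap enough to run on every keystroke, which is one way a reviewer (human or model) could know which decisions are still open.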
The synthesized HTML/CSS/JS, in the same VS Code Dark+ palette you’ll meet in a real editor. It’s framed as a compiled artifact: you read it and comment it, but you change the design by amending the spec, not by hand-patching the code.
The rendered page — plus a Troubleshoot box. Describe a symptom (“the fonts don’t look right”) and the AI diagnoses the cause and helps you strengthen the spec, then you re-synthesize.
Nothing is saved between sessions, so Export is the only durable memory: a self-contained .html that carries its own spec and QC inside it, the loose source files, and a compact handoff package for continuing on a bigger model.
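One way a single .html file can carry its own spec is to tuck the markdown into an inert JSON data block. The sketch below assumes that approach; the element id `embedded-spec` and both function names are hypothetical, and the real export format may differ.

```python
import json
import re
from typing import Optional

# Hypothetical id for the data block that holds the spec.
OPEN_TAG = '<script type="application/json" id="embedded-spec">'


def embed_spec(page_html: str, spec_markdown: str) -> str:
    """Return page_html with the spec stored in a non-executing JSON script block."""
    # Escaping "</" keeps a literal "</script>" in the spec from closing the block early.
    payload = json.dumps({"spec": spec_markdown}).replace("</", "<\\/")
    block = OPEN_TAG + payload + "</script>"
    head, sep, tail = page_html.rpartition("</body>")
    if sep:
        return head + block + "\n" + sep + tail
    return page_html + "\n" + block


def extract_spec(exported_html: str) -> Optional[str]:
    """Recover the embedded spec, or None if the page carries none."""
    match = re.search(re.escape(OPEN_TAG) + r"(.*?)</script>", exported_html, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))["spec"]
```

Because the block is plain JSON, the exported page still renders normally, and the spec can be read back by any later tool or by opening the file in a text editor.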
Markdown QC and Preview QC slide in from the right — lenses you open when you want to inspect your own work, never gates that block you.
Students who do their best work on this tool do something before they ever open it: they draw the page on graph paper. Top to bottom. Header here, hero there, three cards in a row, a form, a footer. The drawing doesn’t have to be neat. The point is that the hardest decisions — what goes on the page, in what order, with what hierarchy — get made physically, with a pencil, before any spec gets typed.
When you then sit at the tool, the Structure panel is mostly transcription: read what you drew, top to bottom, into a numbered list. Project is what the drawing is FOR. Style annotates the visual choices you made on paper. Behavior notes the things you implicitly drew as clickable or moving.
Page mapping turns spec-writing from intimidating abstraction into reading-aloud-from-paper.
Hit SYNTHESIZE →. The spec goes to the local model, which returns a single-file page. Export produces a self-contained .html, or a handoff package to continue on a bigger model’s free tier.
// ASK AI //
Each spec section has its own ASK AI button. Returns 3–4 Socratic questions about that section only, aware of which template you picked and of any unreplaced {{slots}}.
// REVIEW //
Critiques the full four-section spec: vague language, missing decisions, contradictions between sections, scope mismatches.
// CHECK COMMENTS //
After you add your own comments to the generated code, this checks whether each comment is valid syntax for its language context and whether it actually describes the surrounding code. It will NOT write or rewrite your comments.
// TROUBLESHOOT //
On Browser Preview, describe a symptom and the AI diagnoses the cause and asks you a guiding question first. Then, only if you ask, it proposes a spec change you insert into your own blueprint. It helps you strengthen the spec; it never patches the code.
These are curriculum guardrails enforced in the system prompts. The AI is an interviewer, critic, and verifier — never a co-author.
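The "valid syntax for its language context" half of CHECK COMMENTS can be illustrated with plain rules. The tool itself does this through the model, so the sketch below is a hypothetical rule-based stand-in, simplified to the three contexts a single-file page has. It does not attempt the second half, judging whether a comment describes the surrounding code.

```python
import re

# Simplified comment forms per context; real HTML and JS grammars have more edge cases.
BLOCK_COMMENT = re.compile(r"/\*(?:(?!\*/).)*\*/", re.DOTALL)
COMMENT_FORMS = {
    "html": [re.compile(r"<!--(?:(?!-->).)*-->", re.DOTALL)],
    "css": [BLOCK_COMMENT],
    "js": [re.compile(r"//[^\n]*"), BLOCK_COMMENT],
}


def is_valid_comment(comment: str, context: str) -> bool:
    """Return True if the comment uses a form that is legal in the given context."""
    forms = COMMENT_FORMS.get(context)
    if forms is None:
        raise ValueError("unknown context: " + context)
    text = comment.strip()
    return any(pattern.fullmatch(text) for pattern in forms)


# A JS-style line comment is fine inside <script> but not inside <style>.
print(is_valid_comment("// toggles the menu", "js"))
print(is_valid_comment("// card grid", "css"))
```

Note that the function only reports a verdict. In keeping with the guardrails, it never produces a corrected comment.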
This tool talks to a local LLM running on GPUs in a closet, not to a cloud API. Two reasons that matter: cost (a class hitting an API would burn through budget fast) and pedagogy (a small local model has less slack, so vague specifications produce visibly weaker output — the gap teaches precision).
Two cards, a 4090 and a 5070 Ti, do the work, and a request may land on either one, or both at once, depending on how many people are building in parallel.
Both run llama-server from llama.cpp, serving Qwen3-8B as the target with Qwen3-1.7B as a speculative draft. So the tool is powered by the 4090, the 5070 Ti, or both at once, whichever has capacity when you hit SYNTHESIZE.
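The "whichever has capacity" idea can be sketched as a least-busy picker. How the tool really routes requests is not described here, so the `Backend` fields, the slot counts, and the localhost URLs below are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Backend:
    name: str           # label only, e.g. "4090"
    base_url: str       # hypothetical address of one llama-server instance
    slots: int          # assumed number of requests the server handles in parallel
    in_flight: int = 0  # requests currently being served

    @property
    def free(self) -> int:
        return self.slots - self.in_flight


def pick_backend(backends: List[Backend]) -> Optional[Backend]:
    """Return the backend with the most free slots, or None if every card is full."""
    candidates = [b for b in backends if b.free > 0]
    if not candidates:
        return None
    # On a tie, max() keeps the first candidate in list order.
    return max(candidates, key=lambda b: b.free)


cards = [
    Backend("4090", "http://localhost:8080", slots=2, in_flight=2),
    Backend("5070 Ti", "http://localhost:8081", slots=2, in_flight=1),
]
chosen = pick_backend(cards)
print(chosen.name if chosen else "queue the request")
```

With one student building, every request lands on the same card. With a full class, both cards end up busy at once, which matches the behavior described above.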
Modern LLMs generate text one token at a time, each token a full forward pass — which makes large models slow.
Speculative decoding cheats that step. A small “draft” model (Qwen3-1.7B) proposes the next several tokens cheaply. The target model (Qwen3-8B) then checks them in a single pass, accepting as many as match its own prediction. If the draft was mostly right, you collect several tokens for the cost of one — typically a 1.8×–3× speedup on code generation.
It only works when the two models share a tokenizer, which is why we pair Qwen3-1.7B with Qwen3-8B rather than mixing families.
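The draft-then-verify loop is easier to see in a toy version. The sketch below implements the greedy variant only, with two hypothetical stand-in "models" that are plain Python functions over word tokens. A real server such as llama-server verifies all drafted positions in one batched forward pass and works with probability distributions; here the target is simply called once per position to keep the logic readable.

```python
from typing import Callable, List, Tuple

Token = str
Model = Callable[[List[Token]], Token]  # maps a context to its next token

TEXT = "the quick brown fox jumps over the lazy dog today".split()


def target_next(context: List[Token]) -> Token:
    """Toy target model: always continues TEXT (cycling)."""
    return TEXT[len(context) % len(TEXT)]


def draft_next(context: List[Token]) -> Token:
    """Toy draft model: agrees with the target except at every fourth position."""
    if len(context) % 4 == 3:
        return "?"
    return target_next(context)


def speculative_step(context: List[Token], draft: Model, target: Model, k: int) -> List[Token]:
    """Draft k tokens, keep the prefix the target agrees with, then add one target token."""
    proposed: List[Token] = []
    ctx = list(context)
    for _ in range(k):
        token = draft(ctx)
        proposed.append(token)
        ctx.append(token)

    accepted: List[Token] = []
    ctx = list(context)
    for token in proposed:
        expected = target(ctx)
        if token != expected:
            accepted.append(expected)  # target's own token replaces the first mismatch
            return accepted
        accepted.append(token)
        ctx.append(token)
    accepted.append(target(ctx))  # every draft token matched, so one extra token comes free
    return accepted


def generate(prompt: List[Token], draft: Model, target: Model, k: int, max_new: int) -> Tuple[List[Token], int]:
    """Generate max_new tokens and count how many verification passes it took."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft, target, k))
        passes += 1
    return out[: len(prompt) + max_new], passes


tokens, passes = generate([], draft_next, target_next, k=4, max_new=8)
print(" ".join(tokens), "| verification passes:", passes)
```

The important property is that the output is identical to what the target would have produced alone. The draft only changes how many target passes it takes to get there.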
v0.1 was the first usable version — spec writing, synthesis, code reading, comment-checking, auth and infrastructure. v0.2 (teacher-only pilot) added a five-stage loop with two mandatory QC audit stages.
v0.3 streamlined the whole tool. The horizontal loop became a lighter, tabbed workspace entered through a start screen (Vibecode or SDD). The two QC checklists became optional right-side slideouts — lenses, not gates. The code is framed as a compiled artifact: you read and comment it, but you change the design by amending the spec. The Repair / Troubleshoot loop turns a plain-language preview note into a proposed spec change you accept into your blueprint, then re-synthesize. And nothing persists between sessions — Export is the only durable memory.
Export also includes a handoff: a compact, token-lean package — your spec, the current artifact, and a continue-from-here instruction — meant to paste into a free tier of ChatGPT, Gemini, or Claude. You scaffold the project here on local hardware, then spend a bigger model’s limited free budget only on the hard part.
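As a rough illustration of the three-part handoff, the sketch below joins a spec, the current artifact, and a continue-from-here instruction into one paste-ready block. The section labels and the function name `build_handoff` are assumptions, not the tool's actual package format.

```python
FENCE = "`" * 3  # markdown code fence, built here so it does not end this example's own block


def build_handoff(spec_markdown: str, artifact_html: str, next_step: str) -> str:
    """Join the three handoff parts into one compact text block."""
    parts = [
        "## SPEC",
        spec_markdown.strip(),
        "## CURRENT ARTIFACT",
        FENCE + "html",
        artifact_html.strip(),
        FENCE,
        "## CONTINUE FROM HERE",
        next_step.strip(),
    ]
    return "\n".join(parts) + "\n"


package = build_handoff(
    "# Project\nA page about tide pools.",
    "<h1>Tide pools</h1>",
    "Add a photo gallery below the heading.",
)
print(package)
```

Keeping the package to just these three parts is what makes it token-lean: the chat history that produced the spec is left behind, and only the durable pieces travel.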
Still held for later: publish-to-GitHub for student-owned URLs, inline syntax-error markers, vision-LLM ingest of hand-drawn paper specs, and model upgrades as new open-weights releases arrive.
Built by Sean Muggivan as a teaching tool for high-school computer science. If you want access to try it, email sean@muggivanlcsw.me.