ishan.sh build-off — Qwen3.6-27B (local) vs GPT-5.5
[GENERATED - Claude Opus 4.6]
Same PLAN.md, same target site, two agents. Qwen3.6 ran locally
(unsloth/Qwen3.6-27B-GGUF:Q4_K_M via llama.cpp); GPT-5.5 ran via API.
Both produced a working Eleventy site, but the differences in correctness,
output cleanliness, and time-to-completion were stark.
TL;DR
| Qwen3.6-27B (Q4_K_M) | GPT-5.5 | |
|---|---|---|
| Wall time | ~69 min (4143s active+idle) | substantially shorter |
| Output tokens | 46,054 | n/a (remote) |
| Prompt tokens consumed | 277,568 | n/a (remote) |
| Decode rate (start → end) | 28.1 → 16.2 tok/s (-42%) | n/a |
| Visible bugs in artifact | name typo, broken SVG, broken slugs | none observed |
Template files in _includes |
5 (base, page, post, post-card, post-nav) | 3 (base, article, card) |
| Stray files in repo | README.11tydata.js, PLAN.11tydata.js |
none |
The two outputs look similar in structure, but Qwen3.6 made several small mistakes that all share a pattern: it generated plausible-looking content from imagination instead of grounding in either the plan or existing files.
Performance (llama.cpp metrics)
From agent/logs/experiments/ishandotsh-build-20260503-041404/:
- Run length: 4142s wall (~69 min). 3942s active, 200s idle.
- Output: 46,054 tokens generated, 277,568 prompt tokens processed.
- Prompt:processed-output ratio: ~6:1. That's the "kept re-reading long files" symptom — every step re-shipped a lot of context to the model rather than relying on a working memory of what it had already seen.
- Decode rate degraded over the run:
- First sample: 28.08 tok/s
- Final sample: 16.19 tok/s (-42%)
- First-half average: 23.44 tok/s
- Second-half average: 17.13 tok/s
- Min: 16.19, Max: 28.09 (across 403 ten-second samples)
This is exactly the context-window slowdown you flagged: as the KV cache fills, decode gets cheaper-per-step but slower-per-token, and the re-prompting of long files makes prompt-eval the dominant cost. By the back half of the run, throughput is below 60% of the cold-start rate.
Correctness — what each one shipped
Qwen3.6-27B mistakes
-
Name typo: "Ischan" — appears twice in
_data/site.json("author": "Ischan","description": "Personal blog and experiments by Ischan") and inabout.md("Hi, I'm Ischan."). The model hallucinated a spelling and never grounded against the directory nameishandotsh-qwen3.6or the domainishan.sh. -
Broken GitHub SVG icon — the
<path d="...">in_includes/base.njkcontains the fragment0 0.09 2 .27 1.53-1.04 2.19-.82 2.19-.82partway through. That's not valid path-command syntax (a stray0operator with bad arg count), so the icon renders as garbage. Classic generative-from- memory failure: the model "knew" what a GitHub octocat path looks like but couldn't actually reproduce it character-perfect. -
Broken
/posts/:slugroute —posts/posts.11tydata.js:permalink: (data) => { ... return `/posts/${data.slug || data.title}/`; }No slugify on the fallback. If a post lacks an explicit
slugfield, the raw title (with spaces, capitals, punctuation) gets dropped straight into the URL. GPT-5.5 handled this correctly:return `/posts/${data.slug || slugify(data.title)}/`; -
Tag slug collapse — Qwen's tag for "ML/AI" became
mlai(slash dropped entirely, no separator). GPT-5.5 producedml-ai. Likely Qwen's slugify regex strips non-alphanum without inserting a separator. -
Stray files — unexpected leftovers in the repo:
README.11tydata.js,PLAN.11tydata.js(pointless 11ty data files attached to docs that don't need front-matter or computed permalinks) that wasn't asked for. None of these are part of the plan. -
Placeholder social link —
about.mdhashttps://github.com/(no username) andhttps://x.com/(no handle). GPT-5.5 at least putgithub.com/ishan/x.com/ishanplaceholders that signal "fill this in".
GPT-5.5
No correctness bugs surfaced from filesystem inspection. Notes:
- Cleaner template factoring: 3 includes (
base,article,card) versus Qwen's 5 (base,page,post,post-card,post-nav). Qwen split things that didn't need splitting. - Bundled Fuse.js locally (passthrough-copy from
node_modules) instead of fetching from CDN at runtime — better for offline / privacy / reliability. - Has
.github/with what's presumably a Pages workflow, plus.eleventyignore. Qwen has neither. - Routing logic uses a clean
slugifyhelper exported from.eleventy.jsand reused in the data file, so post and experiment permalinks share a single implementation.
Behavioral pattern
The slow-and-error-prone failure mode for Qwen3.6 looks like one underlying cause: insufficient compression. It can't hold a useful abstraction of a file in working memory, so it re-reads files end-to- end, which:
- inflates prompt tokens (277k of them in this run),
- pushes the context window toward saturation,
- collapses tok/s as the KV cache grows,
- and still leaves the model unable to keep facts straight (your name, the SVG, the slug fallback) because it never built a stable representation in the first place.
GPT-5.5 mostly grounds against the actual files, makes one focused
edit, and moves on. The artifacts show that: smaller _includes,
fewer files, no stray docs-with-data-files, no fictional spellings.