Same model · same tokens · same answers. fak prefills the shared agent setup once and reuses it; the naive loop makes the model re-read the whole growing context every turn. Watch the wall-clock gap open — that gap is the value.
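A minimal sketch of why the gap opens, counting only the tokens each strategy sends through prefill. The function name and the parameters `setup`, `turns`, and `perTurn` are hypothetical and are not the workload letters shown on this page. The model is deliberately simplified: the context grows by a fixed number of tokens per turn, and the reuse arm processes each token exactly once.

```go
package main

import "fmt"

// prefillTokens returns how many tokens each strategy pushes through prefill
// over one session. All names here are illustrative assumptions, not fak's API.
func prefillTokens(setup, turns, perTurn int) (naive, reuse int) {
	context := setup
	for t := 0; t < turns; t++ {
		context += perTurn // the context grows by this turn's new tokens
		naive += context   // naive: re-read everything accumulated so far
	}
	reuse = setup + turns*perTurn // reuse: each token is processed once
	return naive, reuse
}

func main() {
	naive, reuse := prefillTokens(1000, 4, 100)
	fmt.Printf("naive=%d reuse=%d ratio=%.2f\n",
		naive, reuse, float64(naive)/float64(reuse))
}
```

Under this simplified model the naive total grows roughly quadratically with the number of turns, while the reuse total grows linearly.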
① Live race: fak vs naive (both arms run live, same model)
workload: P=512 T=5 C=5 D=16 R=32 → 25 requests
fak: idle · prefilled 0 · decoded 0
naive (re-prefill every turn): idle · prefilled 0 · decoded 0
② Reuse curve across the model ladder (fak arm live · naive arm projected from measured prefill cost)
Same workload, with a smaller P=128 to keep it tractable on CPU. As the model grows, the absolute minutes saved grow with it, while the ratio holds.
Legend: naive (re-prefill) · fak (reuse)
Each rung: fak runs the session live; the naive bar is projected from that model's measured prefill cost, because the naive arm re-prefills the whole context every turn and would take about an hour to run live at 3B. The A/C ratio is timing-free and model-independent: it is fixed by the session shape.
fak in-kernel engine · pure-Go Q8 forward pass · tokens are real model output (anchor-quality on the 135M reference, chat-quality on the Qwen2.5 rungs). No network, no API, all on this box.