The same model (Qwen2.5-0.5B-Instruct) served by the same fak kernel, decoded two
ways: the pure-Go CPU reference path, and the wired-in CUDA device path (fak serve --backend cuda).
No ollama on either side — this is fak's own in-kernel engine. Send a prompt and watch both decode live.
/cpu//gpu/Reference measurements (server-side decode, identical 21-token prompt): CPU ≈ 28 tok/s · GPU ≈ 99 tok/s (~3.5× faster). The tok/s shown above is end-to-end (includes prefill + the kernel's per-request weight upload to VRAM), so it reads lower than the pure decode rate. The kernel adjudicates tool calls on both paths — security sits on the hot path regardless of engine.