⚡ In-kernel engine — CPU vs GPU

The same model (Qwen2.5-0.5B-Instruct) served by the same fak kernel, decoded two ways: the pure-Go CPU reference path, and the wired-in CUDA device path (fak serve --backend cuda). No ollama on either side — this is fak's own in-kernel engine. Send a prompt and watch both decode live.

🧮 CPU

pure-Go reference · 8 vCPU · /cpu/

—

decode · tok/se2e ·· tok

🟢 GPU

CUDA HAL · NVIDIA L4 · /gpu/

—

decode · tok/se2e ·· tok

Reference measurements (server-side decode, identical 21-token prompt): CPU ≈ 28 tok/s · GPU ≈ 99 tok/s (~3.5× faster). The tok/s shown above is end-to-end (includes prefill + the kernel's per-request weight upload to VRAM), so it reads lower than the pure decode rate. The kernel adjudicates tool calls on both paths — security sits on the hot path regardless of engine.