Measured, not promised.
Conifer is v0: correctness-complete and honestly instrumented, with every kernel parity-checked against a CPU oracle. Where it is not yet ahead of llama.cpp, the bars say so; closing those gaps is the v1 story. These numbers are as measured on an Apple M3 Max, at a 512-token prompt / 128-token decode window, seed 0, batch 1. The most telling row is η: the fraction of this machine's memory bandwidth each engine actually puts to work.
[Charts: Decode throughput (higher is better); Time to first token (lower is better); Prefill throughput (higher is better); Memory-bandwidth utilization (η) (higher is better); three further Decode throughput panels (higher is better).]
conifer v0 on the merged perf/kernels engine (commit e2ae06a, 2026-05-24): fused single-pass quant prefill GEMMs + nostage decode GEMVs. Warm best-of-3, Metal backend, Apple M3 Max. llama.cpp b9110 and MLX are reference baselines measured on the identical machine and GGUF. n/a = MLX model not downloaded. η is computed on conifer's measured roofline for all engines.
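For readers who want to reproduce the "warm best-of-3" protocol, here is a minimal Python sketch. It assumes that "warm" means one discarded warm-up run before timing and that "best" means the highest tokens/s across three timed runs; the page does not spell out either detail. The names `warm_best_of` and `run`, and the injectable `clock` parameter, are hypothetical and are not part of conifer, llama.cpp, or MLX.

```python
import time
from typing import Callable


def warm_best_of(
    run: Callable[[], int],
    repeats: int = 3,
    warmups: int = 1,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return the best tokens/s over `repeats` timed runs.

    `run` is a hypothetical callable that performs one full benchmark pass
    and returns the number of tokens it produced.
    """
    # Warm-up passes are executed but never timed (assumed meaning of "warm").
    for _ in range(warmups):
        run()

    best = 0.0
    for _ in range(repeats):
        start = clock()
        n_tokens = run()
        elapsed = clock() - start
        if elapsed <= 0:
            raise ValueError("elapsed time must be positive")
        best = max(best, n_tokens / elapsed)
    return best


if __name__ == "__main__":
    def dummy_decode() -> int:
        # Stand-in workload; a real harness would call the engine here.
        sum(i * i for i in range(100_000))
        return 128  # matches the 128-token decode window

    print(f"best of 3: {warm_best_of(dummy_decode):.1f} tok/s")
```

Taking the best run rather than the mean reduces the influence of scheduler noise and thermal throttling, at the cost of reporting an optimistic figure; whichever statistic is used, it should be the same for every engine being compared.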
Same machine, same 512/128 window, batch 1. conifer: Metal backend, release build. llama.cpp: llama-bench, Metal. MLX: mlx-lm stream_generate. GGUF Q4_K_M and MLX-4bit are both ~4-bit but not bit-identical quantizations — the standard cross-format comparison. η is computed on conifer's exact bytes/token (at the model's context capacity) and measured bandwidth for every engine, so all three share one roofline. It is a relative weight-streaming utilization figure — ratio-preserving across engines, not a vendor performance claim. n/a appears wherever a value was not earned.
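The η figure described above can be sketched in a few lines of Python. This assumes the usual roofline-style definition, η = (decode tokens/s × bytes streamed per token) / measured bandwidth; the page does not give the exact formula, so treat it as an illustration. The function name and every number below are hypothetical placeholders, not measured results.

```python
def bandwidth_utilization(
    tokens_per_s: float,
    bytes_per_token: float,
    measured_bandwidth_bytes_per_s: float,
) -> float:
    """Fraction of measured memory bandwidth used to stream weights."""
    if tokens_per_s < 0 or bytes_per_token <= 0 or measured_bandwidth_bytes_per_s <= 0:
        raise ValueError("rates and sizes must be positive")
    return tokens_per_s * bytes_per_token / measured_bandwidth_bytes_per_s


# Hypothetical inputs: one shared bytes/token and one shared bandwidth
# for all engines, as the methodology note describes.
BYTES_PER_TOKEN = 4.0e9        # placeholder, not conifer's real figure
MEASURED_BANDWIDTH = 300.0e9   # placeholder, bytes per second

decode_tok_s = {"engine_a": 30.0, "engine_b": 45.0}  # placeholder rates

eta = {
    name: bandwidth_utilization(rate, BYTES_PER_TOKEN, MEASURED_BANDWIDTH)
    for name, rate in decode_tok_s.items()
}

for name, value in eta.items():
    print(f"{name}: eta = {value:.2f}")
```

Because the numerator's bytes/token and the denominator are the same for every engine, the ratio of two η values equals the ratio of the corresponding decode throughputs. That is the sense in which the figure is ratio-preserving across engines.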
Generated 2026-05-24.