benchmarks · Apple M3 Max

Measured, not promised.

Conifer is v0: correctness-complete and honestly instrumented, with every kernel parity-checked against a CPU oracle. Where it is not yet ahead of llama.cpp, the bars say so; closing those gaps is the v1 story. All numbers were measured on an Apple M3 Max with a 512-token prompt and a 128-token decode window, seed 0, batch 1. The most telling row is η, the fraction of this machine's memory bandwidth each engine actually puts to work.

Llama-3.1-8B · Q4_K_M vs MLX-4bit

Decode throughput (higher is better)

conifer 44.0 tok/s · llama.cpp 50.6 tok/s · mlx n/a

Time to first token (lower is better)

conifer 1103 ms · llama.cpp 905 ms · mlx n/a

Prefill throughput (higher is better)

conifer 464 tok/s · llama.cpp 565 tok/s · mlx n/a

Memory-bandwidth utilization (η) (higher is better)

conifer n/a · llama.cpp 0.970 · mlx n/a
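As a rough sanity check, the time-to-first-token and prefill rows above can be cross-checked against each other. The sketch below assumes that time to first token is dominated by prefilling the 512-token prompt, so that TTFT ≈ 512 / prefill throughput. That relationship is an assumption made here for illustration, not a statement of how either row was measured, and the function name is hypothetical.

```python
# Cross-check: does the reported TTFT match 512 tokens / prefill throughput?
# Assumption (not from the benchmark page): TTFT is dominated by prompt prefill.
PROMPT_TOKENS = 512

# Values copied from the Llama-3.1-8B rows above.
reported = {
    "conifer": {"prefill_tok_s": 464.0, "ttft_ms": 1103.0},
    "llama.cpp": {"prefill_tok_s": 565.0, "ttft_ms": 905.0},
}


def implied_ttft_ms(prompt_tokens: int, prefill_tok_s: float) -> float:
    """Milliseconds needed to prefill `prompt_tokens` at the given rate."""
    return 1000.0 * prompt_tokens / prefill_tok_s


implied = {
    engine: implied_ttft_ms(PROMPT_TOKENS, row["prefill_tok_s"])
    for engine, row in reported.items()
}

for engine, row in reported.items():
    print(f"{engine}: implied {implied[engine]:.1f} ms, "
          f"reported {row['ttft_ms']:.0f} ms")
```

Under that assumption the implied and reported values agree to within about a millisecond for both engines, which suggests the two rows describe the same prefill pass.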

Size sweep · decode

TinyLlama-1.1B

Decode throughput (higher is better)

conifer 226 tok/s · llama.cpp 213 tok/s · mlx n/a

Gemma-2-2B

Decode throughput (higher is better)

conifer 97.2 tok/s · llama.cpp 84.4 tok/s · mlx 128 tok/s

Qwen2.5-0.5B

Decode throughput (higher is better)

conifer 252 tok/s · llama.cpp 250 tok/s · mlx 305 tok/s
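To read the decode figures across model sizes at a glance, the conifer-to-llama.cpp ratio can be computed directly from the reported numbers. This is a small illustrative sketch; the `speedup` helper is a hypothetical name, and the only inputs are the decode throughputs listed on this page.

```python
# Decode throughput (tok/s) as reported on this page, smallest model first.
decode_tok_s = {
    "Qwen2.5-0.5B": {"conifer": 252.0, "llama.cpp": 250.0},
    "TinyLlama-1.1B": {"conifer": 226.0, "llama.cpp": 213.0},
    "Gemma-2-2B": {"conifer": 97.2, "llama.cpp": 84.4},
    "Llama-3.1-8B": {"conifer": 44.0, "llama.cpp": 50.6},
}


def speedup(row: dict) -> float:
    """conifer throughput relative to llama.cpp (>1 means conifer is faster)."""
    return row["conifer"] / row["llama.cpp"]


ratios = {model: speedup(row) for model, row in decode_tok_s.items()}

for model, ratio in ratios.items():
    print(f"{model}: {ratio:.2f}x")
```

On these numbers conifer is ahead of llama.cpp on the three smaller models and behind on Llama-3.1-8B.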

Methodology

conifer v0 on the merged perf/kernels engine (commit e2ae06a, 2026-05-24): fused single-pass quant prefill GEMMs + nostage decode GEMVs. Warm best-of-3, Metal backend, Apple M3 Max. llama.cpp b9110 and MLX are reference baselines measured on the same machine; llama.cpp runs the identical GGUF, while MLX runs its own 4-bit weights. For MLX, n/a means the model was not downloaded. η is computed on conifer's measured roofline for all engines.

Same machine, same 512/128 window, batch 1. conifer: Metal backend, release build. llama.cpp: llama-bench, Metal. MLX: mlx-lm stream_generate. GGUF Q4_K_M and MLX-4bit are both ~4-bit but not bit-identical quantizations — the standard cross-format comparison. η is computed on conifer's exact bytes/token (at the model's context capacity) and measured bandwidth for every engine, so all three share one roofline. It is a relative weight-streaming utilization figure — ratio-preserving across engines, not a vendor performance claim. n/a appears wherever a value was not earned.
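The η definition above can be sketched as a one-line formula: decode throughput times bytes streamed per token, divided by measured bandwidth. The sketch below is a minimal illustration of that reading, not conifer's actual instrumentation. The function name, the bytes/token figure, and the bandwidth figure are all hypothetical placeholders, since the page does not publish the roofline inputs.

```python
def eta(decode_tok_s: float, bytes_per_token: float,
        bandwidth_bytes_s: float) -> float:
    """Fraction of measured memory bandwidth used to stream weights in decode."""
    return decode_tok_s * bytes_per_token / bandwidth_bytes_s


# Hypothetical placeholder inputs, NOT taken from the benchmark.
BYTES_PER_TOKEN = 5.0e9     # bytes read per generated token (assumed)
BANDWIDTH = 300.0e9         # measured bandwidth in bytes/s (assumed)

# Two hypothetical engines sharing the same roofline inputs.
eta_a = eta(40.0, BYTES_PER_TOKEN, BANDWIDTH)
eta_b = eta(50.0, BYTES_PER_TOKEN, BANDWIDTH)

print(f"eta_a = {eta_a:.3f}, eta_b = {eta_b:.3f}")
print(f"eta ratio = {eta_a / eta_b:.3f}, throughput ratio = {40.0 / 50.0:.3f}")
```

Because every engine is scored against the same bytes/token and bandwidth, the ratio of two η values equals the ratio of the two decode throughputs, which is what "ratio-preserving across engines" means here.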

Generated 2026-05-24.