benchmarks · Apple M3 Max

Measured, not promised.

Conifer is v0: correctness-complete and honestly instrumented, with every kernel parity-checked against a CPU oracle. Where it is not yet ahead of llama.cpp, the numbers say so; closing those gaps is the v1 story. All figures were measured on an Apple M3 Max with a 512-token prompt and a 128-token decode window, seed 0, batch 1. The most telling row is η, the fraction of this machine's memory bandwidth each engine actually puts to work.

Llama-3.1-8B · Q4_K_M vs MLX-4bit

Decode throughput (tok/s, higher is better): conifer 44.0 · llama.cpp 50.6 · mlx n/a

Time to first token (ms, lower is better): conifer 1103 · llama.cpp 905 · mlx n/a

Prefill throughput (tok/s, higher is better): conifer 464 · llama.cpp 565 · mlx n/a

Memory-bandwidth utilization η (higher is better): conifer n/a · llama.cpp 0.970 · mlx n/a

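As a rough consistency check, time to first token for a 512-token prompt should sit close to the prompt length divided by prefill throughput. The sketch below applies that to the Llama-3.1-8B figures above. Treating TTFT as pure prefill time is an assumption made here, not something the methodology states, and the function name is illustrative.

```python
# Assumption: TTFT is dominated by prefill, so TTFT ~ prompt_tokens / prefill_rate.
PROMPT_TOKENS = 512  # prompt length of the benchmark window

# (prefill tok/s, reported TTFT in ms) as listed for Llama-3.1-8B
reported = {
    "conifer": (464.0, 1103.0),
    "llama.cpp": (565.0, 905.0),
}


def implied_ttft_ms(prompt_tokens: int, prefill_tok_per_s: float) -> float:
    """TTFT in milliseconds implied by prefill throughput alone."""
    return 1000.0 * prompt_tokens / prefill_tok_per_s


for engine, (prefill, ttft) in reported.items():
    implied = implied_ttft_ms(PROMPT_TOKENS, prefill)
    print(f"{engine}: implied {implied:.0f} ms, reported {ttft:.0f} ms")
```

Under that assumption both implied values land within about 2 ms of the reported ones, which suggests the two rows describe the same prefill pass.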
Size sweep · decode

Decode throughput (tok/s, higher is better)

TinyLlama-1.1B: conifer 226 · llama.cpp 213 · mlx n/a
Gemma-2-2B: conifer 97.2 · llama.cpp 84.4 · mlx 128
Qwen2.5-0.5B: conifer 252 · llama.cpp 250 · mlx 305

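To read the decode rows as one relative figure, conifer's throughput can be divided by llama.cpp's for each model. A minimal sketch using only the decode numbers listed above (the dictionary layout and function name are illustrative):

```python
# Decode throughput in tok/s, copied from the rows above: (conifer, llama.cpp)
decode_tok_s = {
    "Llama-3.1-8B": (44.0, 50.6),
    "TinyLlama-1.1B": (226.0, 213.0),
    "Gemma-2-2B": (97.2, 84.4),
    "Qwen2.5-0.5B": (252.0, 250.0),
}


def relative_decode(conifer: float, llama_cpp: float) -> float:
    """conifer throughput as a multiple of llama.cpp (1.0 means parity)."""
    return conifer / llama_cpp


for model, (conifer, llama_cpp) in decode_tok_s.items():
    ratio = relative_decode(conifer, llama_cpp)
    status = "ahead" if ratio > 1.0 else "behind"
    print(f"{model}: {ratio:.2f}x llama.cpp ({status})")
```

On these figures conifer is ahead of llama.cpp on the three smaller models and behind on Llama-3.1-8B.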
Methodology

conifer v0 on the merged perf/kernels engine (commit e2ae06a, 2026-05-24): fused single-pass quant prefill GEMMs + nostage decode GEMVs. Warm best-of-3, Metal backend, Apple M3 Max. llama.cpp b9110 and MLX are reference baselines measured on the same machine; llama.cpp uses the identical GGUF. In mlx rows, n/a = MLX model not downloaded. η is computed on conifer's measured roofline for all engines.
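"Warm best-of-3" can be read as: run the workload once untimed so caches and compiled kernels are warm, then time three runs and keep the fastest. The harness itself is not shown on this page, so the following is only a sketch of that reading; `best_of`, its parameters, and the injectable clock are hypothetical names.

```python
import time
from typing import Callable


def best_of(run: Callable[[], int], runs: int = 3, warmup: int = 1,
            clock: Callable[[], float] = time.perf_counter) -> float:
    """Return the best tokens/second over `runs` timed calls.

    `run` executes one generation and returns the number of tokens produced.
    Warmup calls are executed but not timed.
    """
    for _ in range(warmup):
        run()
    best = 0.0
    for _ in range(runs):
        start = clock()
        tokens = run()
        elapsed = clock() - start
        if elapsed > 0:
            best = max(best, tokens / elapsed)
    return best


# Stand-in workload (an assumption for illustration): pretend to decode 128 tokens.
def fake_decode() -> int:
    time.sleep(0.01)
    return 128


print(f"{best_of(fake_decode):.0f} tok/s")
```

Taking the best run rather than the mean reports the engine's peak under the least interference, which is worth keeping in mind when comparing against numbers gathered with a different policy.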

Same machine, same 512/128 window, batch 1. conifer: Metal backend, release build. llama.cpp: llama-bench, Metal. MLX: mlx-lm stream_generate. GGUF Q4_K_M and MLX-4bit are both ~4-bit but not bit-identical quantizations; this is the standard cross-format comparison. η is computed for every engine from conifer's exact bytes/token (at the model's context capacity) and the measured bandwidth, so all three share one roofline. It is a relative weight-streaming utilization figure: ratio-preserving across engines, not a vendor performance claim. n/a appears wherever a value was not earned.
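One way to read the η definition above is as achieved weight-streaming bandwidth (bytes read per decoded token times decode rate) divided by the machine's measured bandwidth. The sketch below follows that reading. The byte count and bandwidth are placeholder values chosen for illustration, not conifer's measured figures, and the function name is hypothetical.

```python
def bandwidth_utilization(bytes_per_token: float, tok_per_s: float,
                          measured_bw_bytes_per_s: float) -> float:
    """eta = (bytes streamed per token * tokens per second) / measured bandwidth."""
    if measured_bw_bytes_per_s <= 0:
        raise ValueError("measured bandwidth must be positive")
    return bytes_per_token * tok_per_s / measured_bw_bytes_per_s


# Placeholder inputs (assumptions, not measurements from this page):
BYTES_PER_TOKEN = 5.0e9  # hypothetical bytes read per decoded token
MEASURED_BW = 300.0e9    # hypothetical measured bandwidth in bytes/s

# Every engine shares the same bytes/token and the same denominator,
# so the ratio of two eta values equals the ratio of their decode rates.
eta_a = bandwidth_utilization(BYTES_PER_TOKEN, 50.0, MEASURED_BW)
eta_b = bandwidth_utilization(BYTES_PER_TOKEN, 40.0, MEASURED_BW)
print(eta_a, eta_b, eta_a / eta_b)
```

This is what "ratio-preserving" means in practice: with a shared roofline, η ranks engines exactly as decode throughput does, while also showing how close each one sits to the bandwidth ceiling.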

Generated 2026-05-24.