01 // Inference benchmarks
Single-stream decode · llama.cpp
# env llama.cpp b4732 · 4096 ctx · batch=1 · prompt=512 · temp=0.0 · median of 5 runs
01b // Performance across quantization
vs. nearest competitors
How tok/s scales from FP16 → Q8 → Q4 compared to GPUs in a similar price/VRAM range.
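A rough way to reason about this scaling: at batch=1, each generated token has to stream roughly all of the model weights from memory once, so decode speed is bounded by memory bandwidth divided by weight size. The Python sketch below computes that ceiling. The 819 GB/s figure comes from the spec table. The 70B parameter count is purely illustrative, and the bits-per-weight values are assumptions based on the GGUF Q8_0 (8.5) and Q4_0 (4.5) block formats. Measured throughput will be lower, since the estimate ignores KV cache reads, compute time, and runtime overhead.

```python
# Bandwidth-bound ceiling for single-stream decode (a sketch, not a benchmark).
BANDWIDTH_GBPS = 819.0  # memory bandwidth from the spec table, in GB/s

# Assumed effective bits per weight, including per-block scale factors.
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q4_0": 4.5}


def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def decode_ceiling_tok_s(params_billion: float, bits_per_weight: float,
                         bandwidth_gbps: float = BANDWIDTH_GBPS) -> float:
    """Upper bound on tok/s if every weight is read exactly once per token."""
    return bandwidth_gbps / weight_size_gb(params_billion, bits_per_weight)


if __name__ == "__main__":
    params_b = 70.0  # hypothetical dense model size, for illustration only
    for fmt, bits in BITS_PER_WEIGHT.items():
        size = weight_size_gb(params_b, bits)
        ceiling = decode_ceiling_tok_s(params_b, bits)
        print(f"{fmt:>5}: {size:6.1f} GB weights, <= {ceiling:5.1f} tok/s")
```

Under these assumptions the ceiling rises as the weights shrink: halving the bits per weight roughly doubles the bandwidth-bound tok/s limit.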
02 // Hardware specs
Architecture: M3 Ultra
Process node: TSMC 3nm
Memory: 512 GB
Memory bandwidth: 819 GB/s
FP16 compute: 57 TFLOPS
INT8 compute: 114 TOPS
TDP: 295 W
PCIe: Unified
Form factor: Desktop
Cooling: Active
03 // Model fit
Approximate VRAM required to load weights + 4096 ctx KV cache.
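As a rough illustration of how such an estimate can be put together, the Python sketch below adds the quantized weight size to an FP16 KV cache for a 4096-token context. The model shape (70B parameters, 80 layers, 8 KV heads, head dimension 128) is a hypothetical example, not a figure from this page, and the 4.5 bits per weight assumes a Q4_0-style format. Runtime buffers and activations are not counted, and the usable share of the 512 GB is left as a parameter because not all of it may be available to the model.

```python
def weights_bytes(n_params: float, bits_per_weight: float) -> float:
    """Bytes needed to hold the weights at the given average bit width."""
    return n_params * bits_per_weight / 8.0


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """Bytes for the KV cache: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def fits(total_bytes: float, budget_gb: float = 512.0) -> bool:
    """True if the estimate fits in the memory budget (1 GB = 1e9 bytes)."""
    return total_bytes <= budget_gb * 1e9


if __name__ == "__main__":
    # Hypothetical 70B-class model shape, chosen only for illustration.
    w = weights_bytes(70e9, bits_per_weight=4.5)      # assumed Q4_0-style
    kv = kv_cache_bytes(n_layers=80, n_kv_heads=8,
                        head_dim=128, n_ctx=4096)     # FP16 cache entries
    total = w + kv
    print(f"weights:  {w / 1e9:7.2f} GB")
    print(f"KV cache: {kv / 1e9:7.2f} GB")
    print(f"total:    {total / 1e9:7.2f} GB, fits in 512 GB: {fits(total)}")
```

At this context length the weights dominate the total; the KV cache term grows linearly with context, so it matters more for long-context runs.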
+ STRENGTHS
- ✓ 512 GB VRAM is enough for 200B+ models at Q4
- ✓ 819 GB/s memory bandwidth · top tier in its class
- ✓ Strong tooling: FP16, Q8, Q4, MLX all officially supported
− TRADE-OFFS
- − Draws 295 W under load; plan PSU and thermals accordingly
- − $9,499 puts this firmly in the pro tier
- − Mac-only: CUDA tooling won't run
related research
Research behind Apple M3 Ultra inference tradeoffs
These papers explain the quantization, cache, bandwidth, and runtime constraints that matter before buying this chip for local AI.
Local AI inference papers
llama.cpp, Apple Silicon, constrained GPUs, offload, and one-box inference.
KV cache optimization papers
Cache quantization, compression, reuse, and long-context memory pressure.
LLM quantization research
GPTQ, AWQ, GGUF, FP4, NF4, and what low-bit formats mean for VRAM fit.