01 // Inference benchmarks
Single-stream decode · llama.cpp
# env llama.cpp b4732 · 4096 ctx · batch=1 · prompt=512 · temp=0.0 · median of 5 runs
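Numbers like these are typically gathered with llama.cpp's bundled `llama-bench` tool. A sketch of an equivalent invocation follows; the model path is a placeholder and the flag choices are assumptions, not the exact command used for this page:

```shell
# Hypothetical benchmark run (model path is a placeholder).
# -p 512 matches the 512-token prompt, -n sets generated tokens,
# -r 5 runs five repetitions. llama-bench reports throughput per run,
# so the median can be taken from the per-run output.
./llama-bench -m models/model-q4_k_m.gguf -p 512 -n 128 -r 5
```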
02 // Hardware specs
Architecture: M3 Ultra
Process node: TSMC 3 nm
Memory: 512 GB
Memory bandwidth: 819 GB/s
FP16 compute: 57 TFLOPS
INT8 compute: 114 TOPS
TDP: 295 W
PCIe: Unified
Form factor: Desktop
Cooling: Active
03 // Model fit
Approximate VRAM required to load weights + 4096 ctx KV cache.
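The fit estimate above can be sketched with simple arithmetic: weight bytes (parameters × bits per weight) plus the KV cache (K and V tensors for every layer at the full context). The model shapes below are illustrative assumptions for a 70B-class model with grouped-query attention, not measured values; check the model card for real shapes.

```python
def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB for params_b billion parameters."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                ctx: int, bytes_per_elem: int = 2) -> float:
    """KV cache: 2 tensors (K and V) per layer, ctx tokens each, FP16 by default."""
    return 2 * layers * ctx * kv_heads * head_dim * bytes_per_elem / 1e9

# Hypothetical 70B-class model at ~4.5 bits/weight (Q4-ish), shapes assumed:
total = weights_gb(70, 4.5) + kv_cache_gb(layers=80, kv_heads=8,
                                          head_dim=128, ctx=4096)
print(f"~{total:.1f} GB")  # ~40.7 GB, well inside 512 GB
```

With grouped-query attention the 4096-token cache adds only about a gigabyte, so at these context lengths the weights dominate the fit.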
+ STRENGTHS
- ✓ 512 GB of unified memory is enough for 200B+ models at Q4
- ✓ 819 GB/s memory bandwidth · top tier in its class
- ✓ Strong tooling: FP16, Q8, Q4, and MLX all officially supported
− TRADE-OFFS
- − Draws 295 W under load · plan PSU and thermals accordingly
- − $9,499 puts this firmly in the pro tier
- − Mac-only · CUDA tooling won't run