01 // Inference benchmarks
Single-stream decode · llama.cpp
# env llama.cpp b4732 · 4096 ctx · batch=1 · prompt=512 · temp=0.0 · median of 5 runs
01b // Performance across quantization
vs. nearest competitors
How tok/s scales from FP16 → Q8 → Q4 compared to GPUs in a similar price/VRAM range.
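As a rough guide to why tok/s rises as weights shrink, single-stream decode is usually limited by memory bandwidth: each generated token has to stream the full set of weights from VRAM. The sketch below turns that into a theoretical ceiling using the 960 GB/s figure listed under hardware specs. The bits-per-weight values and the 8B model size are illustrative assumptions, not measurements from this page, and real throughput lands below the ceiling.

```python
# Rough upper bound on single-stream decode speed for a bandwidth-bound GPU.
# Assumption: every generated token streams all model weights from VRAM once.

BANDWIDTH_GBPS = 960.0  # memory bandwidth from the spec list, in GB/s

# Approximate bits per weight. The Q8 and Q4 values are assumed averages
# that include quantization scales, not exact figures for any one format.
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8": 8.5, "Q4": 4.5}


def decode_ceiling_tok_s(params_billion: float, bits_per_weight: float,
                         bandwidth_gbps: float = BANDWIDTH_GBPS) -> float:
    """Return the theoretical tok/s ceiling: bandwidth / weight bytes."""
    weight_gb = params_billion * bits_per_weight / 8.0  # weights in GB
    return bandwidth_gbps / weight_gb


if __name__ == "__main__":
    params_b = 8.0  # hypothetical 8B-parameter model
    for name, bits in BITS_PER_WEIGHT.items():
        ceiling = decode_ceiling_tok_s(params_b, bits)
        print(f"{name}: at most {ceiling:.0f} tok/s")
```

The ceiling ignores KV cache reads, dequantization cost, and kernel overhead, so it is best read as a sanity check on measured numbers rather than a prediction.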
02 // Hardware specs
Architecture: RDNA 3
Process node: TSMC 5nm + 6nm
Memory: 24 GB
Memory bandwidth: 960 GB/s
FP16 compute: 61.4 TFLOPS
INT8 compute: 123 TOPS
TDP: 355 W
PCIe: Gen 4 x16
Form factor: Triple-slot
Cooling: Axial
03 // Model fit
Approximate VRAM required to load weights + 4096 ctx KV cache.
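A minimal sketch of how such an estimate can be computed: weight memory is parameters times bits per weight, and the KV cache stores one key and one value tensor per layer for every context position. The model shape below (64 layers, 8 KV heads, head dimension 128), the 4.5 bits per weight for Q4, and the FP16 cache are hypothetical values chosen for illustration. The page does not state which model shapes or overheads its own figures use.

```python
# Estimate VRAM for model weights plus KV cache at a given context length.
# All model dimensions passed in below are hypothetical examples.

GIB = 2 ** 30


def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / GIB


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx: int = 4096, bytes_per_elem: int = 2) -> float:
    """Memory for keys and values across all layers, in GiB."""
    # Factor of 2 covers the separate K and V tensors in each layer.
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / GIB


def total_gib(params_billion: float, bits_per_weight: float,
              n_layers: int, n_kv_heads: int, head_dim: int,
              ctx: int = 4096) -> float:
    """Weights plus KV cache; runtime buffers are not included."""
    return (weights_gib(params_billion, bits_per_weight)
            + kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx))


if __name__ == "__main__":
    # Hypothetical 32B-class model at Q4 (assumed ~4.5 bits per weight).
    need = total_gib(32.0, 4.5, n_layers=64, n_kv_heads=8, head_dim=128)
    vram = 24.0  # card capacity, treated here as GiB
    print(f"need ~{need:.1f} GiB of {vram:.0f} GiB, fits: {need < vram}")
```

Compute buffers and runtime overhead come on top of this total, so it is worth leaving a margin before treating a model as a comfortable fit.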
+ STRENGTHS
- ✓ 24 GB VRAM is enough for 32B-class models at Q4
- ✓ 960 GB/s memory bandwidth · top tier in its class
- ✓ Strong tooling: FP16, Q8, Q4 all officially supported
− TRADE-OFFS
- − Draws 355 W under load; plan PSU and thermals accordingly
- − Needs a chassis with room for a triple-slot card
- − Driver lock-in to vendor stack
related research
Research behind Radeon RX 7900 XTX inference tradeoffs
These papers explain the quantization, cache, bandwidth, and runtime constraints that matter before buying this GPU for local AI.
GPU inference optimization papers
Memory bandwidth, FlashAttention, dequant kernels, and backend maturity.
Local AI inference papers
llama.cpp, Apple Silicon, constrained GPUs, offload, and one-box inference.
LLM serving systems papers
vLLM, PagedAttention, speculative decoding, batching, and GPU servers.