# GPU Hunter Full LLM Context

> GPU Hunter is a research-backed GPU index for local AI inference.

## Site Summary

GPU Hunter helps engineers and builders choose GPUs for local AI inference by comparing VRAM, memory bandwidth, price, power, quantization support, and published local LLM benchmark data. The commercial intent is hardware selection: route readers toward /browse, /compare, GPU detail pages, and buying guides.

## Freshness

- GPU data last updated: Apr 30, 2026
- Research library last updated: May 27, 2026

## Current Data Scope

- GPUs tracked: 20
- Models profiled: 10
- Primary hardware decision pages: https://www.gpuhunter.io/browse, https://www.gpuhunter.io/compare, https://www.gpuhunter.io/gpu/[id]
- Research hub: https://www.gpuhunter.io/research

## GPU Data Overview

- RTX PRO 6000 Blackwell (rtx-pro-6000-blackwell): NVIDIA Blackwell, 96GB VRAM, 1792 GB/s bandwidth, 141 tok/s Llama 8B Q4, $8499; canonical https://www.gpuhunter.io/gpu/rtx-pro-6000-blackwell
- GeForce RTX 5090 (rtx-5090): NVIDIA Blackwell, 32GB VRAM, 1792 GB/s bandwidth, 145 tok/s Llama 8B Q4, $1999; canonical https://www.gpuhunter.io/gpu/rtx-5090
- GeForce RTX 4090 (rtx-4090): NVIDIA Ada Lovelace, 24GB VRAM, 1008 GB/s bandwidth, 104 tok/s Llama 8B Q4, $1799; canonical https://www.gpuhunter.io/gpu/rtx-4090
- GeForce RTX 3090 (rtx-3090): NVIDIA Ampere, 24GB VRAM, 936 GB/s bandwidth, 87 tok/s Llama 8B Q4, $749; canonical https://www.gpuhunter.io/gpu/rtx-3090
- NVIDIA DGX Spark (dgx-spark): NVIDIA GB10 Grace Blackwell, 128GB VRAM, 273 GB/s bandwidth, 45 tok/s Llama 8B Q4, $3999; canonical https://www.gpuhunter.io/gpu/dgx-spark
- Apple M3 Ultra (m3-ultra): Apple M3 Ultra, 512GB VRAM, 819 GB/s bandwidth, 92 tok/s Llama 8B Q4, $9499; canonical https://www.gpuhunter.io/gpu/m3-ultra
- Apple M4 Max (m4-max): Apple M4 Max, 128GB VRAM, 546 GB/s bandwidth, 83 tok/s Llama 8B Q4, $4699; canonical https://www.gpuhunter.io/gpu/m4-max
- GeForce RTX 5080 (rtx-5080): NVIDIA Blackwell, 16GB VRAM, 960 GB/s bandwidth, 92 tok/s Llama 8B Q4, $999; canonical https://www.gpuhunter.io/gpu/rtx-5080
- GeForce RTX 5070 Ti (rtx-5070-ti): NVIDIA Blackwell, 16GB VRAM, 896 GB/s bandwidth, 86 tok/s Llama 8B Q4, $749; canonical https://www.gpuhunter.io/gpu/rtx-5070-ti
- GeForce RTX 5070 (rtx-5070): NVIDIA Blackwell, 12GB VRAM, 672 GB/s bandwidth, 65 tok/s Llama 8B Q4, $549; canonical https://www.gpuhunter.io/gpu/rtx-5070
- GeForce RTX 4080 SUPER (rtx-4080-super): NVIDIA Ada Lovelace, 16GB VRAM, 736 GB/s bandwidth, 78 tok/s Llama 8B Q4, $899; canonical https://www.gpuhunter.io/gpu/rtx-4080-super
- GeForce RTX 4070 Ti SUPER (rtx-4070-ti-super): NVIDIA Ada Lovelace, 16GB VRAM, 672 GB/s bandwidth, 70 tok/s Llama 8B Q4, $699; canonical https://www.gpuhunter.io/gpu/rtx-4070-ti-super
- GeForce RTX 3090 Ti (rtx-3090-ti): NVIDIA Ampere, 24GB VRAM, 1008 GB/s bandwidth, 94 tok/s Llama 8B Q4, $849; canonical https://www.gpuhunter.io/gpu/rtx-3090-ti
- GeForce RTX 3060 12GB (rtx-3060-12gb): NVIDIA Ampere, 12GB VRAM, 360 GB/s bandwidth, 40 tok/s Llama 8B Q4, $249; canonical https://www.gpuhunter.io/gpu/rtx-3060-12gb
- Radeon RX 7900 XTX (rx-7900-xtx): AMD RDNA 3, 24GB VRAM, 960 GB/s bandwidth, 66 tok/s Llama 8B Q4, $849; canonical https://www.gpuhunter.io/gpu/rx-7900-xtx
- Radeon RX 9070 XT (rx-9070-xt): AMD RDNA 4, 16GB VRAM, 512 GB/s bandwidth, 56 tok/s Llama 8B Q4, $549; canonical https://www.gpuhunter.io/gpu/rx-9070-xt
- Intel Arc B580 (arc-b580): Intel Xe2-HPG, 12GB VRAM, 456 GB/s bandwidth, 35 tok/s Llama 8B Q4, $249; canonical https://www.gpuhunter.io/gpu/arc-b580
- Apple M4 Pro (m4-pro): Apple M4 Pro, 48GB VRAM, 273 GB/s bandwidth, 51 tok/s Llama 8B Q4, $2499; canonical https://www.gpuhunter.io/gpu/m4-pro
- NVIDIA RTX A6000 (rtx-a6000): NVIDIA Ampere, 48GB VRAM, 768 GB/s bandwidth, 73 tok/s Llama 8B Q4, $2499; canonical https://www.gpuhunter.io/gpu/rtx-a6000
- NVIDIA RTX 6000 Ada (rtx-6000-ada): NVIDIA Ada Lovelace, 48GB VRAM, 960 GB/s bandwidth, 95 tok/s Llama 8B Q4, $6800; canonical https://www.gpuhunter.io/gpu/rtx-6000-ada

## Model VRAM Requirements

- Qwen3 32B: 128k context; VRAM required Q4=19GB, Q8=36GB, FP16=64GB
- Qwen3 72B: 128k context; VRAM required Q4=42GB, Q8=78GB, FP16=144GB
- Qwen3 235B: 128k context; VRAM required Q4=132GB, Q8=240GB, FP16=470GB
- Llama 3.3 70B: 128k context; VRAM required Q4=40GB, Q8=75GB, FP16=140GB
- DeepSeek V3: 128k context; VRAM required Q4=380GB, Q8=700GB, FP16=1300GB
- Llama 3.1 8B: 128k context; VRAM required Q4=5GB, Q8=9GB, FP16=16GB
- Qwen3 14B: 128k context; VRAM required Q4=8GB, Q8=15GB, FP16=28GB
- Mistral 7B: 32k context; VRAM required Q4=4GB, Q8=8GB, FP16=14GB
- Gemma 2 27B: 8k context; VRAM required Q4=16GB, Q8=30GB, FP16=54GB
- Codestral 22B: 32k context; VRAM required Q4=13GB, Q8=24GB, FP16=44GB

## Research Topic URLs

- https://www.gpuhunter.io/research/llm-quantization-papers — Best LLM Quantization Papers for Local AI Inference. Primary keyword: LLM quantization papers. Papers: 2603.08747, 2509.23202, 2601.14277, 2306.00978, 2210.17323, 2211.10438, 2305.14314, 2404.00456, 2401.06118.
- https://www.gpuhunter.io/research/kv-cache-optimization-papers — KV Cache Optimization Papers for Long-Context LLM Inference. Primary keyword: KV cache optimization papers. Papers: 2605.17757, 2604.19157, 2604.05012, 2603.20397, 2602.08005, 2508.10395, 2504.19874, 2502.04420, 2511.01815, 2510.09665, 2401.18079, 2402.02750, 2306.14048.
- https://www.gpuhunter.io/research/local-ai-inference-papers — Local AI Inference Papers for GPUs, llama.cpp and Apple Silicon. Primary keyword: local AI inference papers. Papers: 2605.23057, 2601.14277, 2508.08531, 2506.20187, 2506.03296, 2312.12456, 2303.06865, 2312.11514.
- https://www.gpuhunter.io/research/gpu-inference-optimization-papers — GPU Inference Optimization Papers: Memory Bandwidth, Kernels & Serving. Primary keyword: GPU inference optimization papers. Papers: 2503.08311, 2604.02556, 2601.00227, 2205.14135, 2307.08691, 2311.01282, 2408.11743, 2312.11918.
- https://www.gpuhunter.io/research/llm-serving-systems-papers — LLM Serving Systems Papers for vLLM, PagedAttention and Scheduling. Primary keyword: LLM serving systems papers. Papers: 2504.19720, 2309.06180, 2601.11580, 2605.23057, 2603.10031, 2602.00328, 2510.09665, 2510.18672, 2308.16369, 2403.02310.
- https://www.gpuhunter.io/research/2026-llm-inference-papers — 2026 LLM Inference Papers: Quantization, KV Cache & GPU Systems. Primary keyword: 2026 LLM inference papers. Papers: 2605.23057, 2605.17757, 2604.19157, 2604.05012, 2604.02556, 2603.10031, 2603.08747, 2603.20397, 2602.08005, 2602.00328, 2601.14277, 2601.11580, 2604.04722.

## Research Year URLs

- https://www.gpuhunter.io/research/year/2026 — 2026 LLM Inference Research Papers. Papers: 2605.23057, 2605.17757, 2604.19157, 2604.05012, 2604.04722, 2604.02556, 2603.20397, 2603.10031, 2603.08747, 2602.08005, 2602.00328, 2601.14277, 2601.11580, 2601.07475, 2601.00227.
- https://www.gpuhunter.io/research/year/2025 — 2025 LLM Inference Research Papers. Papers: 2511.01815, 2510.18672, 2510.09665, 2509.23202, 2508.10395, 2508.08531, 2506.20187, 2506.03296, 2504.19874, 2504.19720, 2503.08311, 2502.04420.

## Top Research Paper URLs

- https://www.gpuhunter.io/research/papers/2605.23057 — ModeSwitch-LLM: A Lightweight Phase-Aware Controller for Cross-Mode LLM Inference on a Single GPU. Local Inference, 2026. Primary keyword: single GPU LLM inference optimization. Takeaway: A desktop GPU can serve different prompts better when the runtime switches strategy per request instead of locking into one mode. Sources: https://arxiv.org/abs/2605.23057, https://arxiv.org/pdf/2605.23057, https://huggingface.co/papers/2605.23057.
- https://www.gpuhunter.io/research/papers/2605.17757 — OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization. KV Cache, 2026. Primary keyword: 2-bit KV cache quantization. Takeaway: Long-context claims on consumer GPUs will increasingly depend on KV-cache-specific quantization, not just weight quantization. Sources: https://arxiv.org/abs/2605.17757, https://arxiv.org/pdf/2605.17757, https://huggingface.co/papers/2605.17757.
- https://www.gpuhunter.io/research/papers/2604.19157 — SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving. KV Cache, 2026. Primary keyword: 4-bit KV cache quantization. Takeaway: For vLLM-style serving, deployability matters as much as compression ratio; the cache format has to fit the kernel path. Sources: https://arxiv.org/abs/2604.19157, https://arxiv.org/pdf/2604.19157, https://huggingface.co/papers/2604.19157.
- https://www.gpuhunter.io/research/papers/2604.05012 — Comparative Characterization of KV Cache Management Strategies for LLM Inference. KV Cache, 2026. Primary keyword: KV cache management strategies. Takeaway: The right KV cache strategy depends on workload shape; no one approach wins across every GPU, context length, and batch size. Sources: https://arxiv.org/abs/2604.05012, https://arxiv.org/pdf/2604.05012, https://huggingface.co/papers/2604.05012.
- https://www.gpuhunter.io/research/papers/2604.02556 — Fast NF4 Dequantization Kernels for Large Language Model Inference. Kernels, 2026. Primary keyword: NF4 dequantization kernels. Takeaway: Quantized model size is not enough; GPU Hunter benchmarks should care whether the backend has fast unpacking and dequant kernels. Sources: https://arxiv.org/abs/2604.02556, https://arxiv.org/pdf/2604.02556, https://huggingface.co/papers/2604.02556.
- https://www.gpuhunter.io/research/papers/2603.10031 — Architecture-Aware LLM Inference Optimization on AMD Instinct GPUs: A Comprehensive Benchmark and Deployment Study. Serving, 2026. Primary keyword: AMD Instinct LLM inference optimization. Takeaway: For AMD datacenter GPUs, model architecture, KV offload support, runtime kernels, and block size choices can dominate hardware specs. Sources: https://arxiv.org/abs/2603.10031, https://arxiv.org/pdf/2603.10031, https://huggingface.co/papers/2603.10031.
- https://www.gpuhunter.io/research/papers/2603.08747 — Diagnosing FP4 inference: a layer-wise and block-wise sensitivity analysis of NVFP4 and MXFP4. Quantization, 2026. Primary keyword: FP4 inference sensitivity analysis. Takeaway: Treat FP4 as a hardware capability that still needs model-aware quantization policy, not a universal speed switch. Sources: https://arxiv.org/abs/2603.08747, https://arxiv.org/pdf/2603.08747, https://huggingface.co/papers/2603.08747.
- https://www.gpuhunter.io/research/papers/2603.20397 — KV Cache Optimization Strategies for Scalable and Efficient LLM Inference. KV Cache, 2026. Primary keyword: KV cache optimization strategies. Takeaway: Use this to decide whether a workload needs cache compression, offload, eviction, or a hybrid policy before buying more VRAM. Sources: https://arxiv.org/abs/2603.20397, https://arxiv.org/pdf/2603.20397, https://huggingface.co/papers/2603.20397.
- https://www.gpuhunter.io/research/papers/2602.08005 — DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity. KV Cache, 2026. Primary keyword: residual KV cache compression. Takeaway: For agent and long-document workloads, cache compression quality may matter more than raw single-token decode speed. Sources: https://arxiv.org/abs/2602.08005, https://arxiv.org/pdf/2602.08005, https://huggingface.co/papers/2602.08005.
- https://www.gpuhunter.io/research/papers/2602.00328 — Harvest: Opportunistic Peer-to-Peer GPU Caching for LLM Inference. Serving, 2026. Primary keyword: peer-to-peer GPU caching for LLM inference. Takeaway: For dual-GPU and workstation setups, interconnect bandwidth and cache placement can change whether extra GPUs actually help. Sources: https://arxiv.org/abs/2602.00328, https://arxiv.org/pdf/2602.00328, https://huggingface.co/papers/2602.00328.
- https://www.gpuhunter.io/research/papers/2601.14277 — Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct. Local Inference, 2026. Primary keyword: llama.cpp quantization evaluation. Takeaway: A GPU recommendation is incomplete without naming the quantization formats that are realistic for that VRAM tier. Sources: https://arxiv.org/abs/2601.14277, https://arxiv.org/pdf/2601.14277, https://huggingface.co/papers/2601.14277.
- https://www.gpuhunter.io/research/papers/2601.11580 — Speculative Decoding: Performance or Illusion? Serving, 2026. Primary keyword: speculative decoding performance. Takeaway: Use speculative decoding carefully: the speedup depends on workload, draft method, batch size, and engine implementation. Sources: https://arxiv.org/abs/2601.11580, https://arxiv.org/pdf/2601.11580, https://huggingface.co/papers/2601.11580.
- https://www.gpuhunter.io/research/papers/2509.23202 — Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization. Quantization, 2025. Primary keyword: microscaling FP4 quantization. Takeaway: RTX 5090 and Blackwell FP4 claims should be judged against real FP4 kernels and accuracy, not just advertised tensor formats. Sources: https://arxiv.org/abs/2509.23202, https://arxiv.org/pdf/2509.23202, https://huggingface.co/papers/2509.23202.
- https://www.gpuhunter.io/research/papers/2508.08531 — Profiling Large Language Model Inference on Apple Silicon: A Quantization Perspective. Local Inference, 2025. Primary keyword: Apple Silicon LLM inference profiling. Takeaway: Mac recommendations should compare quantized throughput, memory pressure, and unified-memory behavior against CUDA GPUs. Sources: https://arxiv.org/abs/2508.08531, https://arxiv.org/pdf/2508.08531, https://huggingface.co/papers/2508.08531.
- https://www.gpuhunter.io/research/papers/2508.10395 — XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization. KV Cache, 2025. Primary keyword: KV cache rematerialization. Takeaway: A faster GPU is not always the answer; sometimes spending compute to reduce memory movement is the better trade. Sources: https://arxiv.org/abs/2508.10395, https://arxiv.org/pdf/2508.10395, https://huggingface.co/papers/2508.10395.
- https://www.gpuhunter.io/research/papers/2506.20187 — Breaking the Boundaries of Long-Context LLM Inference: Adaptive KV Management on a Single Commodity GPU. Local Inference, 2025. Primary keyword: long-context LLM inference on one GPU. Takeaway: A single desktop GPU can handle longer contexts when the runtime manages GPU, CPU, and disk tiers deliberately. Sources: https://arxiv.org/abs/2506.20187, https://arxiv.org/pdf/2506.20187, https://huggingface.co/papers/2506.20187.
- https://www.gpuhunter.io/research/papers/2506.03296 — Parallel CPU-GPU Execution for LLM Inference on Constrained GPUs. Local Inference, 2025. Primary keyword: CPU GPU execution for constrained LLM inference. Takeaway: Offload is only useful when CPU work, GPU kernels, and transfers overlap; otherwise it just makes a barely fitting model slow. Sources: https://arxiv.org/abs/2506.03296, https://arxiv.org/pdf/2506.03296, https://huggingface.co/papers/2506.03296.
- https://www.gpuhunter.io/research/papers/2504.19874 — TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. KV Cache, 2025. Primary keyword: online vector quantization for KV cache. Takeaway: KV cache compression can expand usable context on the same GPU, but implementation quality determines whether the promise reaches local runtimes. Sources: https://arxiv.org/abs/2504.19874, https://arxiv.org/pdf/2504.19874, https://huggingface.co/papers/2504.19874.
- https://www.gpuhunter.io/research/papers/2504.19720 — Taming the Titans: A Survey of Efficient LLM Inference Serving. Serving, 2025. Primary keyword: efficient LLM inference serving survey. Takeaway: Use this as a 2025 baseline for the serving stack before comparing vLLM, SGLang, TensorRT-LLM, and local runtimes. Sources: https://arxiv.org/abs/2504.19720, https://arxiv.org/pdf/2504.19720, https://huggingface.co/papers/2504.19720.
- https://www.gpuhunter.io/research/papers/2503.08311 — Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference. Kernels, 2025. Primary keyword: GPU memory bandwidth LLM inference. Takeaway: Batch-size scaling should be benchmarked against memory bandwidth behavior, not inferred from TFLOPS alone. Sources: https://arxiv.org/abs/2503.08311, https://arxiv.org/pdf/2503.08311, https://huggingface.co/papers/2503.08311.
- https://www.gpuhunter.io/research/papers/2502.04420 — KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference. KV Cache, 2025. Primary keyword: mixed precision KV cache quantization. Takeaway: KV cache quantization should be model-aware; uniform low-bit settings can waste quality or memory depending on the layer. Sources: https://arxiv.org/abs/2502.04420, https://arxiv.org/pdf/2502.04420, https://huggingface.co/papers/2502.04420.
- https://www.gpuhunter.io/research/papers/2511.01815 — KV Cache Transform Coding for Compact Storage in LLM Inference. KV Cache, 2025. Primary keyword: KV cache transform coding. Takeaway: Persistent KV cache storage can become a product feature for coding agents and local chat apps, not just a runtime optimization. Sources: https://arxiv.org/abs/2511.01815, https://arxiv.org/pdf/2511.01815, https://huggingface.co/papers/2511.01815.
- https://www.gpuhunter.io/research/papers/2510.09665 — LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference. KV Cache, 2025. Primary keyword: LMCache KV cache layer. Takeaway: For hosted local-AI products, cache reuse and prefill-decode disaggregation can beat simply adding more GPUs. Sources: https://arxiv.org/abs/2510.09665, https://arxiv.org/pdf/2510.09665, https://huggingface.co/papers/2510.09665.
- https://www.gpuhunter.io/research/papers/2510.18672 — Reasoning Language Model Inference Serving Unveiled: An Empirical Study. Serving, 2025. Primary keyword: reasoning model inference serving. Takeaway: Quantization and speculative decoding can help reasoning workloads, but prefix caching and KV quantization may not always pay off. Sources: https://arxiv.org/abs/2510.18672, https://arxiv.org/pdf/2510.18672, https://huggingface.co/papers/2510.18672.
- https://www.gpuhunter.io/research/papers/2604.04722 — Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs. KV Cache, 2026. Primary keyword: adaptive KV cache quantization. Takeaway: Small GPUs, laptops, and edge boxes need cache precision policies that spend memory only where the model is sensitive. Sources: https://arxiv.org/abs/2604.04722, https://arxiv.org/pdf/2604.04722, https://huggingface.co/papers/2604.04722.

## GPU Hunter Research Conclusions

- Quantization affects GPU choice because it changes VRAM fit, but speed depends on dequantization and runtime kernel support.
- KV cache optimization affects long-context inference because cache memory can dominate after model weights fit.
- Memory bandwidth and kernel maturity are practical inference features, not secondary specs.
- Serving systems need batching, scheduling, cache reuse, and workload-aware runtime decisions; single-stream tok/s is not enough for shared GPU servers.
- Apple Silicon, ROCm, CUDA, and Blackwell FP4 should be evaluated by backend support and workload fit, not only by spec sheets.
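
The first two conclusions (quantization changes VRAM fit, and cache memory matters once weights fit) can be illustrated with a small check against the figures listed in GPU Data Overview and Model VRAM Requirements. This is a minimal sketch, not GPU Hunter's actual selection logic: the `Gpu` class, the `gpus_that_fit` function, and the `headroom_gb` parameter are hypothetical names introduced here, and only a subset of the tracked GPUs and models is copied in. The `headroom_gb` knob is an assumption that stands in for extra memory a workload might want beyond the listed requirement, for example a larger KV cache.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    """One row from GPU Data Overview (hypothetical container, not a site API)."""
    gpu_id: str
    vram_gb: int
    bandwidth_gbs: int
    tok_s_llama8b_q4: int
    price_usd: int


# Subset of the GPU Data Overview list, copied as published.
GPUS = [
    Gpu("rtx-5090", 32, 1792, 145, 1999),
    Gpu("rtx-4090", 24, 1008, 104, 1799),
    Gpu("rtx-3090", 24, 936, 87, 749),
    Gpu("rtx-a6000", 48, 768, 73, 2499),
    Gpu("m3-ultra", 512, 819, 92, 9499),
    Gpu("rtx-3060-12gb", 12, 360, 40, 249),
]

# Subset of Model VRAM Requirements, in GB per quantization level.
MODEL_VRAM_GB = {
    "Llama 3.1 8B": {"Q4": 5, "Q8": 9, "FP16": 16},
    "Qwen3 32B": {"Q4": 19, "Q8": 36, "FP16": 64},
    "Llama 3.3 70B": {"Q4": 40, "Q8": 75, "FP16": 140},
}


def gpus_that_fit(model: str, quant: str, headroom_gb: float = 0.0) -> list[str]:
    """Return ids of GPUs whose VRAM covers the listed requirement plus headroom.

    headroom_gb is an assumed safety margin, not a value published by the site.
    """
    required = MODEL_VRAM_GB[model][quant] + headroom_gb
    return [gpu.gpu_id for gpu in GPUS if gpu.vram_gb >= required]


if __name__ == "__main__":
    # Same model, different quantization levels: the candidate list shrinks.
    for quant in ("Q4", "Q8", "FP16"):
        print("Qwen3 32B", quant, gpus_that_fit("Qwen3 32B", quant))
    # Adding headroom removes the 24GB cards from the Q4 list.
    print("Qwen3 32B Q4 +6GB", gpus_that_fit("Qwen3 32B", "Q4", headroom_gb=6))
```

A fit check like this only answers whether a model loads; per the remaining conclusions, throughput still depends on memory bandwidth, kernel support, and the serving runtime, so it is best read as a first filter before comparing tok/s figures.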
## Canonical Routes

- https://www.gpuhunter.io/browse
- https://www.gpuhunter.io/compare
- https://www.gpuhunter.io/research
- https://www.gpuhunter.io/research/llm-quantization-papers
- https://www.gpuhunter.io/research/kv-cache-optimization-papers
- https://www.gpuhunter.io/research/local-ai-inference-papers
- https://www.gpuhunter.io/research/gpu-inference-optimization-papers
- https://www.gpuhunter.io/research/llm-serving-systems-papers
- https://www.gpuhunter.io/research/2026-llm-inference-papers
- https://www.gpuhunter.io/research/year/2026
- https://www.gpuhunter.io/research/year/2025