Local Inference · 2026 · Intermediate
Aman Sunesh, Ali Alshehhi, Hivansh Dhakne
A single-GPU controller that routes each request across FP16, quantized, speculative, prefix-cached, and batched inference modes using cheap workload features.
GPU Hunter takeaway: A desktop GPU can serve different prompts better when the runtime switches strategy per request instead of locking into one mode.
KV Cache · 2026 · Advanced
Zhongzhu Zhou, Donglin Zhuang, Jisen Li, Ziyan Chen
A 2-bit KV cache quantization method that derives offline rotations and clipping thresholds from attention-aware covariance structure.
GPU Hunter takeaway: Long-context claims on consumer GPUs will increasingly depend on KV-cache-specific quantization, not just weight quantization.
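To make the memory argument concrete, here is a rough sizing sketch. It is not taken from the paper: the model shape (32 layers, 8 KV heads, head dimension 128) is an illustrative assumption for a grouped-query-attention model, and the estimate ignores quantization metadata such as scales, zero-points, and rotation matrices.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bits_per_value, batch_size=1):
    """Approximate KV cache size in bytes.

    Ignores per-group scales/zero-points and allocator padding, so real
    quantized caches will be somewhat larger than this estimate.
    """
    # The factor 2 accounts for storing both keys and values.
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bits_per_value / 8


# Hypothetical GQA model shape (an assumption for illustration only).
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for bits in (16, 4, 2):
    gib = kv_cache_bytes(seq_len=131072, bits_per_value=bits, **shape) / 2**30
    print(f"{bits:>2}-bit KV cache at 128K tokens: {gib:.1f} GiB")
```

Under these assumptions a single 128K-token sequence needs about 16 GiB of cache at 16 bits, 4 GiB at 4 bits, and 2 GiB at 2 bits, which is why cache precision can decide whether a long context fits on a consumer card at all.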
KV Cache · 2026 · Advanced
Jinda Jia, Jisen Li, Zhongzhu Zhou, Jung Hwan Heo, Jue Wang, Tri Dao
A practical INT4 KV-cache method designed around paged layouts, regular memory access, and fused attention execution.
GPU Hunter takeaway: For vLLM-style serving, deployability matters as much as compression ratio; the cache format has to fit the kernel path.
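As a generic illustration of why cache format matters, the sketch below quantizes one fixed-size group of values to 4 bits and packs two codes per byte. This is plain asymmetric min/max quantization, not the method from the paper; the function names and the group layout are hypothetical.

```python
def quantize_int4(values):
    """Asymmetric 4-bit quantization of one group.

    Returns (packed bytes, scale, minimum). Two 4-bit codes share a byte,
    low nibble first, so every group has the same fixed byte length.
    """
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for constant groups
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    if len(codes) % 2:
        codes.append(0)  # pad to an even number of codes
    packed = bytes(codes[i] | (codes[i + 1] << 4)
                   for i in range(0, len(codes), 2))
    return packed, scale, lo


def dequantize_int4(packed, scale, lo, count):
    """Unpack nibbles and map codes back to approximate float values."""
    codes = []
    for b in packed:
        codes.extend((b & 0x0F, b >> 4))
    return [lo + c * scale for c in codes[:count]]


group = [-1.0, -0.5, 0.0, 0.25, 1.0, 0.7, -0.2, 0.9]
packed, scale, lo = quantize_int4(group)
restored = dequantize_int4(packed, scale, lo, len(group))
print(len(packed), "bytes for", len(group), "values")
```

Because every group occupies the same number of bytes, a paged cache can index blocks with simple arithmetic. A production kernel would do the unpacking inside the attention computation rather than in a separate pass.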
KV Cache · 2026 · Starter
Oteo Mamo, Olga Kogiou, Hyunjin Yi, Weikuan Yu
An empirical comparison of KV cache management frameworks including vLLM, InfiniGen, and H2O across latency, throughput, memory, request rates, and model sizes.
GPU Hunter takeaway: The right KV cache strategy depends on workload shape; no one approach wins across every GPU, context length, and batch size.
Kernels · 2026 · Advanced
Xiangbo Qi, Chaoyi Jiang, Murali Annavaram
A shared-memory kernel optimization for accelerating NF4 dequantization on NVIDIA GPUs during quantized LLM inference.
GPU Hunter takeaway: Quantized model size alone is not enough; GPU Hunter benchmarks should also check whether the backend has fast unpacking and dequantization kernels.
Serving · 2026 · Intermediate
Athos Georgiou
A production inference benchmark on AMD Instinct MI325X GPUs across large dense, MoE, MLA, and GQA model families using vLLM.
GPU Hunter takeaway: For AMD datacenter GPUs, model architecture, KV offload support, runtime kernels, and block size choices can matter more than raw hardware specs.
Quantization · 2026 · Intermediate
Musa Cim, Burak Topcu, Mahmut Taylan Kandemir
A component-wise sensitivity study of NVFP4 and MXFP4 quantization across Qwen2.5 model scales and transformer blocks.
GPU Hunter takeaway: Treat FP4 as a hardware capability that still needs model-aware quantization policy, not a universal speed switch.
KV Cache · 2026 · Starter
Yichun Xu, Navjot K. Khaira, Tejinder Singh
A survey that organizes KV cache optimization into eviction, compression, hybrid memory, novel attention, and combined strategies.
GPU Hunter takeaway: Use this to decide whether a workload needs cache compression, offload, eviction, or a hybrid policy before buying more VRAM.
KV Cache · 2026 · Advanced
Jitai Hao, Qiang Huang, Yaowei Wang, Min Zhang, Jun Yu
A residual KV cache compression framework that exploits long-range inter-token similarity and shared latent components.
GPU Hunter takeaway: For agent and long-document workloads, cache compression quality may matter more than raw single-token decode speed.
Serving · 2026 · Intermediate
Nikhil Gopal, Kostis Kaffes
A peer-to-peer GPU cache management framework that uses high-bandwidth GPU interconnects to reduce host-memory offload latency.
GPU Hunter takeaway: For dual-GPU and workstation setups, interconnect bandwidth and cache placement can change whether extra GPUs actually help.
Local Inference · 2026 · Starter
Uygar Kurt
A unified empirical evaluation of llama.cpp quantization formats for Llama-3.1-8B-Instruct on commodity hardware.
GPU Hunter takeaway: A GPU recommendation is incomplete without naming the quantization formats that are realistic for that VRAM tier.
Serving · 2026 · Intermediate
Xiaoxuan Liu, Jiaxiang Yu, Jongseok Park, Ion Stoica, Alvin Cheung
A production-grade vLLM study of speculative decoding variants across workloads, model scales, and batch sizes.
GPU Hunter takeaway: Use speculative decoding carefully: the speedup depends on workload, draft method, batch size, and engine implementation.
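The sensitivity to acceptance rate and draft cost can be sketched with the standard idealized model of speculative decoding. This is not the study's own measurement model. It assumes each drafted token is accepted independently with probability `alpha`, that verifying `gamma` drafted tokens costs one target-model step, and that `cost_ratio` (draft step time divided by target step time) is constant. The verification assumption tends to weaken at large batch sizes, where the target pass becomes compute-bound.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens emitted per target-model pass.

    alpha: per-token acceptance probability (assumed i.i.d.).
    gamma: number of tokens the draft model proposes per step.
    """
    if alpha >= 1.0:
        return gamma + 1.0  # every draft accepted, plus one bonus token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha, gamma, cost_ratio):
    """Idealized wall-clock speedup over plain autoregressive decoding.

    cost_ratio: draft step time / target step time (an assumed constant).
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1.0)


for alpha in (0.5, 0.8):
    s = expected_speedup(alpha, gamma=4, cost_ratio=0.05)
    print(f"alpha={alpha}: about {s:.2f}x")
```

With four drafted tokens and a draft model at 5% of the target's step cost, the idealized speedup is roughly 1.6x at 50% acceptance and 2.8x at 80% acceptance. If acceptance drops far enough, the ratio falls below 1 and speculation slows decoding down.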
KV Cache · 2026 · Intermediate
Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Patrick Woods, Gabriel Hillesheim, Abolfazl Razi
An adaptive KV-cache quantization approach for mobile, embedded, and edge LLM inference where memory bandwidth and cache growth dominate.
GPU Hunter takeaway: Small GPUs, laptops, and edge boxes need cache precision policies that spend memory only where the model is sensitive.