research cluster / 2026

2026 LLM Inference Papers: Quantization, KV Cache & GPU Systems

Fresh 2026 LLM inference papers on FP4, KV cache quantization, local inference controllers, GPU kernels, AMD serving, and long-context systems.

Updated May 27, 2026 / 13 curated papers / Primary keyword: 2026 LLM inference papers

01 // editorial context

The 2026 inference literature is shifting from generic model compression toward deployment-specific optimization. The strongest papers focus on KV cache formats, phase-aware local inference, FP4 sensitivity, and kernels that make quantization useful in practice.

GPU Hunter uses this page as the freshness layer for visitors who want current arXiv work before choosing a GPU, comparing runtimes, or deciding whether a new hardware feature is worth paying for.

02 // how this changes GPU choice

- Blackwell FP4 needs model-aware quantization and backend support; read the FP4 sensitivity analysis in this set before overvaluing the spec.
- Long-context local inference increasingly depends on KV cache quantization and management, not weight quantization alone.
- Serving and local single-GPU workloads are converging on adaptive runtime decisions.
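
A back-of-envelope calculation shows why the cache matters at long context. The sketch below assumes the Llama-3.1-8B attention shape (32 layers, 8 KV heads, head dimension 128) and ignores quantization metadata such as scales and zero points; `kv_cache_bytes` is an illustrative helper, not part of any runtime.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bits):
    """Bytes for keys plus values across all layers, for one sequence."""
    return 2 * layers * kv_heads * head_dim * tokens * bits // 8

# Assumed model shape: Llama-3.1-8B (grouped-query attention).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
GIB = 1024 ** 3

for tokens in (8_192, 131_072):
    for bits in (16, 4, 2):
        size = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, tokens, bits)
        print(f"{tokens:>7} tokens, {bits:>2}-bit KV: {size / GIB:6.2f} GiB")
```

Under these assumptions the FP16 cache for a single 131,072-token sequence is 16 GiB, several times larger than a 4-bit copy of the 8B weights (roughly 4 to 5 GB), while a 4-bit cache needs 4 GiB.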

starter papers

Read these first if you want the fastest path from research to a hardware decision.

KV Cache / 2026 / Starter

Comparative Characterization of KV Cache Management Strategies for LLM Inference

Oteo Mamo, Olga Kogiou, Hyunjin Yi, Weikuan Yu

An empirical comparison of KV cache management frameworks including vLLM, InfiniGen, and H2O across latency, throughput, memory, request rates, and model sizes.

GPU Hunter takeaway: The right KV cache strategy depends on workload shape; no one approach wins across every GPU, context length, and batch size.

Local Inference / 2026 / Starter

Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct

Uygar Kurt

A unified empirical evaluation of llama.cpp quantization formats for Llama-3.1-8B-Instruct on commodity hardware.

GPU Hunter takeaway: A GPU recommendation is incomplete without naming the quantization formats that are realistic for that VRAM tier.

Local Inference / 2026 / Intermediate

ModeSwitch-LLM: A Lightweight Phase-Aware Controller for Cross-Mode LLM Inference on a Single GPU

Aman Sunesh, Ali Alshehhi, Hivansh Dhakne

A single-GPU controller that routes each request across FP16, quantized, speculative, prefix-cached, and batched inference modes using cheap workload features.

GPU Hunter takeaway: A desktop GPU can serve different prompts better when the runtime switches strategy per request instead of locking into one mode.
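
As a rough illustration of per-request routing, the sketch below picks a mode from a few features that are cheap to compute before execution. It is not the paper's controller: the `Request` fields, the rule order, and every threshold are hypothetical placeholders chosen only to make the idea concrete.

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int         # length of the incoming prompt
    max_new_tokens: int        # requested generation budget
    shared_prefix_tokens: int  # prompt tokens matching an already cached prefix
    queue_depth: int           # other requests currently waiting

def choose_mode(req: Request) -> str:
    """Pick one inference mode; all thresholds are made-up placeholders."""
    if req.queue_depth >= 4:
        return "batched"         # amortize weight reads across requests
    if req.shared_prefix_tokens > 0 and 2 * req.shared_prefix_tokens >= req.prompt_tokens:
        return "prefix_cached"   # skip recomputing most of the prefill
    if req.max_new_tokens >= 256:
        return "speculative"     # long decode: drafted tokens may pay off
    if req.prompt_tokens >= 4096:
        return "quantized"       # long prompt: leave more memory for the cache
    return "fp16"                # short, latency-sensitive default

print(choose_mode(Request(prompt_tokens=200, max_new_tokens=1024,
                          shared_prefix_tokens=0, queue_depth=0)))
```

A real controller would tune or learn these rules against measured latency; the point here is only that the decision can be made per request from inexpensive signals.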

advanced papers

These papers go deeper into kernels, cache policy, low-bit formats, or serving tradeoffs.

KV Cache / 2026 / Advanced

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

Zhongzhu Zhou, Donglin Zhuang, Jisen Li, Ziyan Chen

A 2-bit KV cache quantization method that derives offline rotations and clipping thresholds from attention-aware covariance structure.

GPU Hunter takeaway: Long-context claims on consumer GPUs will increasingly depend on KV-cache-specific quantization, not just weight quantization.
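
To show why a rotation can help before very low-bit quantization, here is a toy sketch. It swaps in a fixed 4x4 Hadamard rotation and a max-based clip instead of the offline, covariance-derived rotations and thresholds the paper describes, so treat it as an illustration of the general idea only. The vector `v` is invented data with one outlier channel.

```python
H4 = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]

def rotate(v):
    """Apply the orthonormal Hadamard matrix H4 / 2, which is its own inverse."""
    return [sum(h * x for h, x in zip(row, v)) / 2 for row in H4]

def quant2(v, clip):
    """Uniform 2-bit quantize-dequantize onto 4 levels in [-clip, clip]."""
    step = 2 * clip / 3
    out = []
    for x in v:
        clipped = min(max(x, -clip), clip)
        code = round((clipped + clip) / step)   # integer code in 0..3
        out.append(code * step - clip)
    return out

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

v = [8.0, 0.1, -0.2, 0.3]                       # toy vector, one outlier channel
plain = quant2(v, clip=max(abs(x) for x in v))
r = rotate(v)
rotated = rotate(quant2(r, clip=max(abs(x) for x in r)))   # quantize, rotate back

print("plain   MSE:", round(mse(v, plain), 3))
print("rotated MSE:", round(mse(v, rotated), 3))
```

On this hand-picked vector the rotation spreads the outlier across all four channels, and the reconstruction error drops from roughly 4.57 to roughly 0.075. Real key and value tensors are not this favorable, which is why data-aware rotations are worth studying.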

KV Cache / 2026 / Advanced

SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving

Jinda Jia, Jisen Li, Zhongzhu Zhou, Jung Hwan Heo, Jue Wang, Tri Dao

A practical INT4 KV-cache method designed around paged layouts, regular memory access, and fused attention execution.

GPU Hunter takeaway: For vLLM-style serving, deployability matters as much as compression ratio; the cache format has to fit the kernel path.
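
A minimal sketch of what a 4-bit cache page involves: one scale per fixed-size page and two codes packed into each byte, so every page has the same size and layout. This is a generic symmetric INT4 scheme under assumed conventions (low nibble first, one scale per page), not the format defined in the paper.

```python
def quantize_page(values):
    """Quantize one page of floats to signed INT4 with a single scale.

    Returns (scale, packed), where each byte holds two 4-bit codes.
    """
    amax = max(abs(x) for x in values) or 1.0
    scale = amax / 7.0                          # map [-amax, amax] onto [-7, 7]
    codes = [max(-7, min(7, round(x / scale))) & 0xF for x in values]
    if len(codes) % 2:
        codes.append(0)                         # pad to a whole byte
    packed = bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))
    return scale, packed

def dequantize_page(scale, packed, count):
    """Unpack nibbles, sign-extend them, and rescale back to floats."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            signed = nibble - 16 if nibble >= 8 else nibble
            out.append(signed * scale)
    return out[:count]

page = [0.5, -1.25, 3.5, -3.5, 0.0, 2.0, -0.75, 1.0]   # toy page of 8 values
scale, packed = quantize_page(page)
restored = dequantize_page(scale, packed, len(page))
print(len(packed), "bytes for", len(page), "values")
print([round(x, 2) for x in restored])
```

In a serving engine the unpack-and-rescale step would run inside the attention kernel rather than as a separate pass, which is why the layout has to match what the kernel can read efficiently.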

Kernels / 2026 / Advanced

Fast NF4 Dequantization Kernels for Large Language Model Inference

Xiangbo Qi, Chaoyi Jiang, Murali Annavaram

A shared-memory kernel optimization for accelerating NF4 dequantization on NVIDIA GPUs during quantized LLM inference.

GPU Hunter takeaway: Quantized model size is not enough; GPU Hunter benchmarks should also check whether the backend has fast unpacking and dequantization kernels.
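
For readers who have not looked inside an NF4 tensor, dequantization is a 16-entry table lookup followed by a per-block rescale. The CPU sketch below shows that logic only. The codebook constants are the NF4 values as published with QLoRA, reproduced from memory, and the nibble order (high nibble first) is an assumption; verify both against your backend before relying on them.

```python
# NF4 codebook: 16 levels in [-1, 1] (assumed constants, verify before use).
NF4_CODEBOOK = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]

def dequantize_nf4(packed: bytes, absmax: float, count: int):
    """Expand one block of packed 4-bit codes into floats.

    Each byte holds two codes; absmax is the block's stored scale.
    """
    out = []
    for byte in packed:
        out.append(NF4_CODEBOOK[byte >> 4] * absmax)    # high nibble (assumed first)
        out.append(NF4_CODEBOOK[byte & 0xF] * absmax)   # low nibble
    return out[:count]

print(dequantize_nf4(bytes([0x0F, 0x7F]), absmax=2.0, count=4))
```

A GPU kernel performs the same lookup for every weight it touches, so where the table lives and how the nibbles are unpacked determine how much of the memory savings turns into speed.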

full curated set

The complete paper set for this topic, with source links and GPU Hunter takeaways.

Local Inference / 2026 / Intermediate

ModeSwitch-LLM: A Lightweight Phase-Aware Controller for Cross-Mode LLM Inference on a Single GPU

Aman Sunesh, Ali Alshehhi, Hivansh Dhakne

A single-GPU controller that routes each request across FP16, quantized, speculative, prefix-cached, and batched inference modes using cheap workload features.

GPU Hunter takeaway: A desktop GPU can serve different prompts better when the runtime switches strategy per request instead of locking into one mode.

KV Cache / 2026 / Advanced

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

Zhongzhu Zhou, Donglin Zhuang, Jisen Li, Ziyan Chen

A 2-bit KV cache quantization method that derives offline rotations and clipping thresholds from attention-aware covariance structure.

GPU Hunter takeaway: Long-context claims on consumer GPUs will increasingly depend on KV-cache-specific quantization, not just weight quantization.

KV Cache / 2026 / Advanced

SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving

Jinda Jia, Jisen Li, Zhongzhu Zhou, Jung Hwan Heo, Jue Wang, Tri Dao

A practical INT4 KV-cache method designed around paged layouts, regular memory access, and fused attention execution.

GPU Hunter takeaway: For vLLM-style serving, deployability matters as much as compression ratio; the cache format has to fit the kernel path.

KV Cache / 2026 / Starter

Comparative Characterization of KV Cache Management Strategies for LLM Inference

Oteo Mamo, Olga Kogiou, Hyunjin Yi, Weikuan Yu

An empirical comparison of KV cache management frameworks including vLLM, InfiniGen, and H2O across latency, throughput, memory, request rates, and model sizes.

GPU Hunter takeaway: The right KV cache strategy depends on workload shape; no one approach wins across every GPU, context length, and batch size.

Kernels / 2026 / Advanced

Fast NF4 Dequantization Kernels for Large Language Model Inference

Xiangbo Qi, Chaoyi Jiang, Murali Annavaram

A shared-memory kernel optimization for accelerating NF4 dequantization on NVIDIA GPUs during quantized LLM inference.

GPU Hunter takeaway: Quantized model size is not enough; GPU Hunter benchmarks should also check whether the backend has fast unpacking and dequantization kernels.

Serving / 2026 / Intermediate

Architecture-Aware LLM Inference Optimization on AMD Instinct GPUs: A Comprehensive Benchmark and Deployment Study

Athos Georgiou

A production inference benchmark on AMD Instinct MI325X GPUs across large dense, MoE, MLA, and GQA model families using vLLM.

GPU Hunter takeaway: For AMD datacenter GPUs, model architecture, KV offload support, runtime kernels, and block size choices can dominate hardware specs.

Quantization / 2026 / Intermediate

Diagnosing FP4 inference: a layer-wise and block-wise sensitivity analysis of NVFP4 and MXFP4

Musa Cim, Burak Topcu, Mahmut Taylan Kandemir

A component-wise sensitivity study of NVFP4 and MXFP4 quantization across Qwen2.5 model scales and transformer blocks.

GPU Hunter takeaway: Treat FP4 as a hardware capability that still needs model-aware quantization policy, not a universal speed switch.
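
Both formats store elements as 4-bit E2M1 values and differ mainly in how blocks are scaled: MXFP4 groups 32 elements under a power-of-two (E8M0) scale, while NVFP4 groups 16 elements under an FP8 (E4M3) scale plus a per-tensor FP32 scale. The simulation below is a simplified sketch of that difference. It keeps the NVFP4-style scale as a plain Python float, breaks rounding ties toward the smaller magnitude, and uses synthetic `weights`, so its numbers are illustrative and not results from the paper.

```python
import math

E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]   # representable FP4 magnitudes

def to_e2m1(x):
    """Round to the nearest E2M1 value, saturating at 6 (simplified tie rule)."""
    mag = min(E2M1, key=lambda g: abs(abs(x) - g))
    return math.copysign(mag, x)

def fake_quant(block, scale):
    """Quantize and immediately dequantize one block with the given scale."""
    return [to_e2m1(x / scale) * scale for x in block]

def mx_scale(block):
    """Power-of-two block scale in the style of MXFP4."""
    amax = max(abs(x) for x in block)
    return 2.0 ** (math.floor(math.log2(amax)) - 2)   # 2 = largest E2M1 exponent

def nv_scale(block):
    """Real-valued block scale mapping the block maximum to 6 (NVFP4-like)."""
    return max(abs(x) for x in block) / 6.0

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

weights = [math.sin(0.7 * i) * (1 + i % 5) * 0.1 for i in range(32)]  # synthetic
mx = fake_quant(weights, mx_scale(weights))                  # one 32-value block
nv = [q for s in (0, 16)                                     # two 16-value blocks
      for q in fake_quant(weights[s:s + 16], nv_scale(weights[s:s + 16]))]

print("MXFP4-style MSE:", mse(weights, mx))
print("NVFP4-style MSE:", mse(weights, nv))
```

Running the same comparison per layer or per transformer block on real weights is the kind of measurement a sensitivity study formalizes.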

KV Cache / 2026 / Starter

KV Cache Optimization Strategies for Scalable and Efficient LLM Inference

Yichun Xu, Navjot K. Khaira, Tejinder Singh

A survey that organizes KV cache optimization into eviction, compression, hybrid memory, novel attention, and combined strategies.

GPU Hunter takeaway: Use this to decide whether a workload needs cache compression, offload, eviction, or a hybrid policy before buying more VRAM.

KV Cache / 2026 / Advanced

DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity

Jitai Hao, Qiang Huang, Yaowei Wang, Min Zhang, Jun Yu

A residual KV cache compression framework that exploits long-range inter-token similarity and shared latent components.

GPU Hunter takeaway: For agent and long-document workloads, cache compression quality may matter more than raw single-token decode speed.

Serving / 2026 / Intermediate

Harvest: Opportunistic Peer-to-Peer GPU Caching for LLM Inference

Nikhil Gopal, Kostis Kaffes

A peer-to-peer GPU cache management framework that uses high-bandwidth GPU interconnects to reduce host-memory offload latency.

GPU Hunter takeaway: For dual-GPU and workstation setups, interconnect bandwidth and cache placement can change whether extra GPUs actually help.

Local Inference / 2026 / Starter

Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct

Uygar Kurt

A unified empirical evaluation of llama.cpp quantization formats for Llama-3.1-8B-Instruct on commodity hardware.

GPU Hunter takeaway: A GPU recommendation is incomplete without naming the quantization formats that are realistic for that VRAM tier.

Serving / 2026 / Intermediate

Speculative Decoding: Performance or Illusion?

Xiaoxuan Liu, Jiaxiang Yu, Jongseok Park, Ion Stoica, Alvin Cheung

A production-grade vLLM study of speculative decoding variants across workloads, model scales, and batch sizes.

GPU Hunter takeaway: Use speculative decoding carefully: the speedup depends on workload, draft method, batch size, and engine implementation.
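
The dependence on acceptance rate and draft cost can be seen in the standard analytical model from the original speculative decoding work (Leviathan et al., 2023), sketched below. It assumes each drafted token is accepted independently with probability `alpha` and does not model batch size or engine overhead, so it is an idealized upper-level intuition rather than a prediction of what this paper measures. The example values for `alpha`, `gamma`, and `cost_ratio` are arbitrary.

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target-model pass with gamma drafted tokens."""
    if alpha >= 1.0:
        return gamma + 1.0
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Idealized wall-clock speedup over plain decoding.

    cost_ratio is the draft step time divided by the target step time.
    """
    return expected_tokens(alpha, gamma) / (gamma * cost_ratio + 1)

for alpha in (0.5, 0.7, 0.9):
    print(alpha, round(speedup(alpha, gamma=4, cost_ratio=0.1), 2))
```

With a poor draft model (`alpha` near zero) the same formula drops below 1, meaning speculation slows decoding down.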

KV Cache / 2026 / Intermediate

Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs

Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Patrick Woods, Gabriel Hillesheim, Abolfazl Razi

An adaptive KV-cache quantization approach for mobile, embedded, and edge LLM inference where memory bandwidth and cache growth dominate.

GPU Hunter takeaway: Small GPUs, laptops, and edge boxes need cache precision policies that spend memory only where the model is sensitive.

related GPUs

GeForce RTX 5090: 32 GB memory, 1792 GB/s bandwidth
RTX PRO 6000 Blackwell: 96 GB memory, 1792 GB/s bandwidth
Apple M3 Ultra: 512 GB memory, 819 GB/s bandwidth

related buying guides

Best GPUs for Local AI in 2026
RTX PRO 6000 Blackwell vs H100

FAQ

What changed in 2026 LLM inference research?

The strongest 2026 work is more systems-aware: FP4 is evaluated through sensitivity analysis and kernel support, while KV cache papers focus on memory layouts that can actually be deployed.

Which 2026 papers matter most for GPU buyers?

Start with ModeSwitch-LLM, OSCAR, SAW-INT4, the llama.cpp quantization evaluation, and the NF4 dequantization kernel paper.

Source pages are linked directly to arXiv, PDFs, and Hugging Face Papers where available. GPU Hunter summaries are editorial context, not copies of the original abstracts.