2026 LLM Inference Research Papers

Curated 2026 LLM inference research across local AI, FP4 quantization, KV cache optimization, kernels, AMD serving, and single-GPU systems.

Updated May 27, 2026 (15 papers)

why this year matters

This page groups the 2026 papers in the GPU Hunter research library. The recurring theme is deployability: cache formats, kernel paths, adaptive runtime decisions, and hardware-aware quantization.

Use this page when you want a chronological view before moving into the topic clusters.

related clusters

2026 LLM inference papers | KV cache optimization papers | LLM quantization papers

curated papers

Local Inference | arXiv:2605.23057

ModeSwitch-LLM: A Lightweight Phase-Aware Controller for Cross-Mode LLM Inference on a Single GPU

A desktop GPU can serve different prompts better when the runtime switches strategy per request instead of locking into one mode.

KV Cache | arXiv:2605.17757

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

Long-context claims on consumer GPUs will increasingly depend on KV-cache-specific quantization, not just weight quantization.

KV Cache | arXiv:2604.19157

SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving

For vLLM-style serving, deployability matters as much as compression ratio; the cache format has to fit the kernel path.

KV Cache | arXiv:2604.05012

Comparative Characterization of KV Cache Management Strategies for LLM Inference

The right KV cache strategy depends on workload shape; no one approach wins across every GPU, context length, and batch size.

KV Cache | arXiv:2604.04722

Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs

Small GPUs, laptops, and edge boxes need cache precision policies that spend memory only where the model is sensitive.

Kernels | arXiv:2604.02556

Fast NF4 Dequantization Kernels for Large Language Model Inference

Quantized model size is not enough; GPU Hunter benchmarks should care whether the backend has fast unpacking and dequant kernels.

KV Cache | arXiv:2603.20397

KV Cache Optimization Strategies for Scalable and Efficient LLM Inference

Use this to decide whether a workload needs cache compression, offload, eviction, or a hybrid policy before buying more VRAM.
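
As a rough first pass at that decision, KV cache size can be estimated from the model shape before measuring anything. The sketch below is illustrative and not taken from the paper: the function name `kv_cache_bytes` is hypothetical, the shape values assume a Llama-3.1-8B-style model (32 layers, 8 KV heads, head dimension 128), and quantization scales and allocator overhead are ignored.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bits_per_value=16):
    """Bytes needed to hold keys and values for every layer.

    Ignores per-block quantization scales, paging overhead, and
    fragmentation, so real usage will be somewhat higher.
    """
    # Factor of 2: one tensor for keys, one for values.
    n_values = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size
    return n_values * bits_per_value // 8


# Assumed shape for a Llama-3.1-8B-style model (grouped-query attention).
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for bits in (16, 4, 2):
    gib = kv_cache_bytes(seq_len=131072, bits_per_value=bits, **shape) / 2**30
    print(f"{bits:>2}-bit KV cache at 131072 tokens: {gib:.1f} GiB")
```

Under these assumptions a single 131072-token sequence needs 16 GiB of cache at 16 bits, 4 GiB at 4 bits, and 2 GiB at 2 bits, which is why cache precision can matter more than weight precision at long context.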

Serving | arXiv:2603.10031

Architecture-Aware LLM Inference Optimization on AMD Instinct GPUs: A Comprehensive Benchmark and Deployment Study

For AMD datacenter GPUs, model architecture, KV offload support, runtime kernels, and block size choices can dominate hardware specs.

Quantization | arXiv:2603.08747

Diagnosing FP4 inference: a layer-wise and block-wise sensitivity analysis of NVFP4 and MXFP4

Treat FP4 as a hardware capability that still needs model-aware quantization policy, not a universal speed switch.

KV Cache | arXiv:2602.08005

DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity

For agent and long-document workloads, cache compression quality may matter more than raw single-token decode speed.

Serving | arXiv:2602.00328

Harvest: Opportunistic Peer-to-Peer GPU Caching for LLM Inference

For dual-GPU and workstation setups, interconnect bandwidth and cache placement can change whether extra GPUs actually help.

Local Inference | arXiv:2601.14277

Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct

A GPU recommendation is incomplete without naming the quantization formats that are realistic for that VRAM tier.
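
To make the VRAM-tier point concrete, weight memory can be approximated as parameter count times bits per weight. This is a minimal sketch rather than the paper's evaluation: `weight_gib` and the table name are hypothetical, the parameter count of roughly 8.03 billion is an assumption, and only formats with a simple fixed block layout are listed (Q8_0 stores 32 int8 values plus one 16-bit scale per block, Q4_0 stores 32 4-bit values plus one 16-bit scale). Real GGUF files can mix tensor types, so treat the results as estimates.

```python
# Effective bits per weight, including the per-block fp16 scale.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": (32 * 8 + 16) / 32,   # 8.5 bits per weight
    "Q4_0": (32 * 4 + 16) / 32,   # 4.5 bits per weight
}


def weight_gib(n_params, fmt):
    """Approximate weight storage in GiB for a uniform quantization format."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 2**30


N_PARAMS = 8.03e9  # assumed parameter count for Llama-3.1-8B-Instruct

for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:>5}: {weight_gib(N_PARAMS, fmt):.1f} GiB of weights")
```

The figure covers weights only; KV cache and activation memory come on top, so a format that barely fits the card is not realistic for that tier.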

Serving | arXiv:2601.11580

Speculative Decoding: Performance or Illusion?

Use speculative decoding carefully: the speedup depends on workload, draft method, batch size, and engine implementation.
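
One way to see why the speedup is not guaranteed is the standard idealized model of draft-and-verify decoding, in which every drafted token is accepted independently with the same probability. The sketch below uses that textbook model, not the paper's methodology; `expected_speedup` and its parameter names are hypothetical, and batch size and engine overhead are deliberately left out.

```python
def expected_speedup(alpha, gamma, cost_ratio):
    """Idealized speedup of speculative decoding over plain decoding.

    alpha:      probability that a drafted token is accepted (0 <= alpha <= 1)
    gamma:      number of tokens drafted per verification step
    cost_ratio: cost of one draft forward pass relative to one target pass
    """
    if alpha >= 1.0:
        tokens_per_step = gamma + 1
    else:
        # Expected tokens produced per verification step, counting the
        # one token the target model always contributes.
        tokens_per_step = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    # Each step costs gamma draft passes plus one target pass.
    return tokens_per_step / (gamma * cost_ratio + 1)


for alpha in (0.0, 0.5, 0.8):
    s = expected_speedup(alpha, gamma=4, cost_ratio=0.1)
    print(f"acceptance {alpha:.1f}: {s:.2f}x")
```

With a poorly matched draft model (acceptance near zero) the ratio falls below 1, meaning speculation is slower than ordinary decoding. Large batches tend to erode the benefit further, which this simple model does not capture.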

Quantization | arXiv:2601.07475

ARCQuant: Boosting NVFP4 Quantization with Augmented Residual Channels for LLMs

Blackwell-class FP4 gains will depend on quantization methods designed around NVFP4's real block and precision rules.

Kernels | arXiv:2601.00227

FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems

Kernel maturity is a hardware feature in practice; the same GPU can behave very differently across inference backends.
