2026 LLM Inference Papers: Quantization, KV Cache & GPU Systems

Fresh 2026 LLM inference papers on FP4, KV cache quantization, local inference controllers, GPU kernels, AMD serving, and long-context systems.

Updated May 27, 2026  //  13 curated papers  //  Primary keyword: 2026 LLM inference papers
01  //  editorial context

The 2026 inference literature is shifting from generic model compression toward deployment-specific optimization. The strongest papers focus on KV cache formats, phase-aware local inference, FP4 sensitivity, and the kernels that make quantization useful in practice.

GPU Hunter uses this page as the freshness layer for visitors who want current arXiv work before choosing a GPU, comparing runtimes, or deciding whether a new hardware feature is worth paying for.

02  //  how this changes GPU choice
  • Blackwell FP4 needs model-aware quantization and backend support; read the FP4 papers before overvaluing the spec.
  • Long-context local inference depends increasingly on KV cache quantization and management rather than weight quantization alone.
  • Serving and local single-GPU workloads are converging around adaptive runtime decisions.
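
To make the long-context point concrete, here is a rough back-of-the-envelope sketch of how KV cache size scales with context length and cache precision. The model dimensions (32 layers, 8 KV heads, head dimension 128) are hypothetical values chosen for illustration, not taken from any paper in this set, and the estimate ignores the per-block scales that real quantized cache formats also store.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bits_per_value, batch_size=1):
    """Approximate KV cache size in bytes.

    Ignores quantization scales/zero-points and any runtime padding,
    so real usage will be somewhat higher for low-bit formats.
    """
    # Factor of 2: one key tensor and one value tensor per layer.
    values = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size
    return values * bits_per_value // 8


# Hypothetical model shape, for illustration only.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=131072)

for bits in (16, 8, 4):
    gib = kv_cache_bytes(bits_per_value=bits, **shape) / 2**30
    print(f"{bits:>2}-bit KV cache at 131072 tokens: {gib:.1f} GiB")
```

Under these assumed dimensions, the cache alone takes 16 GiB at 16-bit precision and 4 GiB at 4-bit. That is why cache format can matter as much as weight format when sizing VRAM for long contexts.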
starter papers

Read these first if you want the fastest path from research to a hardware decision.

advanced papers

These papers go deeper into kernels, cache policy, low-bit formats, or serving tradeoffs.

full curated set

The complete paper set for this topic, with source links and GPU Hunter takeaways.

Serving  //  2026  //  Intermediate

Speculative Decoding: Performance or Illusion?

Xiaoxuan Liu, Jiaxiang Yu, Jongseok Park, Ion Stoica, Alvin Cheung

A production-grade vLLM study of speculative decoding variants across workloads, model scales, and batch sizes.

GPU Hunter takeaway: Use speculative decoding carefully: the speedup depends on workload, draft method, batch size, and engine implementation.
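
One way to see why the speedup is so conditional is the simplified acceptance model commonly used in the speculative decoding literature. The sketch below is not the method from this paper. It assumes each draft token is accepted independently with probability `alpha`, that `gamma` draft tokens are proposed per verification step, and that one draft step costs `cost_ratio` times one target step. All three parameter names are illustrative, and batch-size and engine effects are not modeled.

```python
def expected_speedup(alpha, gamma, cost_ratio):
    """Idealized speculative decoding speedup over plain decoding.

    alpha:      assumed per-token acceptance probability, in [0, 1]
    gamma:      number of draft tokens proposed per verification step
    cost_ratio: time of one draft step / time of one target step
    """
    if alpha >= 1.0:
        # Every draft token accepted, plus one token from the target.
        tokens_per_step = gamma + 1
    else:
        # Expected tokens produced per verification step
        # (geometric series in alpha).
        tokens_per_step = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    # Cost of gamma draft steps plus one target verification,
    # measured in units of one target step.
    relative_step_cost = gamma * cost_ratio + 1
    return tokens_per_step / relative_step_cost


# A cheap draft with high acceptance vs. a costly draft with low acceptance.
print(f"{expected_speedup(alpha=0.8, gamma=4, cost_ratio=0.05):.2f}x")
print(f"{expected_speedup(alpha=0.3, gamma=4, cost_ratio=0.30):.2f}x")
```

With these assumed inputs the first case comes out near 2.8x, while the second falls below 1x, meaning speculation is slower than plain decoding. Measured results depend on the actual workload and engine.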

FAQ

What changed in 2026 LLM inference research?

The strongest 2026 work is more systems-aware: FP4 is evaluated by sensitivity and kernels, while KV cache papers focus on deployable memory layouts.

Which 2026 papers matter most for GPU buyers?

Start with ModeSwitch-LLM, OSCAR, SAW-INT4, the llama.cpp quantization evaluation, and the NF4 dequantization kernel paper.

Source pages are linked directly to arXiv, PDFs, and Hugging Face Papers where available. GPU Hunter summaries are editorial context, not copies of the original abstracts.