KV Cache Optimization Papers for Long-Context LLM Inference

Research papers on KV cache quantization, compression, eviction, offload, reuse, and long-context inference bottlenecks for local and served LLMs.

Updated May 27, 2026 · 13 curated papers · Primary keyword: KV cache optimization papers
01  //  editorial context

The KV cache is the hidden memory bill behind long-context inference. Model weights may fit in VRAM, but every token in the context, prompt or generated, adds key and value tensors for each layer, so the cache grows linearly with context length and can dominate memory pressure during long chats, coding sessions, and document workflows.
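
To make the memory bill concrete, here is a minimal sizing sketch. It assumes a standard transformer decoder that stores one key and one value vector per layer, per KV head, per token. The model shape, the 16 GiB weight footprint, and the helper names `kv_cache_bytes` and `max_context_tokens` are illustrative assumptions, not figures from any paper listed here.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens,
                   batch_size=1, bytes_per_element=2):
    """Approximate bytes needed to cache keys and values for n_tokens of context."""
    # Leading 2: each layer stores one key tensor and one value tensor.
    return (2 * n_layers * n_kv_heads * head_dim
            * n_tokens * batch_size * bytes_per_element)


def max_context_tokens(vram_bytes, weight_bytes, per_token_bytes):
    """Tokens of cache that fit after weights; ignores activations and runtime overhead."""
    free_bytes = max(0, vram_bytes - weight_bytes)
    return int(free_bytes // per_token_bytes)


GIB = 1024 ** 3

# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
per_token = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=1)
print(per_token)                                  # 131072 bytes (128 KiB) per token at 16-bit
print(kv_cache_bytes(32, 8, 128, 32_768) / GIB)   # 4.0 GiB for a 32K-token context
print(max_context_tokens(24 * GIB, 16 * GIB, per_token))  # 65536 tokens on a 24 GiB card
```

Passing `bytes_per_element=0.5` approximates a 4-bit cache and quarters the per-token cost, although real quantized formats also store scales and zero points that this estimate leaves out.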

These papers cover the practical toolbox: quantization, compression, rematerialization, eviction, offload, reuse, and cache layers. GPU Hunter uses this research to avoid treating context length as a simple VRAM number.

02  //  how this changes GPU choice
  • A 24GB card can be surprisingly useful when cache compression or eviction is acceptable, but it will hit context limits sooner than a 32GB or 96GB card.
  • High-bandwidth GPUs help when the cache path is memory-bound. Unified memory can help with capacity, but throughput still depends on the memory hierarchy.
  • For shared servers, cache reuse and prefix caching can reduce cost more than upgrading to the next GPU tier.
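
The prefix-caching point can be sketched as a lookup keyed by chained hashes of token blocks: two requests that share a system prompt map to the same leading blocks, so their KV entries only need to be computed once. This is a toy illustration under stated assumptions, not the design of any particular serving runtime. `PrefixCache`, `BLOCK_SIZE`, and the `"kv-block"` placeholder are hypothetical names.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cache block; hypothetical and kept small for the demo


class PrefixCache:
    """Maps chained hashes of token blocks to stand-ins for stored KV tensors."""

    def __init__(self):
        self._blocks = {}

    @staticmethod
    def _block_keys(tokens):
        # Each key hashes the block and everything before it, so a block
        # only matches when the entire preceding prefix matches too.
        keys = []
        running = hashlib.sha256()
        full_len = len(tokens) - len(tokens) % BLOCK_SIZE
        for start in range(0, full_len, BLOCK_SIZE):
            block = tokens[start:start + BLOCK_SIZE]
            running.update(",".join(map(str, block)).encode() + b";")
            keys.append(running.hexdigest())
        return keys

    def insert(self, tokens):
        for key in self._block_keys(tokens):
            self._blocks.setdefault(key, "kv-block")  # placeholder for real tensors

    def lookup(self, tokens):
        """Return how many leading tokens already have cached KV entries."""
        hit = 0
        for key in self._block_keys(tokens):
            if key not in self._blocks:
                break
            hit += BLOCK_SIZE
        return hit


system_prompt = list(range(8))               # shared 8-token prefix
request_a = system_prompt + [100, 101, 102, 103]
request_b = system_prompt + [200, 201, 202, 203, 204]

cache = PrefixCache()
cache.insert(request_a)
print(cache.lookup(request_b))  # 8: the shared system prompt is reused
```

On a shared server, every reused block is prefill compute and memory traffic that does not have to be repeated, which is where the cost saving comes from.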

starter papers

Read these first if you want the fastest path from research to a hardware decision.

KV Cache · 2023 · Intermediate

Efficient Streaming Language Models with Attention Sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han

Introduces StreamingLLM, which keeps attention sink tokens plus a rolling window to support long streaming contexts.

GPU Hunter takeaway: For chat workloads, retaining the right cache entries can beat blindly growing context until VRAM runs out.
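
The retention rule described above can be sketched as an index selection: keep the first few positions as attention sinks, plus the most recent window. This is a simplified illustration only. The function name and default sizes are assumptions, and it leaves out how positions are handled inside the trimmed cache.

```python
def streaming_keep_indices(cache_len, n_sink=4, window=1020):
    """Cache positions to retain: the first n_sink tokens plus the latest `window` tokens."""
    if cache_len <= n_sink + window:
        return list(range(cache_len))          # everything still fits
    sinks = list(range(n_sink))                # initial "attention sink" tokens
    recent = list(range(cache_len - window, cache_len))
    return sinks + recent


print(streaming_keep_indices(10, n_sink=2, window=3))  # [0, 1, 7, 8, 9]
```

With this rule the cache size is capped at `n_sink + window` entries no matter how long the stream runs, which is why the memory footprint stays flat.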

arXiv PDF HF

advanced papers

These papers go deeper into kernels, cache policy, low-bit formats, or serving tradeoffs.

KV Cache · 2025 · Advanced

KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference

Xing Li, Zeyu Xing, Yiming Li, Linping Qu, Hui-Ling Zhen, Wulong Liu

A mixed-precision KV-cache quantization framework that searches layer-wise key/value precision pairs under hardware-friendly constraints.

GPU Hunter takeaway: KV cache quantization should be model-aware; uniform low-bit settings can waste quality or memory depending on the layer.
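
To see why layer-wise precision pairs matter for memory, the sketch below totals the cache footprint for a list of per-layer (key bits, value bits) pairs. It does not implement the paper's search. The four-layer configuration, the model shape, and the function name are hypothetical, and quantization metadata such as scales is ignored.

```python
def mixed_precision_kv_bytes(layer_bits, n_kv_heads, head_dim, n_tokens):
    """Cache bytes when each layer has its own (key_bits, value_bits) pair."""
    elements_per_tensor = n_kv_heads * head_dim * n_tokens
    total_bits = sum((key_bits + value_bits) * elements_per_tensor
                     for key_bits, value_bits in layer_bits)
    return total_bits // 8


# Hypothetical 4-layer configuration: more bits for the outer layers.
config = [(8, 4), (4, 2), (4, 2), (8, 4)]
uniform_fp16 = [(16, 16)] * 4

mixed = mixed_precision_kv_bytes(config, n_kv_heads=8, head_dim=128, n_tokens=1024)
baseline = mixed_precision_kv_bytes(uniform_fp16, n_kv_heads=8, head_dim=128, n_tokens=1024)
print(mixed, baseline)  # 4718592 16777216
```

In this made-up configuration the cache averages 4.5 bits per element, roughly a 3.6x reduction over 16-bit storage, while still spending extra bits on the layers assumed to be sensitive.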

Summary arXiv PDF HF

full curated set

The complete paper set for this topic, with source links and GPU Hunter takeaways.

KV Cache · 2025 · Advanced

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni

A near-optimal online vector quantization method used for KV-cache compression and inner-product-preserving low-bit representations.

GPU Hunter takeaway: KV cache compression can expand usable context on the same GPU, but implementation quality determines whether the promise reaches local runtimes.

Summary arXiv PDF HF

KV Cache · 2025 · Advanced

KV Cache Transform Coding for Compact Storage in LLM Inference

Konrad Staniszewski, Adrian Lancucki

A lightweight transform coder that compresses reusable KV caches using PCA-style decorrelation, adaptive quantization, and entropy coding.

GPU Hunter takeaway: Persistent KV cache storage can become a product feature for coding agents and local chat apps, not just a runtime optimization.

Summary arXiv PDF HF

KV Cache · 2024 · Advanced

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Amir Gholami

A KV cache quantization approach with per-channel key quantization, pre-RoPE quantization, and non-uniform datatypes.

GPU Hunter takeaway: Context length claims are only credible if the KV cache footprint and decode speed are accounted for.

arXiv PDF HF

KV Cache · 2024 · Advanced

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Zirui Liu, Jiayi Yuan, Hongye Jin, Xia Hu

A 2-bit KV cache quantization method using different quantization layouts for key and value cache tensors.

GPU Hunter takeaway: KV cache precision should be part of any serious local inference benchmark at long context.
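
The idea of using different layouts for keys and values can be illustrated with a toy 2-bit asymmetric quantizer applied along two different axes of a tokens-by-channels matrix. This is a pure-Python sketch under simplifying assumptions, not the paper's kernel. All function names and the sample matrix are made up, and details such as grouping or any full-precision residual are omitted.

```python
def quantize_group(values, bits=2):
    """Asymmetric uniform quantization of one group: returns (codes, scale, zero_point)."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, zero_point):
    return [c * scale + zero_point for c in codes]


def roundtrip_per_token(matrix, bits=2):
    # One scale and zero point per row (token).
    return [dequantize_group(*quantize_group(row, bits)) for row in matrix]


def roundtrip_per_channel(matrix, bits=2):
    # One scale and zero point per column (channel), then transpose back.
    columns = [dequantize_group(*quantize_group(list(col), bits)) for col in zip(*matrix)]
    return [list(row) for row in zip(*columns)]


def max_abs_error(original, restored):
    return max(abs(a - b) for row_a, row_b in zip(original, restored)
               for a, b in zip(row_a, row_b))


# Made-up key matrix (tokens x channels) where channel 2 has a much larger magnitude.
keys = [
    [0.1, 0.4, 8.0, 0.2],
    [0.3, 0.1, 8.3, 0.4],
    [0.2, 0.3, 7.9, 0.1],
    [0.4, 0.2, 8.2, 0.3],
]

err_token = max_abs_error(keys, roundtrip_per_token(keys))
err_channel = max_abs_error(keys, roundtrip_per_channel(keys))
print(err_channel < err_token)  # True: the outlier channel no longer stretches every row's range
```

When one channel carries outliers, grouping along the channel axis keeps that channel's range from inflating the quantization step for everything else, which is the intuition behind choosing the layout per tensor.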

arXiv PDF HF

KV Cache · 2023 · Advanced

H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Beidi Chen

A KV cache eviction policy that keeps recent tokens and heavy hitters that contribute most to attention.

GPU Hunter takeaway: Eviction policy can be a better answer than buying more VRAM for every long-context workload.
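
A minimal sketch of the eviction rule described above: given a fixed cache budget, keep the most recent tokens and fill the rest with the older tokens that have received the most attention so far. It is a one-shot selection for illustration, not the paper's step-by-step procedure. The function name, the score values, and the budget split are assumptions.

```python
def h2o_keep_indices(accumulated_scores, budget, n_recent):
    """Pick cache positions to keep: recent tokens plus the top-scoring older tokens.

    accumulated_scores[i] is the total attention mass token i has received so far.
    """
    if not 0 <= n_recent <= budget:
        raise ValueError("n_recent must be between 0 and budget")
    n = len(accumulated_scores)
    if n <= budget:
        return list(range(n))                      # nothing to evict yet
    recent = list(range(n - n_recent, n))
    older = range(n - n_recent)
    n_heavy = budget - n_recent
    heavy = sorted(older, key=lambda i: accumulated_scores[i], reverse=True)[:n_heavy]
    return sorted(heavy) + recent


# Made-up accumulated attention scores for an 8-token cache.
scores = [5.0, 0.1, 3.2, 0.2, 0.1, 0.4, 0.3, 0.2]
print(h2o_keep_indices(scores, budget=5, n_recent=3))  # [0, 2, 5, 6, 7]
```

Tokens 0 and 2 survive because they carry most of the accumulated attention, while low-scoring older tokens are dropped to stay inside the budget.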

arXiv PDF HF

FAQ

Why does KV cache matter for GPU choice?

KV cache grows with context length and active requests. It can consume the VRAM saved by weight quantization, so long-context workloads need cache-aware hardware and runtimes.

Is more VRAM always the best answer for long context?

More VRAM helps, but the papers show that cache quantization, eviction, offload, and reuse can change the best hardware choice for a workload.

Source pages are linked directly to arXiv, PDFs, and Hugging Face Papers where available. GPU Hunter summaries are editorial context, not copies of the original abstracts.