KV Cache Optimization Papers for Long-Context LLM Inference

Research papers on KV cache quantization, compression, eviction, offload, reuse, and long-context inference bottlenecks for local and served LLMs.

Updated May 27, 2026 · 13 curated papers · Primary keyword: KV cache optimization papers


01 // editorial context

The KV cache is the hidden memory bill behind long-context inference. Model weights may fit in VRAM, but every token in the context, prompt or generated, adds key and value entries to the cache, and that cache can dominate memory pressure during long chats, coding sessions, and document workflows.
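As a rough illustration of how that bill grows, the sketch below estimates cache size from model shape. The model dimensions (32 layers, 8 KV heads, head dimension 128, FP16 cache) are hypothetical and chosen only to make the arithmetic easy to follow; real runtimes add paging and metadata overhead that this ignores.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens,
                   batch_size=1, bytes_per_element=2.0):
    """Approximate KV cache size: one key and one value vector per layer, head, and token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * context_tokens * batch_size


GIB = 1024 ** 3

# Hypothetical model shape, not taken from any specific model card.
for tokens in (8_192, 32_768, 131_072):
    size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                          context_tokens=tokens)
    print(f"{tokens:>7} tokens: {size / GIB:.1f} GiB")
# Prints 1.0 GiB, 4.0 GiB, and 16.0 GiB for the three context lengths.
```

Under these assumptions each token costs 128 KiB, so the cache scales linearly with both context length and the number of concurrent requests.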

These papers cover the practical toolbox: quantization, compression, rematerialization, eviction, offload, reuse, and cache layers. GPU Hunter uses this research to avoid treating context length as a simple VRAM number.

02 // how this changes GPU choice

# A 24GB card can be surprisingly useful when cache compression or eviction is acceptable, but it will hit context limits sooner than a 32GB or 96GB card.
# High-bandwidth GPUs help when the cache path is memory-bound. Unified memory can help with capacity, but throughput still depends on the memory hierarchy.
# For shared servers, cache reuse and prefix caching can reduce cost more than upgrading to the next GPU tier.
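To make the capacity point concrete, here is a minimal sketch of how much context fits on each card once weights are loaded. Every workload number in it (18 GB of weights, 2 GB of runtime overhead, 128 KiB of FP16 cache per token) is an assumption for illustration, not a measurement.

```python
def max_context_tokens(vram_gb, weights_gb, overhead_gb, kv_bytes_per_token):
    """Tokens of KV cache that fit after weights and runtime overhead (decimal GB)."""
    free_bytes = (vram_gb - weights_gb - overhead_gb) * 1e9
    return max(0, int(free_bytes // kv_bytes_per_token))


# Assumed, not measured: 128 KiB of FP16 cache per token,
# and a quarter of that if the cache is stored in 4 bits.
FP16_PER_TOKEN = 128 * 1024
INT4_PER_TOKEN = FP16_PER_TOKEN // 4

for vram_gb in (24, 32, 96):
    fp16 = max_context_tokens(vram_gb, 18, 2, FP16_PER_TOKEN)
    int4 = max_context_tokens(vram_gb, 18, 2, INT4_PER_TOKEN)
    print(f"{vram_gb} GB card: ~{fp16:,} tokens at FP16, ~{int4:,} tokens at 4-bit")
```

The estimate ignores quantization metadata and fragmentation, so treat the results as upper bounds.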

starter papers

Read these first if you want the fastest path from research to a hardware decision.

KV Cache · 2026 · Starter

Comparative Characterization of KV Cache Management Strategies for LLM Inference

Oteo Mamo, Olga Kogiou, Hyunjin Yi, Weikuan Yu

An empirical comparison of KV cache management frameworks including vLLM, InfiniGen, and H2O across latency, throughput, memory, request rates, and model sizes.

GPU Hunter takeaway: The right KV cache strategy depends on workload shape; no one approach wins across every GPU, context length, and batch size.


KV Cache · 2026 · Starter

KV Cache Optimization Strategies for Scalable and Efficient LLM Inference

Yichun Xu, Navjot K. Khaira, Tejinder Singh

A survey that organizes KV cache optimization into eviction, compression, hybrid memory, novel attention, and combined strategies.

GPU Hunter takeaway: Use this to decide whether a workload needs cache compression, offload, eviction, or a hybrid policy before buying more VRAM.


KV Cache · 2023 · Intermediate

Efficient Streaming Language Models with Attention Sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han

Introduces StreamingLLM, which keeps a few initial attention sink tokens plus a rolling window of recent tokens to support long streaming contexts.

GPU Hunter takeaway: For chat workloads, retaining the right cache entries can beat blindly growing context until VRAM runs out.
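The retention rule described for StreamingLLM can be sketched as a position filter. This is a simplified illustration rather than the authors' implementation: the function name and default sizes are hypothetical, and it only selects which positions stay in the cache, not how positions are encoded for the retained entries.

```python
def streaming_keep_positions(seq_len, num_sinks=4, window=1020):
    """Positions retained: the first `num_sinks` tokens plus the last `window` tokens."""
    if seq_len <= num_sinks + window:
        return list(range(seq_len))  # everything still fits
    sinks = list(range(num_sinks))
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent


# Tiny example: 10 tokens seen so far, 2 sink tokens, window of 3.
print(streaming_keep_positions(10, num_sinks=2, window=3))  # [0, 1, 7, 8, 9]
```

The cache size stays fixed at `num_sinks + window` entries no matter how long the stream runs.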

advanced papers

These papers go deeper into kernels, cache policy, low-bit formats, or serving tradeoffs.

KV Cache · 2026 · Advanced

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

Zhongzhu Zhou, Donglin Zhuang, Jisen Li, Ziyan Chen

A 2-bit KV cache quantization method that derives offline rotations and clipping thresholds from attention-aware covariance structure.

GPU Hunter takeaway: Long-context claims on consumer GPUs will increasingly depend on KV-cache-specific quantization, not just weight quantization.


KV Cache · 2026 · Advanced

SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving

Jinda Jia, Jisen Li, Zhongzhu Zhou, Jung Hwan Heo, Jue Wang, Tri Dao

A practical INT4 KV-cache method designed around paged layouts, regular memory access, and fused attention execution.

GPU Hunter takeaway: For vLLM-style serving, deployability matters as much as compression ratio; the cache format has to fit the kernel path.


KV Cache · 2025 · Advanced

KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference

Xing Li, Zeyu Xing, Yiming Li, Linping Qu, Hui-Ling Zhen, Wulong Liu

A mixed-precision KV-cache quantization framework that searches layer-wise key/value precision pairs under hardware-friendly constraints.

GPU Hunter takeaway: KV cache quantization should be model-aware; uniform low-bit settings can waste quality or memory depending on the layer.
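A small sketch of why layer-wise precision matters for memory: the function below totals cache size for a list of per-layer (key bits, value bits) pairs. The model shape and the specific bit assignments are hypothetical, and the estimate leaves out scale and zero-point metadata.

```python
def mixed_precision_kv_bytes(layer_bits, num_kv_heads, head_dim, context_tokens):
    """layer_bits: one (key_bits, value_bits) pair per layer."""
    elements = num_kv_heads * head_dim * context_tokens  # per tensor, per layer
    total_bits = sum((k_bits + v_bits) * elements for k_bits, v_bits in layer_bits)
    return total_bits / 8


GIB = 1024 ** 3

# Hypothetical 32-layer model: FP16 everywhere versus a mixed plan that keeps
# the first 4 layers at higher precision and pushes the rest lower.
uniform_fp16 = [(16, 16)] * 32
mixed_plan = [(8, 4)] * 4 + [(4, 2)] * 28

for name, plan in (("uniform FP16", uniform_fp16), ("mixed plan", mixed_plan)):
    size = mixed_precision_kv_bytes(plan, num_kv_heads=8, head_dim=128,
                                    context_tokens=32_768)
    print(f"{name}: {size / GIB:.2f} GiB")
```

Which layers tolerate low precision is exactly what a sensitivity search has to determine; the plan above is only a placeholder.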


KV Cache · 2025 · Advanced

XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization

Aditya Tomar, Coleman Hooper, Minjae Lee, Haocheng Xi, Rishabh Tiwari, Wonjun Kang

A cache-rematerialization approach that trades extra computation for lower KV cache memory traffic and storage pressure.

GPU Hunter takeaway: A faster GPU is not always the answer; sometimes spending compute to reduce memory movement is the better trade.
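One way to picture the compute-for-memory trade is to cache each token's layer input once and recompute keys and values from it on demand. The sketch below assumes that design for illustration; the class name and toy weights are hypothetical, and it omits the quantization a practical system would apply to the cached inputs.

```python
def matmul(a, b):
    """Plain-Python matrix product for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class RematerializingCache:
    """Stores one input vector per token instead of separate key and value vectors."""

    def __init__(self, w_k, w_v):
        self.w_k, self.w_v = w_k, w_v
        self.inputs = []

    def append(self, x):
        self.inputs.append(x)

    def keys_values(self):
        # Extra compute on every read: two projections over the cached inputs.
        return matmul(self.inputs, self.w_k), matmul(self.inputs, self.w_v)


# Toy projection weights (hypothetical values).
w_k = [[1.0, 0.0], [0.0, 2.0]]
w_v = [[0.5, 0.5], [1.0, -1.0]]

cache = RematerializingCache(w_k, w_v)
for token_input in ([1.0, 2.0], [3.0, 4.0]):
    cache.append(token_input)

keys, values = cache.keys_values()
stored = sum(len(x) for x in cache.inputs)
print(f"stored {stored} numbers instead of {2 * stored}")
```

The stored-value count is halved only when keys and values are each as wide as the input, as in this toy setup.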


full curated set

The complete paper set for this topic, with source links and GPU Hunter takeaways.

KV Cache · 2026 · Advanced

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

Zhongzhu Zhou, Donglin Zhuang, Jisen Li, Ziyan Chen

A 2-bit KV cache quantization method that derives offline rotations and clipping thresholds from attention-aware covariance structure.

GPU Hunter takeaway: Long-context claims on consumer GPUs will increasingly depend on KV-cache-specific quantization, not just weight quantization.


KV Cache · 2026 · Advanced

SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving

Jinda Jia, Jisen Li, Zhongzhu Zhou, Jung Hwan Heo, Jue Wang, Tri Dao

A practical INT4 KV-cache method designed around paged layouts, regular memory access, and fused attention execution.

GPU Hunter takeaway: For vLLM-style serving, deployability matters as much as compression ratio; the cache format has to fit the kernel path.


KV Cache · 2026 · Starter

Comparative Characterization of KV Cache Management Strategies for LLM Inference

Oteo Mamo, Olga Kogiou, Hyunjin Yi, Weikuan Yu

An empirical comparison of KV cache management frameworks including vLLM, InfiniGen, and H2O across latency, throughput, memory, request rates, and model sizes.

GPU Hunter takeaway: The right KV cache strategy depends on workload shape; no one approach wins across every GPU, context length, and batch size.


KV Cache · 2026 · Starter

KV Cache Optimization Strategies for Scalable and Efficient LLM Inference

Yichun Xu, Navjot K. Khaira, Tejinder Singh

A survey that organizes KV cache optimization into eviction, compression, hybrid memory, novel attention, and combined strategies.

GPU Hunter takeaway: Use this to decide whether a workload needs cache compression, offload, eviction, or a hybrid policy before buying more VRAM.


KV Cache · 2026 · Advanced

DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity

Jitai Hao, Qiang Huang, Yaowei Wang, Min Zhang, Jun Yu

A residual KV cache compression framework that exploits long-range inter-token similarity and shared latent components.

GPU Hunter takeaway: For agent and long-document workloads, cache compression quality may matter more than raw single-token decode speed.


KV Cache · 2025 · Advanced

XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization

Aditya Tomar, Coleman Hooper, Minjae Lee, Haocheng Xi, Rishabh Tiwari, Wonjun Kang

A cache-rematerialization approach that trades extra computation for lower KV cache memory traffic and storage pressure.

GPU Hunter takeaway: A faster GPU is not always the answer; sometimes spending compute to reduce memory movement is the better trade.


KV Cache · 2025 · Advanced

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni

A near-optimal online vector quantization method used for KV-cache compression and inner-product-preserving low-bit representations.

GPU Hunter takeaway: KV cache compression can expand usable context on the same GPU, but implementation quality determines whether the promise reaches local runtimes.


KV Cache · 2025 · Advanced

KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference

Xing Li, Zeyu Xing, Yiming Li, Linping Qu, Hui-Ling Zhen, Wulong Liu

A mixed-precision KV-cache quantization framework that searches layer-wise key/value precision pairs under hardware-friendly constraints.

GPU Hunter takeaway: KV cache quantization should be model-aware; uniform low-bit settings can waste quality or memory depending on the layer.


KV Cache · 2025 · Advanced

KV Cache Transform Coding for Compact Storage in LLM Inference

Konrad Staniszewski, Adrian Lancucki

A lightweight transform coder that compresses reusable KV caches using PCA-style decorrelation, adaptive quantization, and entropy coding.

GPU Hunter takeaway: Persistent KV cache storage can become a product feature for coding agents and local chat apps, not just a runtime optimization.


KV Cache · 2025 · Intermediate

LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference

Yuhan Liu, Yihua Cheng, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng

An open-source KV cache layer for extracting, storing, offloading, transferring, and reusing caches across vLLM and SGLang engines.

GPU Hunter takeaway: For hosted local-AI products, cache reuse and prefill-decode disaggregation can beat simply adding more GPUs.
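Cache reuse across requests usually comes down to recognizing a shared token prefix. The sketch below shows one common pattern, chunked prefix hashing, under stated assumptions: `PrefixCache`, `chunk_keys`, and the chunk size are hypothetical names and values for illustration, not LMCache's API, and the stored strings stand in for real KV tensors.

```python
import hashlib


def chunk_keys(token_ids, chunk_size):
    """Hash each full chunk together with everything before it,
    so a key identifies an entire prefix rather than an isolated chunk."""
    keys, running = [], hashlib.sha256()
    for start in range(0, len(token_ids) - chunk_size + 1, chunk_size):
        chunk = token_ids[start:start + chunk_size]
        running.update(",".join(map(str, chunk)).encode() + b";")
        keys.append(running.hexdigest())
    return keys


class PrefixCache:
    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.store = {}  # prefix hash -> placeholder for that chunk's KV tensors

    def insert(self, token_ids):
        for i, key in enumerate(chunk_keys(token_ids, self.chunk_size)):
            self.store.setdefault(key, f"kv-chunk-{i}")

    def reusable_tokens(self, token_ids):
        """Number of leading tokens whose KV entries are already stored."""
        hits = 0
        for key in chunk_keys(token_ids, self.chunk_size):
            if key not in self.store:
                break
            hits += 1
        return hits * self.chunk_size


cache = PrefixCache(chunk_size=4)
cache.insert([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])  # e.g. a shared system prompt

# A new request that shares the first 8 tokens can skip prefill for them.
print(cache.reusable_tokens([1, 2, 3, 4, 5, 6, 7, 8, 99, 100, 101, 102]))  # 8
```

Only whole chunks are reusable in this sketch, so a smaller chunk size raises the hit rate at the cost of more entries to track.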


KV Cache · 2024 · Advanced

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Amir Gholami

A KV cache quantization approach with per-channel key quantization, pre-RoPE quantization, and non-uniform datatypes.

GPU Hunter takeaway: Context length claims are only credible if the KV cache footprint and decode speed are accounted for.

KV Cache · 2024 · Advanced

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Zirui Liu, Jiayi Yuan, Hongye Jin, Xia Hu

A 2-bit KV cache quantization method that uses different quantization layouts for the two tensors: the key cache is quantized per channel and the value cache per token.

GPU Hunter takeaway: KV cache precision should be part of any serious local inference benchmark at long context.
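The effect of the quantization layout can be seen in a small experiment. This is a simplified sketch, not the paper's implementation: it uses plain min/max asymmetric quantization, skips grouping and any full-precision residual, and the key matrix with one large-magnitude channel is made up for illustration.

```python
def quantize_group(values, bits=2):
    """Asymmetric min/max quantization of one group; returns (codes, scale, zero_point)."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels
    if scale == 0:
        scale = 1.0  # constant group: every code is 0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, zero_point):
    return [c * scale + zero_point for c in codes]


def quantize_per_channel(rows, bits=2):
    """Each group is one channel (column) across all tokens."""
    return [quantize_group(col, bits) for col in zip(*rows)]


def quantize_per_token(rows, bits=2):
    """Each group is one token (row) across all channels."""
    return [quantize_group(row, bits) for row in rows]


def max_error(groups, quantized):
    worst = 0.0
    for group, (codes, scale, zero_point) in zip(groups, quantized):
        restored = dequantize_group(codes, scale, zero_point)
        worst = max(worst, max(abs(a - b) for a, b in zip(group, restored)))
    return worst


# Made-up keys: 4 tokens x 4 channels, with the last channel much larger than the rest.
keys = [
    [0.1, -0.2, 0.3, 8.0],
    [0.2, 0.1, -0.1, 9.0],
    [0.0, 0.3, 0.2, 10.0],
    [0.3, -0.1, 0.0, 7.0],
]

per_channel_err = max_error(list(zip(*keys)), quantize_per_channel(keys))
per_token_err = max_error(keys, quantize_per_token(keys))
print(f"per-channel max error: {per_channel_err:.3f}")
print(f"per-token max error:   {per_token_err:.3f}")
```

With per-token grouping the outlier channel stretches every row's range, so the small channels lose almost all resolution; per-channel grouping isolates the outlier.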

KV Cache · 2023 · Advanced

H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Beidi Chen

A KV cache eviction policy that keeps recent tokens and heavy hitters that contribute most to attention.

GPU Hunter takeaway: Eviction policy can be a better answer than buying more VRAM for every long-context workload.
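The eviction rule described for H2O can be sketched as a selection over accumulated attention scores. This is an illustrative simplification with hypothetical names and made-up scores; a real implementation updates the scores every decode step and works per attention head.

```python
def h2o_keep_positions(accumulated_attention, budget, recent):
    """Keep the `recent` newest positions plus the highest-scoring older ones, up to `budget`."""
    n = len(accumulated_attention)
    if n <= budget:
        return list(range(n))  # nothing to evict yet
    recent = min(recent, budget)
    recent_positions = list(range(n - recent, n))
    older = range(n - recent)
    heavy = sorted(older, key=lambda i: accumulated_attention[i], reverse=True)
    return sorted(heavy[: budget - recent]) + recent_positions


# Made-up accumulated attention per cached position.
scores = [0.9, 0.1, 0.5, 0.05, 0.2, 0.3]
print(h2o_keep_positions(scores, budget=4, recent=2))  # [0, 2, 4, 5]
```

Positions 0 and 2 survive as heavy hitters, positions 4 and 5 survive as the recent window, and the rest are evicted.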

related GPUs

GeForce RTX 3090: 24GB, 936 GB/s
GeForce RTX 4090: 24GB, 1008 GB/s
RTX PRO 6000 Blackwell: 96GB, 1792 GB/s
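The bandwidth figures listed here can be turned into a rough floor on decode latency when the workload is memory-bound. The sketch assumes each decode step reads the weights and the whole cache once from VRAM; the weight and cache sizes are hypothetical, and real kernels will not reach peak bandwidth.

```python
def min_decode_ms_per_token(bytes_read_per_token, bandwidth_gb_s):
    """Lower bound on per-token decode time if every byte is read once from VRAM."""
    return bytes_read_per_token / (bandwidth_gb_s * 1e9) * 1e3


# Bandwidth in GB/s as listed for each card; workload sizes are assumptions.
GPUS = {
    "GeForce RTX 3090": 936,
    "GeForce RTX 4090": 1008,
    "RTX PRO 6000 Blackwell": 1792,
}
weights_bytes = 8e9             # hypothetical quantized model
kv_cache_bytes = 4 * 1024 ** 3  # hypothetical cache for one long request

for name, bandwidth in GPUS.items():
    ms = min_decode_ms_per_token(weights_bytes + kv_cache_bytes, bandwidth)
    print(f"{name}: at least {ms:.1f} ms per token")
```

Shrinking the cache through quantization or eviction lowers the bytes read per token, which is why it can help speed as well as capacity.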

related buying guides

Best GPUs for Local AI in 2026
RTX PRO 6000 Blackwell vs H100

FAQ

Why does KV cache matter for GPU choice?

The KV cache grows with context length and with the number of active requests. It can consume the VRAM saved by weight quantization, so long-context workloads need cache-aware hardware and runtimes.

Is more VRAM always the best answer for long context?

More VRAM helps, but the papers show that cache quantization, eviction, offload, and reuse can change the best hardware choice for a workload.

Source pages are linked directly to arXiv, PDFs, and Hugging Face Papers where available. GPU Hunter summaries are editorial context, not copies of the original abstracts.