KV Cache | 2026 | Advanced
Zhongzhu Zhou, Donglin Zhuang, Jisen Li, Ziyan Chen
A 2-bit KV cache quantization method that derives offline rotations and clipping thresholds from attention-aware covariance structure.
GPU Hunter takeaway: Long-context claims on consumer GPUs will increasingly depend on KV-cache-specific quantization, not just weight quantization.
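The rotation idea in this entry can be illustrated with a minimal sketch. It assumes a fixed 4-point Hadamard matrix as a stand-in for the covariance-derived rotations the entry describes, which are not reproduced here; `H4` and `rotate` are hypothetical names. An orthogonal rotation spreads one outlier channel across all channels, so a low-bit quantizer sees a smaller dynamic range.

```python
# Illustrative only: a fixed Hadamard rotation, NOT the covariance-derived
# rotations or clipping thresholds described in the entry above.
H4 = [
    [1, 1, 1, 1],
    [1, -1, 1, -1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
]


def rotate(vec):
    """Multiply a length-4 vector by the orthonormal matrix H4 / 2."""
    return [sum(h * x for h, x in zip(row, vec)) / 2 for row in H4]


key = [10.0, 0.1, -0.2, 0.3]   # toy key vector with one outlier channel
rotated = rotate(key)          # outlier energy is spread over all channels
restored = rotate(rotated)     # H4 / 2 is symmetric, so it is its own inverse

peak_before = max(abs(x) for x in key)
peak_after = max(abs(x) for x in rotated)
```

In this toy case the largest magnitude falls after rotation, and applying the same rotation again recovers the original vector up to floating-point error.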
KV Cache | 2026 | Advanced
Jinda Jia, Jisen Li, Zhongzhu Zhou, Jung Hwan Heo, Jue Wang, Tri Dao
A practical INT4 KV-cache method designed around paged layouts, regular memory access, and fused attention execution.
GPU Hunter takeaway: For vLLM-style serving, deployability matters as much as compression ratio; the cache format has to fit the kernel path.
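To make "INT4" concrete, here is a minimal sketch of storing two 4-bit codes per byte. The nibble order and the names `pack_int4` and `unpack_int4` are assumptions for illustration; the entry does not specify the method's actual page layout or kernel format.

```python
def pack_int4(values):
    """Pack unsigned 4-bit codes (0..15) two per byte, low nibble first."""
    if len(values) % 2:
        raise ValueError("expected an even number of 4-bit codes")
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("codes must be in 0..15")
    return bytes(lo | (hi << 4) for lo, hi in zip(values[0::2], values[1::2]))


def unpack_int4(packed):
    """Inverse of pack_int4: recover the list of 4-bit codes."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)  # low nibble was stored first
        out.append(byte >> 4)    # then the high nibble
    return out


codes = [3, 15, 0, 9]
packed = pack_int4(codes)        # 4 codes fit in 2 bytes
```

Packing halves the byte count relative to one code per byte. A real serving kernel would also need to store scales and keep the packed data aligned to its page size.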
KV Cache | 2026 | Starter
Oteo Mamo, Olga Kogiou, Hyunjin Yi, Weikuan Yu
An empirical comparison of KV cache management frameworks including vLLM, InfiniGen, and H2O across latency, throughput, memory, request rates, and model sizes.
GPU Hunter takeaway: The right KV cache strategy depends on workload shape; no single approach wins across every GPU, context length, and batch size.
KV Cache | 2026 | Starter
Yichun Xu, Navjot K. Khaira, Tejinder Singh
A survey that organizes KV cache optimization into eviction, compression, hybrid memory, novel attention, and combined strategies.
GPU Hunter takeaway: Use this to decide whether a workload needs cache compression, offload, eviction, or a hybrid policy before buying more VRAM.
KV Cache | 2026 | Advanced
Jitai Hao, Qiang Huang, Yaowei Wang, Min Zhang, Jun Yu
A residual KV cache compression framework that exploits long-range inter-token similarity and shared latent components.
GPU Hunter takeaway: For agent and long-document workloads, cache compression quality may matter more than raw single-token decode speed.
KV Cache | 2025 | Advanced
Aditya Tomar, Coleman Hooper, Minjae Lee, Haocheng Xi, Rishabh Tiwari, Wonjun Kang
A cache-rematerialization approach that trades extra computation for lower KV cache memory traffic and storage pressure.
GPU Hunter takeaway: A faster GPU is not always the answer; sometimes spending compute to reduce memory movement is the better trade.
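The compute-for-memory trade can be sketched in a few lines. This toy example assumes the cached object is the per-token layer input `x`, from which keys and values are recomputed on demand; the entry does not say what the method actually stores, and `W_K`, `W_V`, `matvec`, and `rematerialize` are hypothetical names.

```python
def matvec(x, w):
    """Row vector x (length d) times matrix w (d x k) -> list of length k."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


# Toy projection weights (d = 3 inputs, k = 2 outputs each).
W_K = [[1.0, 0.0], [0.5, -1.0], [0.0, 2.0]]
W_V = [[0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
tokens = [[1.0, 2.0, 3.0], [0.0, -1.0, 0.5]]

# Conventional cache: store K and V for every token.
kv_cache = [(matvec(x, W_K), matvec(x, W_V)) for x in tokens]

# Rematerialization: store only x and pay two extra projections per read.
x_cache = [list(x) for x in tokens]


def rematerialize(t):
    x = x_cache[t]
    return matvec(x, W_K), matvec(x, W_V)


stored_kv = sum(len(k) + len(v) for k, v in kv_cache)  # numbers held in the K/V cache
stored_x = sum(len(x) for x in x_cache)                # numbers held in the x cache
```

Whether this saves memory depends on the dimensions: here `x` has 3 entries per token against 4 for K plus V, and the price is recomputing both projections every time a token is read.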
KV Cache | 2025 | Advanced
Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni
A near-optimal online vector quantization method used for KV-cache compression and inner-product-preserving low-bit representations.
GPU Hunter takeaway: KV cache compression can expand usable context on the same GPU, but implementation quality determines whether the promise reaches local runtimes.
KV Cache | 2025 | Advanced
Xing Li, Zeyu Xing, Yiming Li, Linping Qu, Hui-Ling Zhen, Wulong Liu
A mixed-precision KV-cache quantization framework that searches layer-wise key/value precision pairs under hardware-friendly constraints.
GPU Hunter takeaway: KV cache quantization should be model-aware; uniform low-bit settings can waste quality or memory depending on the layer.
KV Cache | 2025 | Advanced
Konrad Staniszewski, Adrian Lancucki
A lightweight transform coder that compresses reusable KV caches using PCA-style decorrelation, adaptive quantization, and entropy coding.
GPU Hunter takeaway: Persistent KV cache storage can become a product feature for coding agents and local chat apps, not just a runtime optimization.
KV Cache | 2025 | Intermediate
Yuhan Liu, Yihua Cheng, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng
An open-source KV cache layer for extracting, storing, offloading, transferring, and reusing caches across vLLM and SGLang engines.
GPU Hunter takeaway: For hosted local-AI products, cache reuse and prefill-decode disaggregation can beat simply adding more GPUs.
KV Cache | 2024 | Advanced
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Amir Gholami
A KV cache quantization approach with per-channel key quantization, pre-RoPE quantization, and non-uniform datatypes.
GPU Hunter takeaway: Context length claims are only credible if the KV cache footprint and decode speed are accounted for.
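Accounting for the footprint is simple arithmetic. The sketch below uses the standard size estimate (two tensors, keys and values, per layer) and ignores the overhead of quantization scales and zero points. The model shape is an assumed example, not taken from the paper, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bits_per_value, batch_size=1):
    """Approximate KV cache size in bytes, ignoring scale/zero-point overhead."""
    # Factor 2 covers keys and values.
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bits_per_value // 8


GIB = 1024 ** 3

# Assumed example shape: 32 layers, 32 KV heads of dimension 128, 4096 tokens.
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)

fp16_gib = kv_cache_bytes(**shape, bits_per_value=16) / GIB  # 2.0 GiB
int4_gib = kv_cache_bytes(**shape, bits_per_value=4) / GIB   # 0.5 GiB
```

The estimate scales linearly with `seq_len`, so the same assumed shape at 128K tokens needs 32 times as much memory. That scaling is why cache precision dominates long-context budgets.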
KV Cache | 2024 | Advanced
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
Zirui Liu, Jiayi Yuan, Hongye Jin, Xia Hu
A 2-bit KV cache quantization method that quantizes the key cache per channel and the value cache per token.
GPU Hunter takeaway: KV cache precision should be part of any serious local inference benchmark at long context.
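A minimal sketch of why the grouping axis matters, assuming the split the KIVI paper describes (keys grouped per channel, values per token). The data is a made-up 4-token by 4-channel key matrix with one outlier channel. The helper names are hypothetical, and this is a plain quantize-then-dequantize simulation, not the authors' kernels.

```python
def quantize_group(values, bits=2):
    """Asymmetric uniform quantize-then-dequantize for one group of floats."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    if hi == lo:
        return list(values)          # constant group: nothing to quantize
    scale = (hi - lo) / levels
    return [round((v - lo) / scale) * scale + lo for v in values]


def per_token(matrix, bits=2):
    """One quantization group per row (token)."""
    return [quantize_group(row, bits) for row in matrix]


def per_channel(matrix, bits=2):
    """One quantization group per column (channel)."""
    cols = [quantize_group(list(col), bits) for col in zip(*matrix)]
    return [list(row) for row in zip(*cols)]


def mean_abs_error(a, b):
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


# Toy keys: rows are tokens, columns are channels; channel 2 is an outlier.
keys = [
    [0.1, -0.2, 10.0, 0.3],
    [0.0, 0.2, 9.0, -0.1],
    [-0.3, 0.1, 11.0, 0.2],
    [0.2, -0.1, 10.5, 0.0],
]

err_token = mean_abs_error(keys, per_token(keys))
err_channel = mean_abs_error(keys, per_channel(keys))
```

With per-token groups the outlier channel stretches every row's range, so the small entries collapse onto one level. Per-channel groups isolate the outlier, and the error on this toy matrix is lower.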
KV Cache | 2023 | Advanced
H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Beidi Chen
A KV cache eviction policy that keeps recent tokens and heavy hitters that contribute most to attention.
GPU Hunter takeaway: For some long-context workloads, a good eviction policy is a better answer than buying more VRAM.
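The policy described above (keep recent tokens plus heavy hitters) can be sketched as a selection over accumulated attention scores. This is a simplified illustration, not the authors' implementation: `select_kept_tokens`, `budget`, and `recent` are hypothetical names, and the scores are assumed to be per-token attention mass already summed over previous decoding steps.

```python
def select_kept_tokens(acc_scores, budget, recent):
    """Return sorted indices of cache entries to keep.

    acc_scores: accumulated attention score per cached token (assumed given).
    budget:     total number of entries the cache may hold.
    recent:     how many of the newest tokens are always kept.
    """
    if recent > budget:
        raise ValueError("recent window cannot exceed the cache budget")
    n = len(acc_scores)
    if n <= budget:
        return list(range(n))                    # nothing to evict yet
    recent_idx = set(range(n - recent, n))       # always keep the newest tokens
    older = range(n - recent)
    # Fill the remaining slots with the highest-scoring older tokens.
    heavy = sorted(older, key=lambda i: acc_scores[i], reverse=True)[:budget - recent]
    return sorted(recent_idx | set(heavy))


scores = [5.0, 0.1, 3.0, 0.2, 0.3, 0.1, 0.4, 0.2]
kept = select_kept_tokens(scores, budget=4, recent=2)  # [0, 2, 6, 7]
```

Tokens 6 and 7 survive because they are recent, and tokens 0 and 2 survive because they have the highest accumulated scores among the older entries.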