2025 LLM Inference Research Papers

Curated 2025 LLM inference papers covering KV cache compression, Apple Silicon profiling, constrained GPUs, serving systems, and memory bottlenecks.

Updated May 27, 2026 | 12 papers

why this year matters

The 2025 papers in GPU Hunter's library frame many of the practical questions behind 2026 hardware decisions: when cache management matters, how Apple Silicon behaves, and why memory bandwidth limits large-batch inference.

Use this page to understand the bridge from foundational serving work to the newest 2026 optimization papers.

related clusters

local AI inference papers | GPU inference optimization papers | LLM serving systems papers

curated papers

KV Cache | arXiv:2511.01815

KV Cache Transform Coding for Compact Storage in LLM Inference

Persistent KV cache storage can become a product feature for coding agents and local chat apps, not just a runtime optimization.

Serving | arXiv:2510.18672

Reasoning Language Model Inference Serving Unveiled: An Empirical Study

Quantization and speculative decoding can help reasoning workloads, but prefix caching and KV quantization may not always pay off.

KV Cache | arXiv:2510.09665

LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference

For hosted local-AI products, cache reuse and prefill-decode disaggregation can beat simply adding more GPUs.

Quantization | arXiv:2509.23202

Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization

RTX 5090 and Blackwell FP4 claims should be judged against real FP4 kernels and accuracy, not just advertised tensor formats.

KV Cache | arXiv:2508.10395

XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization

A faster GPU is not always the answer; sometimes spending compute to reduce memory movement is the better trade.

Local Inference | arXiv:2508.08531

Profiling Large Language Model Inference on Apple Silicon: A Quantization Perspective

Mac recommendations should compare quantized throughput, memory pressure, and unified-memory behavior against CUDA GPUs.

Local Inference | arXiv:2506.20187

Breaking the Boundaries of Long-Context LLM Inference: Adaptive KV Management on a Single Commodity GPU

A single desktop GPU can handle longer contexts when the runtime manages GPU, CPU, and disk tiers deliberately.

Local Inference | arXiv:2506.03296

Parallel CPU-GPU Execution for LLM Inference on Constrained GPUs

Offload is only useful when CPU work, GPU kernels, and transfers overlap; otherwise it just makes a barely fitting model slow.
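
To make the overlap point concrete, here is a minimal back-of-the-envelope sketch. It is not the paper's scheduler: the function names and the millisecond figures are illustrative assumptions, and the overlapped case is an idealized lower bound that assumes the three stages run fully in parallel.

```python
def serial_step_ms(cpu_ms, gpu_ms, transfer_ms):
    """Step time when CPU work, GPU kernels, and transfers run one after another."""
    return cpu_ms + gpu_ms + transfer_ms


def overlapped_step_ms(cpu_ms, gpu_ms, transfer_ms):
    """Idealized step time when all three stages overlap perfectly.

    The slowest stage sets the pace; real runtimes land somewhere between
    this bound and the serial figure.
    """
    return max(cpu_ms, gpu_ms, transfer_ms)


# Hypothetical per-step costs for a model that barely fits in VRAM.
cpu_ms, gpu_ms, transfer_ms = 3.0, 5.0, 4.0
print("serial:    ", serial_step_ms(cpu_ms, gpu_ms, transfer_ms), "ms")
print("overlapped:", overlapped_step_ms(cpu_ms, gpu_ms, transfer_ms), "ms")
```

The gap between the two numbers is the most that overlap can recover; if a measured step time sits near the serial figure, offload is costing more than it saves.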

KV Cache | arXiv:2504.19874

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

KV cache compression can expand usable context on the same GPU, but implementation quality determines whether the promise reaches local runtimes.
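
As a rough illustration of why lower-bit KV storage extends context, the sketch below applies the standard KV cache sizing arithmetic (one key and one value vector per layer per token). It does not implement TurboQuant; the model shape and the 8 GiB budget are assumptions chosen for the example, and the estimate ignores quantization metadata such as scales and codebooks.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bits):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2 = one key vector plus one value vector per layer.
    return 2 * layers * kv_heads * head_dim * bits / 8


def max_context_tokens(budget_bytes, layers, kv_heads, head_dim, bits):
    """How many tokens fit in a fixed KV cache budget at a given bit width."""
    return int(budget_bytes // kv_bytes_per_token(layers, kv_heads, head_dim, bits))


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
budget = 8 * 2**30  # assume 8 GiB of VRAM left over for the KV cache
for bits in (16, 8, 4):
    tokens = max_context_tokens(budget, 32, 8, 128, bits)
    print(f"{bits:>2}-bit KV cache: about {tokens} tokens")
```

Under these assumptions, moving from 16-bit to 4-bit storage quadruples the token count; whether a runtime delivers that depends on kernel support and on the accuracy cost of the chosen scheme.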

Serving | arXiv:2504.19720

Taming the Titans: A Survey of Efficient LLM Inference Serving

Use this as a 2025 baseline for the serving stack before comparing vLLM, SGLang, TensorRT-LLM, and local runtimes.

Kernels | arXiv:2503.08311

Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference

Batch-size scaling should be benchmarked against memory bandwidth behavior, not inferred from TFLOPS alone.
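
One way to see why TFLOPS alone can mislead is a simple roofline-style estimate of a single decode step. This is a simplified sketch, not the paper's methodology: it assumes weights are read once per step, each sequence's KV cache is read once, and compute is roughly 2 FLOPs per parameter per sequence. All hardware and model numbers are hypothetical.

```python
def decode_step_times(batch, params, bytes_per_param, kv_bytes_per_seq,
                      peak_flops, bandwidth):
    """Return (compute_seconds, memory_seconds) for one decode step."""
    flops = 2.0 * params * batch  # about 2 FLOPs per weight per sequence
    # Weights are shared across the batch; KV cache reads grow with it.
    bytes_moved = params * bytes_per_param + batch * kv_bytes_per_seq
    return flops / peak_flops, bytes_moved / bandwidth


# Hypothetical setup: 8B parameters in FP16, 1 GB of KV cache per sequence,
# 100 TFLOPS peak compute, 1 TB/s memory bandwidth.
for batch in (1, 16, 64):
    compute_s, memory_s = decode_step_times(
        batch, 8e9, 2, 1e9, peak_flops=100e12, bandwidth=1e12)
    bound = "memory" if memory_s > compute_s else "compute"
    print(f"batch={batch:>3}: compute={compute_s * 1e3:.2f} ms, "
          f"memory={memory_s * 1e3:.2f} ms, {bound}-bound")
```

In this toy model, growing the batch amortizes the weight reads but not the per-sequence KV cache reads, so long-context batches can stay memory-bound even on a GPU with ample compute. Measured bandwidth behavior should override any such estimate.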

KV Cache | arXiv:2502.04420

KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference

KV cache quantization should be model-aware; uniform low-bit settings can waste quality or memory depending on the layer.
