2025 LLM Inference Research Papers

Curated 2025 LLM inference papers covering KV cache compression, Apple Silicon profiling, constrained GPUs, serving systems, and memory bottlenecks.

Updated May 27, 2026 · 12 papers
why this year matters

The 2025 papers in GPU Hunter's library frame many of the practical questions behind 2026 hardware decisions: when cache management matters, how Apple Silicon behaves under inference workloads, and why memory bandwidth limits large-batch inference.

Use this page to understand the bridge from foundational serving work to the newest 2026 optimization papers.

curated papers