Serving | 2025 | Starter
Ranran Zhen, Juntao Li, Yixin Ji, Zhenlin Yang, Tong Liu, Qingrong Xia
A broad survey of efficient LLM serving methods covering memory overhead, attention costs, batching, quantization, and system design.
GPU Hunter takeaway: Use this as a 2025 baseline for the serving stack before comparing vLLM, SGLang, TensorRT-LLM, and local runtimes.
Serving | 2023 | Starter
Efficient Memory Management for Large Language Model Serving with PagedAttention
Woosuk Kwon, Zhuohan Li, Ying Sheng, Ion Stoica
The vLLM paper introducing PagedAttention, a virtual-memory-style approach for managing dynamic KV cache memory.
GPU Hunter takeaway: If you are running many concurrent chats, KV cache memory management can matter as much as raw tokens per second.
Serving | 2026 | Intermediate
Xiaoxuan Liu, Jiaxiang Yu, Jongseok Park, Ion Stoica, Alvin Cheung
A production-grade vLLM study of speculative decoding variants across workloads, model scales, and batch sizes.
GPU Hunter takeaway: Use speculative decoding carefully: the speedup depends on workload, draft method, batch size, and engine implementation.
Local Inference | 2026 | Intermediate
Aman Sunesh, Ali Alshehhi, Hivansh Dhakne
A single-GPU controller that routes each request across FP16, quantized, speculative, prefix-cached, and batched inference modes using cheap workload features.
GPU Hunter takeaway: A desktop GPU can serve different prompts better when the runtime switches strategy per request instead of locking into one mode.
Serving | 2026 | Intermediate
Athos Georgiou
A production inference benchmark on AMD Instinct MI325X GPUs across large dense, MoE, MLA, and GQA model families using vLLM.
GPU Hunter takeaway: For AMD datacenter GPUs, model architecture, KV offload support, runtime kernels, and block size choices can matter more than headline hardware specs.
Serving | 2026 | Intermediate
Nikhil Gopal, Kostis Kaffes
A peer-to-peer GPU cache management framework that uses high-bandwidth GPU interconnects to reduce host-memory offload latency.
GPU Hunter takeaway: For dual-GPU and workstation setups, interconnect bandwidth and cache placement can change whether extra GPUs actually help.
KV Cache | 2025 | Intermediate
Yuhan Liu, Yihua Cheng, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng
An open-source KV cache layer for extracting, storing, offloading, transferring, and reusing caches across vLLM and SGLang engines.
GPU Hunter takeaway: For hosted local-AI products, cache reuse and prefill-decode disaggregation can beat simply adding more GPUs.
Serving | 2025 | Intermediate
Qi Li, Junpan Wu, Xiang Liu, Yuxin Wang, Zeyu Li, Zhenheng Tang
An empirical study of reasoning model serving behavior, including memory fluctuations, stragglers, adaptive runtime, and optimization tradeoffs.
GPU Hunter takeaway: Quantization and speculative decoding can help reasoning workloads, but prefix caching and KV quantization may not always pay off.
Serving | 2023 | Intermediate
SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
Amey Agrawal, Ashish Panwar, Jayashree Mohan, Ramachandran Ramjee
A scheduling approach that chunks prefills and piggybacks decode requests to improve GPU utilization.
GPU Hunter takeaway: Prompt length and batching policy can change throughput even when GPU, model, and quantization stay fixed.
Serving | 2024 | Advanced
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
Amey Agrawal, Nitin Kedia, Ashish Panwar, Ramachandran Ramjee
A serving system that improves tail latency and throughput with chunked prefills and stall-free scheduling.
GPU Hunter takeaway: Relevant when moving from personal local inference to a small shared GPU server.
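The chunked-prefill idea behind the two Sarathi entries above can be illustrated with a toy scheduler. This is a minimal sketch under stated assumptions, not the papers' implementation: the `Request` class, the `build_batch` function, and the `token_budget` parameter are hypothetical names invented for illustration. Each iteration first gives every running request one decode token, then spends the leftover budget on a chunk of the oldest waiting prompt, so a long prompt never stalls ongoing decodes for a whole iteration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record used only for this sketch."""
    rid: str
    prompt_tokens: int   # total prompt length that must be prefilled
    prefilled: int = 0   # prompt tokens processed so far


def build_batch(decoding, waiting, token_budget):
    """Plan one iteration as a list of (rid, kind, n_tokens) entries.

    decoding: list of requests whose prompts are fully prefilled
    waiting:  deque of requests that still need (more) prefill
    token_budget: assumed per-iteration cap on total tokens processed
    """
    batch = []
    budget = token_budget

    # Decodes go first: each running request contributes exactly one token.
    for req in decoding:
        if budget == 0:
            break
        batch.append((req.rid, "decode", 1))
        budget -= 1

    # Piggyback prefill chunks onto whatever budget is left.
    while budget > 0 and waiting:
        req = waiting[0]
        chunk = min(budget, req.prompt_tokens - req.prefilled)
        batch.append((req.rid, "prefill", chunk))
        req.prefilled += chunk
        budget -= chunk
        if req.prefilled == req.prompt_tokens:
            # Prompt finished: the request starts decoding next iteration.
            decoding.append(waiting.popleft())

    return batch


if __name__ == "__main__":
    running = [Request("a", 10, 10), Request("b", 8, 8)]
    queue = deque([Request("c", 1000)])
    for step in range(3):
        print(step, build_batch(running, queue, token_budget=512))
```

With a budget of 512 tokens, the 1000-token prompt for request `c` is split across two iterations while `a` and `b` keep decoding in both; in the third iteration all three requests decode. A real engine would also track KV cache capacity and per-request output limits, which this sketch leaves out.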