LLM Serving Systems Papers for vLLM, PagedAttention and Scheduling

Curated LLM serving systems papers on vLLM, PagedAttention, speculative decoding, batching, cache reuse, scheduling, and GPU server design.

Updated May 27, 2026 · 10 curated papers · Primary keyword: LLM serving systems papers
01  //  editorial context

Serving systems research matters once one GPU becomes a small shared service, internal tool, or inference endpoint. At that point, batching, scheduling, cache reuse, and memory placement can matter more than single-user tokens per second.

This cluster collects the papers behind vLLM/PagedAttention, speculative decoding, phase-aware controllers, AMD serving, and multi-GPU cache placement.

02  //  how this changes GPU choice
  • For shared workloads, compare GPUs on throughput, VRAM, bandwidth, and runtime support. Single-stream benchmarks are not enough.
  • PagedAttention and cache layers can increase useful serving capacity without changing the model.
  • Speculative decoding should be tested against your workload because realistic batching can erase expected gains.
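
To make the capacity point concrete, here is a back-of-the-envelope sketch of how KV cache size limits the number of concurrent sequences. The model shape and the 8 GiB of free VRAM are hypothetical placeholders, not figures from any paper in this cluster, and the arithmetic ignores activation memory and allocator overhead.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies: one K and one V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_concurrent_sequences(free_vram_bytes, tokens_per_seq, per_token_bytes):
    """How many sequences of a given length fit in the VRAM left for the cache."""
    return free_vram_bytes // (tokens_per_seq * per_token_bytes)


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, FP16 cache.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

# Assumption: 8 GiB of VRAM remains for the cache after loading weights.
free_vram = 8 * 1024**3

print(per_token)                                             # 131072 bytes (128 KiB)
print(max_concurrent_sequences(free_vram, 4096, per_token))  # 16
```

Under these assumptions, sixteen 4096-token sequences fill the cache, which is why VRAM and cache management belong next to throughput when comparing GPUs for shared use.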
starter papers

Read these first if you want the fastest path from research to a hardware decision.

Serving · 2025 · Starter

Taming the Titans: A Survey of Efficient LLM Inference Serving

Ranran Zhen, Juntao Li, Yixin Ji, Zhenlin Yang, Tong Liu, Qingrong Xia

A broad survey of efficient LLM serving methods covering memory overhead, attention costs, batching, quantization, and system design.

GPU Hunter takeaway: Use this as a 2025 baseline for the serving stack before comparing vLLM, SGLang, TensorRT-LLM, and local runtimes.

Summary arXiv PDF HF
Serving · 2023 · Starter

Efficient Memory Management for Large Language Model Serving with PagedAttention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica

The vLLM paper introducing PagedAttention, a virtual-memory-style approach for managing dynamic KV cache memory.

GPU Hunter takeaway: If you are running many concurrent chats, memory management can matter as much as raw tok/s.

arXiv PDF HF
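
The virtual-memory analogy can be sketched as a block table: each sequence maps to a list of fixed-size physical blocks, and a new block is allocated only when the last one fills up. The toy class below is an illustration of that idea only. `PagedKVCache`, its method names, and the block sizes are hypothetical and are not vLLM's API; a real engine stores key/value tensors in the blocks rather than counters.

```python
class PagedKVCache:
    """Toy block-table allocator for KV cache pages (illustrative only)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve space for one more token, allocating a block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:  # no block yet, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; caller must queue or preempt")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(17):
    cache.append_token("chat-a")  # 17 tokens need 2 blocks
for _ in range(5):
    cache.append_token("chat-b")  # 5 tokens need 1 block

print(len(cache.block_tables["chat-a"]), len(cache.block_tables["chat-b"]))  # 2 1
print(len(cache.free_blocks))  # 1
```

Because memory is claimed one block at a time, a short chat does not reserve space for its maximum possible length, which is where the extra concurrency comes from.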
Serving · 2026 · Intermediate

Speculative Decoding: Performance or Illusion?

Xiaoxuan Liu, Jiaxiang Yu, Jongseok Park, Ion Stoica, Alvin Cheung

A production-grade vLLM study of speculative decoding variants across workloads, model scales, and batch sizes.

GPU Hunter takeaway: Use speculative decoding carefully: the speedup depends on workload, draft method, batch size, and engine implementation.

Summary arXiv PDF HF
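
One way to see why the speedup is workload-dependent is the standard analytical model from the speculative decoding literature, which assumes each drafted token is accepted independently with a fixed probability. The sketch below is not the methodology of the study above. The `verify_cost` parameter is an added assumption that stands in for verification becoming more expensive once large batches make the target model compute-bound, and all numbers are made up.

```python
def expected_tokens_per_step(accept_rate, draft_len):
    """Expected tokens emitted per verify step, assuming i.i.d. acceptance."""
    if accept_rate >= 1.0:
        return draft_len + 1
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)


def estimated_speedup(accept_rate, draft_len, draft_cost, verify_cost=1.0):
    """Speedup over plain decoding.

    draft_cost:  cost of one draft-model step, relative to one target decode step.
    verify_cost: cost of one verify step, relative to one target decode step
                 (assumption: about 1.0 when memory-bound, higher when compute-bound).
    """
    tokens = expected_tokens_per_step(accept_rate, draft_len)
    return tokens / (draft_len * draft_cost + verify_cost)


# Hypothetical numbers: 80% acceptance, 4 drafted tokens, cheap draft model.
small_batch = estimated_speedup(0.8, 4, draft_cost=0.05, verify_cost=1.0)
large_batch = estimated_speedup(0.8, 4, draft_cost=0.05, verify_cost=4.0)

print(f"{small_batch:.2f}")  # 2.80
print(f"{large_batch:.2f}")  # 0.80
```

With identical acceptance behavior, the estimate flips from a gain to a slowdown once verification is no longer nearly free, which is why measuring under your own batch sizes matters.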
advanced papers

These papers go deeper into kernels, cache policy, low-bit formats, or serving tradeoffs.

full curated set

The rest of the paper set for this topic, beyond the three starter papers above, with source links and GPU Hunter takeaways.

Serving · 2025 · Intermediate

Reasoning Language Model Inference Serving Unveiled: An Empirical Study

Qi Li, Junpan Wu, Xiang Liu, Yuxin Wang, Zeyu Li, Zhenheng Tang

An empirical study of reasoning model serving behavior, including memory fluctuations, stragglers, adaptive runtime, and optimization tradeoffs.

GPU Hunter takeaway: Quantization and speculative decoding can help reasoning workloads, but prefix caching and KV quantization may not always pay off.

Summary arXiv PDF HF
Serving · 2023 · Intermediate

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee

A scheduling approach that chunks prefills and piggybacks decode requests to improve GPU utilization.

GPU Hunter takeaway: Prompt length and batching policy can change throughput even when GPU, model, and quantization stay fixed.

arXiv PDF HF
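
The chunk-and-piggyback idea can be sketched as a per-iteration token budget: running decodes each take one slot, and whatever is left goes to a slice of a waiting prompt. This is a simplified illustration of the scheduling idea as summarized above, not the paper's algorithm. `Request`, `build_batch`, and the budget of 256 are hypothetical, and decode requests here never finish.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    prompt_left: int  # prompt tokens not yet prefilled; 0 means decoding


def build_batch(requests, token_budget):
    """Return [(name, tokens)] for one iteration: decodes first, then prefill chunks."""
    batch = []
    budget = token_budget
    # Each decode-phase request needs exactly one token slot.
    for req in requests:
        if req.prompt_left == 0 and budget > 0:
            batch.append((req.name, 1))
            budget -= 1
    # Spend the remaining budget on chunks of pending prompts.
    for req in requests:
        if req.prompt_left > 0 and budget > 0:
            chunk = min(req.prompt_left, budget)
            req.prompt_left -= chunk
            batch.append((req.name, chunk))
            budget -= chunk
    return batch


# Two chats are mid-generation when a 1000-token prompt arrives.
requests = [Request("a", 0), Request("b", 0), Request("c", 1000)]
iterations = 0
while any(r.prompt_left > 0 for r in requests):
    batch = build_batch(requests, token_budget=256)
    iterations += 1
    print(batch)

print(iterations)  # 4
```

The long prompt is absorbed over four iterations while both decodes keep producing a token every iteration, instead of stalling behind one large prefill.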
Serving · 2024 · Advanced

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee

A serving system that improves tail latency and throughput with chunked prefills and stall-free scheduling.

GPU Hunter takeaway: Relevant when moving from personal local inference to a small shared GPU server.

arXiv PDF HF
FAQ

What is the best first LLM serving paper?

Read the survey "Taming the Titans: A Survey of Efficient LLM Inference Serving" first, then the vLLM PagedAttention paper. The survey gives you the map of the field, and the vLLM paper gives you the canonical memory-management system.

Should a home server use serving-system papers?

Yes, if more than one user, agent, or process shares the GPU. Scheduling and cache reuse can matter even on a single workstation.

Source pages are linked directly to arXiv, PDFs, and Hugging Face Papers where available. GPU Hunter summaries are editorial context, not copies of the original abstracts.