LLM Serving Systems Papers for vLLM, PagedAttention and Scheduling

Curated LLM serving systems papers on vLLM, PagedAttention, speculative decoding, batching, cache reuse, scheduling, and GPU server design.

Updated May 27, 2026 | 10 curated papers

01 // editorial context

Serving systems research matters once a single GPU becomes a small shared service, internal tool, or inference endpoint. At that point, batching, scheduling, cache reuse, and memory placement can matter more than single-user tokens per second.

This cluster collects the papers behind vLLM/PagedAttention, speculative decoding, phase-aware controllers, AMD serving, and multi-GPU cache placement.

02 // how this changes GPU choice

# For shared workloads, compare GPUs on throughput, VRAM, bandwidth, and runtime support. Single-stream benchmarks are not enough.
# PagedAttention and cache layers can increase useful serving capacity without changing the model.
# Speculative decoding should be tested against your workload because realistic batching can erase expected gains.
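To make the VRAM point concrete, here is a rough KV cache sizing sketch. The model shape (32 layers, 8 KV heads, head dimension 128, FP16 cache) and the 8 GiB of free VRAM are assumed illustrative values, not figures from any paper in this list.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Keys and values are both cached, hence the leading factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(free_vram_gib: float, bytes_per_token: int) -> int:
    """How many tokens of KV cache fit in the VRAM left after weights."""
    return int(free_vram_gib * 1024**3) // bytes_per_token


# Assumed example shape: 32 layers, 8 KV heads (GQA), head_dim 128, FP16.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token)                          # 131072 bytes, i.e. 128 KiB per token
print(max_cached_tokens(8.0, per_token))  # 65536 tokens shared by all requests
```

That token count is shared across every concurrent request, which is why VRAM left over after the weights tends to set the ceiling on concurrency.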

starter papers

Read these first if you want the fastest path from research to a hardware decision.

Serving | 2025 | Starter

Taming the Titans: A Survey of Efficient LLM Inference Serving

Ranran Zhen, Juntao Li, Yixin Ji, Zhenlin Yang, Tong Liu, Qingrong Xia

A broad survey of efficient LLM serving methods covering memory overhead, attention costs, batching, quantization, and system design.

GPU Hunter takeaway: Use this as a 2025 baseline for the serving stack before comparing vLLM, SGLang, TensorRT-LLM, and local runtimes.

Serving | 2023 | Starter

Efficient Memory Management for Large Language Model Serving with PagedAttention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica

The vLLM paper introducing PagedAttention, a virtual-memory-style approach for managing dynamic KV cache memory.

GPU Hunter takeaway: If you are running many concurrent chats, memory management can matter as much as raw tok/s.
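The virtual-memory analogy can be sketched as a toy block table. This is a minimal illustration of the idea, not vLLM's implementation: the class name `PagedKVCache`, its methods, and the block size of 16 tokens are all assumptions made for the example.

```python
class PagedKVCache:
    """Toy block-table allocator: sequences own lists of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))    # physical block ids
        self.block_tables: dict[str, list[int]] = {}  # seq id -> block ids
        self.seq_lens: dict[str, int] = {}            # seq id -> tokens stored

    def append_token(self, seq_id: str) -> bool:
        """Reserve room for one more token; False means the cache is full."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # A new block is needed only when the last one is exactly full.
        if length % self.block_size == 0:
            if not self.free_blocks:
                return False
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return True

    def free(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(17):                  # 17 tokens need two 16-token blocks
    cache.append_token("chat-a")
print(len(cache.block_tables["chat-a"]), len(cache.free_blocks))  # 2 2
```

Because blocks are allocated on demand, a sequence wastes at most one partly filled block instead of a reservation sized for its maximum possible length.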

Serving | 2026 | Intermediate

Speculative Decoding: Performance or Illusion?

Xiaoxuan Liu, Jiaxiang Yu, Jongseok Park, Ion Stoica, Alvin Cheung

A production-grade vLLM study of speculative decoding variants across workloads, model scales, and batch sizes.

GPU Hunter takeaway: Use speculative decoding carefully: the speedup depends on workload, draft method, batch size, and engine implementation.
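A back-of-envelope model helps show why the speedup is so workload-dependent. The sketch below uses the standard expected-acceptance formula from the original speculative decoding analysis (Leviathan et al.), not anything taken from this study. It assumes each drafted token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per step, and that one draft step costs `cost_ratio` times a target step. All three names and the sample values are illustrative.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification pass."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Speedup over plain decoding under the idealized cost model."""
    # One step costs gamma draft passes plus one target pass.
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1)


print(round(expected_speedup(alpha=0.8, gamma=4, cost_ratio=0.05), 2))  # 2.8
print(round(expected_speedup(alpha=0.3, gamma=4, cost_ratio=0.30), 2))  # 0.65
```

The second case is a slowdown: a low acceptance rate combined with a relatively expensive draft model costs more than it saves. The model also treats verifying several tokens as no more expensive than one target pass, an assumption that tends to weaken as batch size grows and decoding becomes compute-bound.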

advanced papers

These papers go deeper into kernels, cache policy, low-bit formats, or serving tradeoffs.

Local Inference | 2026 | Intermediate

ModeSwitch-LLM: A Lightweight Phase-Aware Controller for Cross-Mode LLM Inference on a Single GPU

Aman Sunesh, Ali Alshehhi, Hivansh Dhakne

A single-GPU controller that routes each request across FP16, quantized, speculative, prefix-cached, and batched inference modes using cheap workload features.

GPU Hunter takeaway: A desktop GPU can serve different prompts better when the runtime switches strategy per request instead of locking into one mode.
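Per-request routing of this kind can be pictured as a small rule-based function. The sketch below is purely hypothetical: the feature names, thresholds, and rule order are invented for illustration and are not taken from the ModeSwitch-LLM paper.

```python
def choose_mode(prompt_tokens: int, expected_output_tokens: int,
                shared_prefix_tokens: int, queue_depth: int) -> str:
    """Pick an inference mode from cheap request features (hypothetical rules)."""
    if queue_depth >= 4:
        return "batched"          # many waiting requests: favor throughput
    if prompt_tokens > 0 and shared_prefix_tokens * 2 >= prompt_tokens:
        return "prefix_cached"    # at least half the prompt was seen before
    if expected_output_tokens >= 256:
        return "speculative"      # long generations amortize drafting
    if prompt_tokens >= 2048:
        return "quantized"        # long prompts: trade precision for memory
    return "fp16"                 # default path


print(choose_mode(prompt_tokens=1200, expected_output_tokens=64,
                  shared_prefix_tokens=900, queue_depth=1))  # prefix_cached
```

A real controller would need thresholds measured on the target GPU and workload. The point here is only that the inputs are cheap to compute before the request runs.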

Serving | 2026 | Intermediate

Architecture-Aware LLM Inference Optimization on AMD Instinct GPUs: A Comprehensive Benchmark and Deployment Study

Athos Georgiou

A production inference benchmark on AMD Instinct MI325X GPUs across large dense, MoE, MLA, and GQA model families using vLLM.

GPU Hunter takeaway: For AMD datacenter GPUs, model architecture, KV offload support, runtime kernels, and block size choices can dominate hardware specs.

Serving | 2026 | Intermediate

Harvest: Opportunistic Peer-to-Peer GPU Caching for LLM Inference

Nikhil Gopal, Kostis Kaffes

A peer-to-peer GPU cache management framework that uses high-bandwidth GPU interconnects to reduce host-memory offload latency.

GPU Hunter takeaway: For dual-GPU and workstation setups, interconnect bandwidth and cache placement can change whether extra GPUs actually help.
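The interconnect argument comes down to simple transfer-time arithmetic. In the sketch below, both bandwidth figures and the 1 GB cache size are assumed placeholder values rather than measurements from the paper. Substitute the numbers for your own host link and GPU-to-GPU link.

```python
def transfer_ms(num_bytes: float, bandwidth_gb_s: float) -> float:
    """Idealized time to move a KV cache segment, ignoring latency and overhead."""
    return num_bytes / (bandwidth_gb_s * 1e9) * 1e3


kv_bytes = 1e9              # assumed: 1 GB of offloaded KV cache
host_link_gb_s = 25.0       # assumed effective host-memory bandwidth
peer_link_gb_s = 100.0      # assumed effective GPU-to-GPU bandwidth

print(round(transfer_ms(kv_bytes, host_link_gb_s), 1))  # 40.0 ms from host RAM
print(round(transfer_ms(kv_bytes, peer_link_gb_s), 1))  # 10.0 ms from a peer GPU
```

If the two links have similar effective bandwidth, as they might when both GPUs share the same host bus, the peer cache offers little over host memory. That is the case where a second GPU may not help.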

full curated set

The rest of the curated set, with source links and GPU Hunter takeaways. The six starter and advanced papers above are also part of this set.

KV Cache | 2025 | Intermediate

LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference

Yuhan Liu, Yihua Cheng, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng

An open-source KV cache layer for extracting, storing, offloading, transferring, and reusing caches across vLLM and SGLang engines.

GPU Hunter takeaway: For hosted local-AI products, cache reuse and prefill-decode disaggregation can beat simply adding more GPUs.
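Cache reuse across requests usually depends on recognizing a shared prompt prefix. The sketch below shows one common way to do that, hashing fixed-size token chunks so that each key depends on everything before it. It is an assumption-laden illustration, not LMCache's actual API or storage format: `chunk_keys`, `reusable_prefix_tokens`, and the chunk size of 4 are hypothetical.

```python
import hashlib


def chunk_keys(token_ids: list[int], chunk_size: int = 4) -> list[str]:
    """One key per full chunk; each key covers the whole prefix up to that chunk."""
    keys = []
    running = hashlib.sha256()
    full = len(token_ids) - len(token_ids) % chunk_size  # ignore a partial tail
    for start in range(0, full, chunk_size):
        chunk = token_ids[start:start + chunk_size]
        running.update(",".join(map(str, chunk)).encode() + b";")
        keys.append(running.copy().hexdigest())
    return keys


def reusable_prefix_tokens(token_ids: list[int], store: set[str],
                           chunk_size: int = 4) -> int:
    """Count leading tokens whose KV cache is already stored."""
    hits = 0
    for key in chunk_keys(token_ids, chunk_size):
        if key not in store:
            break                  # the first miss ends the reusable prefix
        hits += 1
    return hits * chunk_size


store = set(chunk_keys([1, 2, 3, 4, 5, 6, 7, 8]))      # a cached system prompt
print(reusable_prefix_tokens([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], store))  # 8
print(reusable_prefix_tokens([1, 2, 3, 4, 9, 9, 9, 9], store))                 # 4
```

Every reused token is prefill work the GPU skips, which is how a cache layer can raise capacity without new hardware.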

Serving | 2025 | Intermediate

Reasoning Language Model Inference Serving Unveiled: An Empirical Study

Qi Li, Junpan Wu, Xiang Liu, Yuxin Wang, Zeyu Li, Zhenheng Tang

An empirical study of reasoning model serving behavior, including memory fluctuations, stragglers, adaptive runtime, and optimization tradeoffs.

GPU Hunter takeaway: Quantization and speculative decoding can help reasoning workloads, but prefix caching and KV quantization may not always pay off.

Serving | 2023 | Intermediate

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee

A scheduling approach that chunks prefills and piggybacks decode requests to improve GPU utilization.

GPU Hunter takeaway: Prompt length and batching policy can change throughput even when GPU, model, and quantization stay fixed.
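The scheduling idea can be sketched as a per-iteration token budget: running decodes each take one slot, and the remaining budget goes to a slice of a waiting prompt. This is a simplified illustration loosely in the spirit of chunked prefills, not the paper's algorithm. The `Request` class, `build_batch`, and the budget of 256 tokens are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    prompt_len: int
    prefilled: int = 0          # prompt tokens already processed

    @property
    def in_decode(self) -> bool:
        return self.prefilled >= self.prompt_len


def build_batch(requests: list[Request], token_budget: int):
    """Return (req_id, phase, num_tokens) entries for one model iteration."""
    batch, budget = [], token_budget
    for r in requests:                      # decodes first: one token each
        if r.in_decode and budget > 0:
            batch.append((r.req_id, "decode", 1))
            budget -= 1
    for r in requests:                      # then fill with prefill chunks
        if not r.in_decode and budget > 0:
            chunk = min(budget, r.prompt_len - r.prefilled)
            r.prefilled += chunk
            batch.append((r.req_id, "prefill", chunk))
            budget -= chunk
    return batch


reqs = [Request("chat", prompt_len=10, prefilled=10), Request("doc", prompt_len=1000)]
print(build_batch(reqs, token_budget=256))
# [('chat', 'decode', 1), ('doc', 'prefill', 255)]
```

The long prompt is spread over several iterations, so the ongoing chat keeps receiving a token every iteration instead of stalling behind a 1000-token prefill.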

Serving | 2024 | Advanced

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee

A serving system that improves tail latency and throughput with chunked prefills and stall-free scheduling.

GPU Hunter takeaway: Relevant when moving from personal local inference to a small shared GPU server.

FAQ

What is the best first LLM serving paper?

Read the efficient LLM inference serving survey first, then the vLLM PagedAttention paper. That sequence gives both the map and the canonical memory-management system.

Should a home server use serving-system papers?

Yes, if more than one user, agent, or process shares the GPU. Scheduling and cache reuse can matter even on a single workstation.

Source pages are linked directly to arXiv, PDFs, and Hugging Face Papers where available. GPU Hunter summaries are editorial context, not copies of the original abstracts.