GPU Inference Optimization Papers: Memory Bandwidth, Kernels & Serving

GPU inference optimization papers on memory bandwidth, FlashAttention, NF4 dequantization, kernel maturity, batching, and LLM bottlenecks.

Updated May 27, 2026 · 8 curated papers · Primary keyword: GPU inference optimization papers
01  //  editorial context

LLM inference performance is often limited by memory movement and kernel implementation, not just advertised FLOPS. This is why two GPUs with similar paper specs can behave differently in llama.cpp, vLLM, TensorRT-LLM, MLX, or ROCm-backed runtimes.

These papers explain the core GPU systems ideas behind GPU Hunter rankings: HBM traffic, attention tiling, dequantization kernels, decode-specific kernels, and benchmark methods that expose real bottlenecks.

02  //  how this changes GPU choice
  • Memory bandwidth should be treated as a first-class GPU feature for inference. The memory-gap paper is the clearest source for this.
  • Kernel maturity changes hardware value. A GPU with strong backend support can beat a theoretically stronger card on local workloads.
  • Quantized formats need fast kernels. NF4, INT4, and FP4 only become buying advantages when the backend can use them efficiently.
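
The first and third points can be combined into a back-of-the-envelope ceiling. The sketch below assumes single-stream decode reads every weight once per generated token and ignores KV cache reads, activations, and kernel overhead, so it gives an upper bound rather than a prediction. The function name, the 8B model size, and the 1000 GB/s bandwidth figure are illustrative assumptions, not GPU Hunter data.

```python
def decode_tokens_per_second_ceiling(params_billion, bits_per_weight, bandwidth_gb_s):
    """Upper bound on single-stream decode speed.

    Assumes every weight is read from memory exactly once per token and
    that nothing else (KV cache, activations, launch overhead) costs time.
    """
    bytes_per_token = params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Hypothetical card with 1000 GB/s of memory bandwidth running an 8B model.
for label, bits in [("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    ceiling = decode_tokens_per_second_ceiling(8, bits, 1000)
    print(f"{label}: at most {ceiling:.1f} tok/s")
```

Halving the bits per weight doubles the ceiling, but only a backend with efficient low-bit kernels gets close to it in practice.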
starter papers

Read these first if you want the fastest path from research to a hardware decision.

Kernels · 2022 · Starter

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao, Daniel Y. Fu, Stefano Ermon, Christopher Ré

An IO-aware exact attention algorithm that reduces HBM traffic by tiling attention through SRAM.

GPU Hunter takeaway: Long context needs memory-efficient kernels; VRAM capacity alone does not guarantee usable speed.

arXiv PDF HF
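
To make the tiling idea concrete, here is a minimal CPU sketch of the running-softmax trick that lets attention be computed one key/value tile at a time without storing the full score vector. It handles a single query in plain Python and says nothing about SRAM, thread blocks, or the paper's actual kernel; the function names and the tile size are illustrative assumptions.

```python
import math


def scaled_scores(q, keys):
    """Dot-product scores scaled by 1/sqrt(d), as in standard attention."""
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]


def naive_attention(q, keys, values):
    """Reference: materialize all scores, then softmax, then weighted sum."""
    scores = scaled_scores(q, keys)
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def tiled_attention(q, keys, values, tile=2):
    """Visit keys/values tile by tile, keeping only a running max, sum, and output."""
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        v_tile = values[start:start + tile]
        scores = scaled_scores(q, keys[start:start + tile])
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        rescale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[d] for w, v in zip(weights, v_tile))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 1.0], [0.0, 2.0, -1.0], [3.0, 1.0, 0.5], [-1.0, 0.5, 2.0], [0.2, 0.1, 0.0]]
values = [[1.0, 2.0], [0.0, -1.0], [4.0, 0.5], [2.0, 2.0], [-3.0, 1.0]]
print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values, tile=2))
```

Both functions return the same result up to floating-point rounding, which is what "exact" means here: the tiled version changes the memory access pattern, not the math.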
Kernels · 2026 · Intermediate

FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems

Shanli Xing, Yiyan Zhai, Alexander Jiang, Yixin Dong, Yong Wu, Zihao Ye

A benchmark framework connecting GPU kernel definitions, workloads, implementations, and evaluations for real inference systems.

GPU Hunter takeaway: Kernel maturity is a hardware feature in practice; the same GPU can behave very differently across inference backends.

arXiv PDF HF
advanced papers

These papers go deeper into kernels, cache policy, low-bit formats, or serving tradeoffs.

Kernels · 2024 · Advanced

MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models

Elias Frantar, Roberto L. Castro, Jiale Chen, Dan Alistarh

A mixed-precision kernel design for keeping 4-bit weight inference fast across useful batch sizes.

GPU Hunter takeaway: A quantized model is only as fast as the kernels that can consume its packed weights efficiently.

arXiv PDF HF
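
As a rough picture of what "consuming packed weights" involves, the sketch below packs eight unsigned 4-bit values into one 32-bit word and then unpacks and rescales them. The nibble order, the single per-group scale, and the fixed zero point of 8 are assumptions chosen for illustration; they are not MARLIN's actual memory layout, and a real kernel does this work on the GPU, fused with the matrix multiply.

```python
def pack_int4(values):
    """Pack unsigned 4-bit integers (0..15) into 32-bit words, eight per word."""
    assert len(values) % 8 == 0, "length must be a multiple of 8"
    words = []
    for i in range(0, len(values), 8):
        word = 0
        for j, v in enumerate(values[i:i + 8]):
            assert 0 <= v <= 15, "value does not fit in 4 bits"
            word |= v << (4 * j)  # lowest nibble holds the first value
        words.append(word)
    return words


def dequantize_int4(words, scale, zero_point=8):
    """Unpack each nibble and map it back to a float: (q - zero_point) * scale."""
    out = []
    for word in words:
        for j in range(8):
            q = (word >> (4 * j)) & 0xF
            out.append((q - zero_point) * scale)
    return out


quantized = [0, 8, 15, 7, 9, 1, 14, 8]
packed = pack_int4(quantized)
weights = dequantize_int4(packed, scale=0.5)
print(packed, weights)
```

Every weight costs a shift, a mask, a subtract, and a multiply before it can enter the matrix product, which is why the speed of a 4-bit model depends on how well the kernel hides that work behind memory loads.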
Kernels · 2023 · Advanced

A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library

Ganesh Bikshandi, Jay Shah

A detailed look at implementing FlashAttention-2 on Hopper with CUTLASS, TMA, WGMMA, and fused CUDA kernels.

GPU Hunter takeaway: Architecture-specific kernel support is a real buying consideration for workstation and datacenter GPUs.

arXiv PDF HF
full curated set

The complete paper set for this topic, with source links and GPU Hunter takeaways.

Kernels · 2026 · Intermediate

FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems

Shanli Xing, Yiyan Zhai, Alexander Jiang, Yixin Dong, Yong Wu, Zihao Ye

A benchmark framework connecting GPU kernel definitions, workloads, implementations, and evaluations for real inference systems.

GPU Hunter takeaway: Kernel maturity is a hardware feature in practice; the same GPU can behave very differently across inference backends.

arXiv PDF HF
Kernels · 2022 · Starter

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao, Daniel Y. Fu, Stefano Ermon, Christopher Ré

An IO-aware exact attention algorithm that reduces HBM traffic by tiling attention through SRAM.

GPU Hunter takeaway: Long context needs memory-efficient kernels; VRAM capacity alone does not guarantee usable speed.

arXiv PDF HF
Kernels · 2023 · Intermediate

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Tri Dao

An improved FlashAttention implementation with better work partitioning, occupancy, and warp-level scheduling.

GPU Hunter takeaway: Two GPUs with similar specs can feel different when the runtime exposes better attention kernels.

arXiv PDF HF
Kernels · 2023 · Advanced

FlashDecoding++: Faster Large Language Model Inference on GPUs

Ke Hong, Guohao Dai, Jiaming Xu, Yu Wang

A decoding-focused inference engine using asynchronous softmax, flat GEMM optimization, and hardware-adaptive dataflow.

GPU Hunter takeaway: Good background for why tok/s rankings depend on decode kernels, batch size, and backend maturity.

arXiv PDF HF
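
One way to read "asynchronous softmax" is that each chunk of scores subtracts the same fixed reference value instead of a maximum that has to be agreed on across chunks, so partial sums can simply be added. The sketch below illustrates that reading on the CPU. The function names, the chunk size, and the choice of reference value are assumptions, and the sketch does not handle scores far enough from the reference value to overflow or underflow, which a real implementation has to deal with.

```python
import math


def softmax_reference(scores):
    """Standard softmax: needs the global maximum before any exponent is taken."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_fixed_reference(scores, phi, chunk=4):
    """Softmax computed chunk by chunk against one fixed reference value phi.

    Every chunk subtracts the same phi, so partial sums add up directly and
    no chunk has to wait for another chunk's maximum.
    """
    partial_exps = []
    partial_sums = []
    for start in range(0, len(scores), chunk):
        exps = [math.exp(s - phi) for s in scores[start:start + chunk]]
        partial_exps.append(exps)
        partial_sums.append(sum(exps))
    total = sum(partial_sums)  # plain addition, no rescaling step
    return [e / total for exps in partial_exps for e in exps]


scores = [0.3, -1.2, 2.5, 0.0, 1.1, -0.7, 3.0, 0.9, -2.0]
print(softmax_reference(scores))
print(softmax_fixed_reference(scores, phi=1.0))
```

The two results agree up to rounding because the reference value cancels in the final division; what changes is that the chunks no longer depend on each other.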
Kernels · 2024 · Advanced

MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models

Elias Frantar, Roberto L. Castro, Jiale Chen, Dan Alistarh

A mixed-precision kernel design for keeping 4-bit weight inference fast across useful batch sizes.

GPU Hunter takeaway: A quantized model is only as fast as the kernels that can consume its packed weights efficiently.

arXiv PDF HF
Kernels · 2023 · Advanced

A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library

Ganesh Bikshandi, Jay Shah

A detailed look at implementing FlashAttention-2 on Hopper with CUTLASS, TMA, WGMMA, and fused CUDA kernels.

GPU Hunter takeaway: Architecture-specific kernel support is a real buying consideration for workstation and datacenter GPUs.

arXiv PDF HF
FAQ

Why does GPU Hunter care about bandwidth so much?

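Many LLM inference paths are memory-bound: they spend more time moving data than computing on it. In single-stream decode, each generated token requires reading the model weights from memory while doing little arithmetic per byte, so bandwidth, cache behavior, and kernel quality can dominate tokens per second.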
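
A simple roofline check shows where the crossover sits. The sketch below counts FLOPs and bytes for one weight matrix multiply at different batch sizes and compares the ratio with a GPU's ridge point. The 4096 x 4096 layer, FP16 storage, and the 100 TFLOPS / 1000 GB/s card are hypothetical numbers chosen for illustration, and the byte count ignores caching effects.

```python
def arithmetic_intensity(batch, n_in, n_out, bytes_per_weight=2.0, bytes_per_act=2.0):
    """FLOPs per byte moved for y = x @ W, with x of shape (batch, n_in)."""
    flops = 2 * batch * n_in * n_out
    bytes_moved = (n_in * n_out * bytes_per_weight
                   + batch * (n_in + n_out) * bytes_per_act)
    return flops / bytes_moved


def is_memory_bound(intensity, peak_tflops, bandwidth_gb_s):
    """Roofline test: below the ridge point, bandwidth limits throughput."""
    ridge = peak_tflops * 1e12 / (bandwidth_gb_s * 1e9)
    return intensity < ridge


# Hypothetical GPU: 100 TFLOPS peak, 1000 GB/s bandwidth (ridge = 100 FLOPs/byte).
for batch in (1, 64, 256):
    ai = arithmetic_intensity(batch, 4096, 4096)
    bound = "memory" if is_memory_bound(ai, 100, 1000) else "compute"
    print(f"batch={batch}: {ai:.1f} FLOPs/byte, {bound}-bound")
```

At batch size 1 the layer performs roughly one FLOP per byte read, far below the ridge point, so a faster memory system helps more than extra compute.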

Do better kernels change which GPU to buy?

Yes. CUDA, ROCm, Metal, and engine-specific kernels can change the effective value of a GPU for local inference.

Source pages are linked directly to arXiv, PDFs, and Hugging Face Papers where available. GPU Hunter summaries are editorial context, not copies of the original abstracts.