Kernels 2025 Intermediate
Pol G. Recasens, Ferran Agullo, Yue Zhu, Chen Wang, Eun Kyung Lee, Olivier Tardieu
A GPU-level analysis showing large-batch LLM inference can remain DRAM-bandwidth bound even when conventional explanations call it compute-bound.
GPU Hunter takeaway: Batch-size scaling should be benchmarked against memory bandwidth behavior, not inferred from TFLOPS alone.
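A rough back-of-envelope sketch of the takeaway above: compare arithmetic intensity (FLOPs per byte moved) with a GPU's roofline ridge point. The function names and the 300 TFLOPS / 2000 GB/s figures are hypothetical, chosen only for illustration. The attention model (each cached K/V element is read once per sequence, so batching does not raise its intensity) is a simplifying assumption, not a result quoted from the paper.

```python
def linear_intensity(batch_size: int, bytes_per_weight: float = 2.0) -> float:
    """FLOPs per byte for a weight matmul during decode.

    Each weight is read once and reused for one multiply-add (2 FLOPs)
    per sequence in the batch, so intensity grows with batch size.
    """
    return 2.0 * batch_size / bytes_per_weight


def attention_intensity(bytes_per_kv_elem: float = 2.0) -> float:
    """FLOPs per byte for decode attention over a KV cache.

    Assumption: each cached K or V element belongs to one sequence and is
    used in a single multiply-add, so batching does not add reuse.
    """
    return 2.0 / bytes_per_kv_elem


def ridge_point(peak_tflops: float, bandwidth_gb_s: float) -> float:
    """Intensity (FLOPs/byte) above which a kernel becomes compute bound."""
    return (peak_tflops * 1e12) / (bandwidth_gb_s * 1e9)


def bound(intensity: float, ridge: float) -> str:
    """Classify a kernel against the roofline ridge point."""
    return "memory" if intensity < ridge else "compute"


if __name__ == "__main__":
    ridge = ridge_point(peak_tflops=300.0, bandwidth_gb_s=2000.0)  # hypothetical GPU
    for batch in (1, 32, 256):
        print(batch,
              bound(linear_intensity(batch), ridge),
              bound(attention_intensity(), ridge))
```

Under these assumptions the weight matmuls cross the ridge at large batch sizes while the attention term stays on the memory-bound side. That is one way a large-batch workload can remain limited by DRAM bandwidth.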
Kernels 2026 Advanced
Xiangbo Qi, Chaoyi Jiang, Murali Annavaram
A shared-memory kernel optimization for accelerating NF4 dequantization on NVIDIA GPUs during quantized LLM inference.
GPU Hunter takeaway: Quantized model size alone is not enough; GPU Hunter benchmarks should also check whether the backend has fast unpacking and dequantization kernels.
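To make the "unpacking and dequant" step concrete, here is a minimal CPU-side sketch of blockwise NF4 dequantization. It is not the paper's kernel. The codebook values are the NF4 levels as commonly published and should be checked against the backend in use. The high-nibble-first packing order, the block size, and the function names are assumptions for illustration.

```python
from typing import List

# NF4 codebook: 16 levels in [-1, 1]; verify against your backend before relying on it.
NF4_LEVELS = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]


def unpack_nibbles(packed: bytes) -> List[int]:
    """Split each byte into two 4-bit codes (assumed order: high nibble first)."""
    codes = []
    for byte in packed:
        codes.append(byte >> 4)
        codes.append(byte & 0x0F)
    return codes


def dequantize_nf4(packed: bytes, absmax: List[float], block_size: int = 64) -> List[float]:
    """Look up each 4-bit code and rescale it by its block's absmax scale."""
    codes = unpack_nibbles(packed)
    return [NF4_LEVELS[code] * absmax[i // block_size] for i, code in enumerate(codes)]


if __name__ == "__main__":
    # Two bytes hold four weights; one block of four weights with scale 2.0.
    print(dequantize_nf4(bytes([0x0F, 0x70]), absmax=[2.0], block_size=4))
```

Every weight costs a bit shift or mask, a table lookup, and a multiply before any matmul can use it, which is why the speed of this step matters alongside the smaller memory footprint.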
Kernels 2026 Intermediate
FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems
Shanli Xing, Yiyan Zhai, Alexander Jiang, Yixin Dong, Yong Wu, Zihao Ye
A benchmark framework connecting GPU kernel definitions, workloads, implementations, and evaluations for real inference systems.
GPU Hunter takeaway: Kernel maturity is a hardware feature in practice; the same GPU can behave very differently across inference backends.
Kernels 2022 Starter
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Tri Dao, Daniel Y. Fu, Stefano Ermon, Christopher Ré
An IO-aware exact attention algorithm that reduces HBM traffic by tiling attention through SRAM.
GPU Hunter takeaway: Long context needs memory-efficient kernels; VRAM capacity alone does not guarantee usable speed.
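The tiling idea in the summary above can be sketched for a single query in plain Python. This is an illustrative sketch of the online-softmax trick, not the FlashAttention kernel itself. Keys and values are processed one tile at a time, and a running maximum and running sum rescale the partial output, so the full score row is never stored. Function names and the tile size are hypothetical.

```python
import math
from typing import List, Sequence


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def attention_reference(q, keys, values) -> List[float]:
    """Plain softmax attention for one query: materializes all scores at once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [_dot(q, k) * scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def attention_tiled(q, keys, values, tile: int = 2) -> List[float]:
    """Same result, but only one tile of scores exists at any time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        scores = [_dot(q, k) * scale for k in keys[start:start + tile]]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (0.0 on the first tile).
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, values[start:start + tile]):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


if __name__ == "__main__":
    q = [1.0, 0.5]
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.1]]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
    print(attention_reference(q, keys, values))
    print(attention_tiled(q, keys, values, tile=2))
```

On a GPU the per-tile work would stay in on-chip SRAM. Here the loop only demonstrates that the tiled result matches the untiled one up to floating-point rounding.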
Kernels 2023 Intermediate
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Tri Dao
An improved FlashAttention implementation with better work partitioning, occupancy, and warp-level scheduling.
GPU Hunter takeaway: Two GPUs with similar specs can feel different when the runtime exposes better attention kernels.
Kernels 2023 Advanced
FlashDecoding++: Faster Large Language Model Inference on GPUs
Ke Hong, Guohao Dai, Jiaming Xu, Yu Wang
A decoding-focused inference engine using asynchronous softmax, flat GEMM optimization, and hardware-adaptive dataflow.
GPU Hunter takeaway: Good background for why tok/s rankings depend on decode kernels, batch size, and backend maturity.
Kernels 2024 Advanced
MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models
Elias Frantar, Roberto L. Castro, Jiale Chen, Dan Alistarh
A mixed-precision kernel design for keeping 4-bit weight inference fast across useful batch sizes.
GPU Hunter takeaway: A quantized model is only as fast as the kernels that can consume its packed weights efficiently.
Kernels 2023 Advanced
A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library
Ganesh Bikshandi, Jay Shah
A detailed look at implementing FlashAttention-2 on Hopper with CUTLASS, TMA, WGMMA, and fused CUDA kernels.
GPU Hunter takeaway: Architecture-specific kernel support is a real buying consideration for workstation and datacenter GPUs.