
2026 LLM Inference Research Papers

Curated 2026 LLM inference research across local AI, FP4 quantization, KV cache optimization, kernels, AMD serving, and single-GPU systems.

Updated May 27, 2026 · 15 papers
why this year matters

This year page groups the latest 2026 papers in the GPU Hunter research library. The recurring theme is deployability: cache formats, kernel paths, adaptive runtime decisions, and hardware-aware quantization.

Use this page when you want a chronological view before moving into the topic clusters.

curated papers
Quantization · arXiv:2601.07475

ARCQuant: Boosting NVFP4 Quantization with Augmented Residual Channels for LLMs

Blackwell-class FP4 gains will depend on quantization methods designed around NVFP4's real block and precision rules.

Kernels · arXiv:2601.00227

FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems

Kernel maturity is a hardware feature in practice; the same GPU can behave very differently across inference backends.
