01  //  research library

Research Library for Local AI Inference, Quantization & GPUs

A curated gallery of arXiv papers that explain why local inference is constrained by VRAM, memory bandwidth, kernels, KV cache, and quantization. Updated with 2025 and 2026 work on FP4, KV-cache compression, Apple Silicon, constrained GPUs, and modern serving engines so visitors can go deeper than a buying guide.

Last updated May 27, 2026
25 paper summaries indexed
68 curated papers
7 research tracks
Latest year: 2026
Primary source: arXiv

What this library covers

Local inference, LLM quantization, KV cache, GPU memory bandwidth, serving systems, kernels, Apple Silicon, ROCm, CUDA, GGUF, FP4, NF4, and practical model-fit constraints.

How papers affect GPU choice

A paper belongs here when it changes VRAM fit, tokens per second, context length, runtime support, kernel maturity, or whether a cheaper GPU is good enough.

How to use it

Start with a topic page, read the GPU Hunter takeaways, then move into Browse, Compare, GPU detail pages, and buying guides for the hardware decision.

02  //  start here by goal

Crawlable research clusters

2026 research archive
04  //  buying guides

Turn papers into a buying shortlist

Best GPUs for local AI in 2026 / Budget GPUs under $1,000 / RTX PRO 6000 vs H100
Serving / Intermediate
2026 / 2603.10031

Architecture-Aware LLM Inference Optimization on AMD Instinct GPUs: A Comprehensive Benchmark and Deployment Study

Athos Georgiou

A production inference benchmark on AMD Instinct MI325X GPUs across large dense, MoE, MLA, and GQA model families using vLLM.

Why it matters

GPU Hunter needs more than NVIDIA-only assumptions; AMD serving performance depends heavily on architecture-aware configuration.

Local inference takeaway

For AMD datacenter GPUs, model architecture, KV offload support, runtime kernels, and block size choices can dominate hardware specs.

arXiv / GPU Hunter summary / PDF / HF

Quantization / Advanced
2026 / 2601.07475

ARCQuant: Boosting NVFP4 Quantization with Augmented Residual Channels for LLMs

Haoqian Meng, Yilun Luo, Yafei Zhao, Wenyuan Liu, Peng Zhang, Xindian Ma

A post-training quantization method for NVFP4 that augments residual channels to reduce low-bit quantization error under hardware constraints.

Why it matters

It is directly tied to NVIDIA Blackwell's FP4 path, where format constraints shape what quantization methods can run fast.

Local inference takeaway

Blackwell-class FP4 gains will depend on quantization methods designed around NVFP4's real block and precision rules.
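
The block and precision rules mentioned above can be sketched in a few lines. This is an illustration only, not ARCQuant's method. It assumes the E2M1 value grid (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and one shared scale per 16-element block, and it keeps that scale in full precision where real formats round it to a narrow type. The names `fake_quant_block` and `fake_quant` are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def fake_quant_block(block):
    """Quantize one block to the E2M1 grid with a shared scale, then dequantize.

    Assumption: the scale stays in full precision; hardware formats
    round the scale to a narrow type as well.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / E2M1_GRID[-1]  # map the largest magnitude onto 6.0
    out = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        out.append(nearest * scale if x >= 0 else -nearest * scale)
    return out


def fake_quant(values, block_size=16):
    """Apply block-wise fake quantization to a flat list of floats."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(fake_quant_block(values[start:start + block_size]))
    return out


# A block whose values already sit on the grid survives unchanged.
friendly = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0] * 2
# One outlier stretches the shared scale and flushes the small value to zero.
outlier = [60.0, 1.0] + [0.0] * 14
```

In the `outlier` block the single large value sets the scale, so the neighbouring 1.0 rounds to zero. That is the kind of error that outlier-handling quantization methods try to avoid.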

arXiv / PDF / HF

KV Cache / Starter
2026 / 2604.05012

Comparative Characterization of KV Cache Management Strategies for LLM Inference

Oteo Mamo, Olga Kogiou, Hyunjin Yi, Weikuan Yu

An empirical comparison of KV cache management frameworks including vLLM, InfiniGen, and H2O across latency, throughput, memory, request rates, and model sizes.

Why it matters

It gives buyers and builders a side-by-side view of when paging, offload, eviction, or sparse strategies actually help.

Local inference takeaway

The right KV cache strategy depends on workload shape; no one approach wins across every GPU, context length, and batch size.

arXiv / GPU Hunter summary / PDF / HF

KV Cache / Advanced
2026 / 2602.08005

DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity

Jitai Hao, Qiang Huang, Yaowei Wang, Min Zhang, Jun Yu

A residual KV cache compression framework that exploits long-range inter-token similarity and shared latent components.

Why it matters

Long context is increasingly bottlenecked by cache growth, and residual structure is another way to reduce memory movement.

Local inference takeaway

For agent and long-document workloads, cache compression quality may matter more than raw single-token decode speed.

arXiv / GPU Hunter summary / PDF / HF

Quantization / Intermediate
2026 / 2603.08747

Diagnosing FP4 inference: a layer-wise and block-wise sensitivity analysis of NVFP4 and MXFP4

Musa Cim, Burak Topcu, Mahmut Taylan Kandemir

A component-wise sensitivity study of NVFP4 and MXFP4 quantization across Qwen2.5 model scales and transformer blocks.

Why it matters

FP4 support is a major Blackwell and next-gen accelerator selling point, but not every layer tolerates the same low precision.

Local inference takeaway

Treat FP4 as a hardware capability that still needs model-aware quantization policy, not a universal speed switch.

arXiv / GPU Hunter summary / PDF / HF

KV Cache / Intermediate
2026 / 2604.04722

Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs

Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Patrick Woods, Gabriel Hillesheim, Abolfazl Razi

An adaptive KV-cache quantization approach for mobile, embedded, and edge LLM inference where memory bandwidth and cache growth dominate.

Why it matters

On-device inference cannot afford fixed precision everywhere; wasting bits directly reduces usable context and throughput.

Local inference takeaway

Small GPUs, laptops, and edge boxes need cache precision policies that spend memory only where the model is sensitive.

arXiv / GPU Hunter summary / PDF / HF

Kernels / Advanced
2026 / 2604.02556

Fast NF4 Dequantization Kernels for Large Language Model Inference

Xiangbo Qi, Chaoyi Jiang, Murali Annavaram

A shared-memory kernel optimization for accelerating NF4 dequantization on NVIDIA GPUs during quantized LLM inference.

Why it matters

NF4 reduces model memory, but dequantization overhead can erase the win if the GPU kernel path is weak.

Local inference takeaway

Quantized model size is not enough; GPU Hunter benchmarks should care whether the backend has fast unpacking and dequant kernels.

arXiv / GPU Hunter summary / PDF / HF

Kernels / Intermediate
2026 / 2601.00227

FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems

Shanli Xing, Yiyan Zhai, Alexander Jiang, Yixin Dong, Yong Wu, Zihao Ye

A benchmark framework connecting GPU kernel definitions, workloads, implementations, and evaluations for real inference systems.

Why it matters

The next wave of inference optimization depends on faster iteration between kernels, benchmarking, and deployment.

Local inference takeaway

Kernel maturity is a hardware feature in practice; the same GPU can behave very differently across inference backends.

arXiv / PDF / HF

Serving / Intermediate
2026 / 2602.00328

Harvest: Opportunistic Peer-to-Peer GPU Caching for LLM Inference

Nikhil Gopal, Kostis Kaffes

A peer-to-peer GPU cache management framework that uses high-bandwidth GPU interconnects to reduce host-memory offload latency.

Why it matters

Multi-GPU inference is often limited by where model state and KV tensors live, not just aggregate VRAM on paper.

Local inference takeaway

For dual-GPU and workstation setups, interconnect bandwidth and cache placement can change whether extra GPUs actually help.

arXiv / GPU Hunter summary / PDF / HF

KV Cache / Starter
2026 / 2603.20397

KV Cache Optimization Strategies for Scalable and Efficient LLM Inference

Yichun Xu, Navjot K. Khaira, Tejinder Singh

A survey that organizes KV cache optimization into eviction, compression, hybrid memory, novel attention, and combined strategies.

Why it matters

It is a useful map of the current KV-cache field, which has become the core bottleneck for long-context inference.

Local inference takeaway

Use this to decide whether a workload needs cache compression, offload, eviction, or a hybrid policy before buying more VRAM.

arXiv / GPU Hunter summary / PDF / HF

Local Inference / Intermediate
2026 / 2605.23057

ModeSwitch-LLM: A Lightweight Phase-Aware Controller for Cross-Mode LLM Inference on a Single GPU

Aman Sunesh, Ali Alshehhi, Hivansh Dhakne

A single-GPU controller that routes each request across FP16, quantized, speculative, prefix-cached, and batched inference modes using cheap workload features.

Why it matters

It treats local inference as a dynamic operating problem instead of a one-time model loading choice.

Local inference takeaway

A desktop GPU can serve different prompts better when the runtime switches strategy per request instead of locking into one mode.

arXiv / GPU Hunter summary / PDF / HF

KV Cache / Advanced
2026 / 2605.17757

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

Zhongzhu Zhou, Donglin Zhuang, Jisen Li, Ziyan Chen

A 2-bit KV cache quantization method that derives offline rotations and clipping thresholds from attention-aware covariance structure.

Why it matters

INT2 KV cache is one of the most aggressive ways to stretch long context on limited VRAM, but only if accuracy holds.

Local inference takeaway

Long-context claims on consumer GPUs will increasingly depend on KV-cache-specific quantization, not just weight quantization.

arXiv / GPU Hunter summary / PDF / HF

KV Cache / Advanced
2026 / 2604.19157

SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving

Jinda Jia, Jisen Li, Zhongzhu Zhou, Jung Hwan Heo, Jue Wang, Tri Dao

A practical INT4 KV-cache method designed around paged layouts, regular memory access, and fused attention execution.

Why it matters

Many KV compression papers look good offline but fail when the serving engine needs predictable memory access and fast kernels.

Local inference takeaway

For vLLM-style serving, deployability matters as much as compression ratio; the cache format has to fit the kernel path.

arXiv / GPU Hunter summary / PDF / HF

Serving / Intermediate
2026 / 2601.11580

Speculative Decoding: Performance or Illusion?

Xiaoxuan Liu, Jiaxiang Yu, Jongseok Park, Ion Stoica, Alvin Cheung

A production-grade vLLM study of speculative decoding variants across workloads, model scales, and batch sizes.

Why it matters

Speculative decoding can look excellent in small research demos while underperforming under realistic batching and serving pressure.

Local inference takeaway

Use speculative decoding carefully: the speedup depends on workload, draft method, batch size, and engine implementation.
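
A back-of-envelope model shows why the speedup depends on so many factors. The sketch below is not the study's methodology. It uses the common simplification that each drafted token is accepted independently with the same probability, and it treats the draft cost as a fixed fraction of one target-model pass. The function names and parameters are hypothetical.

```python
def expected_tokens_per_pass(accept_rate, draft_len):
    """Expected tokens emitted per target-model pass.

    Assumes every drafted token is accepted independently with
    probability accept_rate; the pass always yields at least one token.
    """
    if accept_rate >= 1.0:
        return draft_len + 1.0
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def modeled_speedup(accept_rate, draft_len, draft_cost_ratio):
    """Speedup over plain decoding under the simplified cost model.

    draft_cost_ratio is the cost of one draft step relative to one
    target pass (an assumed input, not a measured value).
    """
    cost_per_pass = draft_len * draft_cost_ratio + 1.0
    return expected_tokens_per_pass(accept_rate, draft_len) / cost_per_pass


cheap_draft = modeled_speedup(accept_rate=0.8, draft_len=4, draft_cost_ratio=0.1)
costly_draft = modeled_speedup(accept_rate=0.8, draft_len=4, draft_cost_ratio=1.0)
```

With a cheap draft the model predicts a clear win. When the draft step costs about as much as a target pass, which can happen once batching makes the target pass efficient per token, the same acceptance rate yields a slowdown.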

arXiv / GPU Hunter summary / PDF / HF

Local Inference / Starter
2026 / 2601.14277

Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct

Uygar Kurt

A unified empirical evaluation of llama.cpp quantization formats for Llama-3.1-8B-Instruct on commodity hardware.

Why it matters

Most local users choose between GGUF quantization formats before they understand the real quality and runtime tradeoffs.

Local inference takeaway

A GPU recommendation is incomplete without naming the quantization formats that are realistic for that VRAM tier.
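
The link between quantization format and VRAM tier is simple arithmetic on bits per weight. The sketch below is a rough estimate under stated assumptions, not a result from the paper. It assumes roughly 8 billion parameters, uses 8.5 and 4.5 bits per weight for Q8_0 and Q4_0 (32 weights plus one 16-bit scale per block), and reserves a hypothetical 1.5 GiB for KV cache, activations, and runtime overhead.

```python
GIB = 2 ** 30


def weight_bytes(n_params, bits_per_weight):
    """Approximate size of the weights alone, in bytes."""
    return n_params * bits_per_weight / 8


def fits_in_vram(n_params, bits_per_weight, vram_gib, reserve_gib=1.5):
    """True if weights plus a fixed reserve fit in the given VRAM.

    reserve_gib is a hypothetical allowance for KV cache, activations,
    and runtime buffers; real needs depend on context length and backend.
    """
    needed_gib = weight_bytes(n_params, bits_per_weight) / GIB + reserve_gib
    return needed_gib <= vram_gib


# Assumed bits per weight, including per-block scales.
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q4_0": 4.5}
N_PARAMS = 8.0e9  # assumed parameter count for an 8B-class model

report = {
    name: fits_in_vram(N_PARAMS, bpw, vram_gib=8)
    for name, bpw in BITS_PER_WEIGHT.items()
}
```

Under these assumptions only the 4-bit format leaves room on an 8 GiB card, which is why a VRAM tier and a quantization format have to be named together.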

arXiv / GPU Hunter summary / PDF / HF

Local Inference / Intermediate
2025 / 2506.20187

Breaking the Boundaries of Long-Context LLM Inference: Adaptive KV Management on a Single Commodity GPU

He Sun, Li Li, Mingjun Xiao, Chengzhong Xu

LeoAM uses adaptive hierarchical GPU-CPU-disk KV management, lightweight KV abstracts, compression, and pipelining for long context on one commodity GPU.

Why it matters

It targets the exact GPU Hunter user who wants long-context local inference without a datacenter card.

Local inference takeaway

A single desktop GPU can handle longer contexts when the runtime manages GPU, CPU, and disk tiers deliberately.

arXiv / GPU Hunter summary / PDF / HF

Quantization / Advanced
2025 / 2509.23202

Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization

Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit

A study of MXFP4 and NVFP4 post-training quantization that introduces Micro-Rotated-GPTQ and reports GPU kernel-backed speedups.

Why it matters

FP4 is marketed as the next inference leap, but this paper shows why format-specific quantization is required to unlock it.

Local inference takeaway

RTX 5090 and Blackwell FP4 claims should be judged against real FP4 kernels and accuracy, not just advertised tensor formats.

arXiv / GPU Hunter summary / PDF / HF

KV Cache / Advanced
2025 / 2511.01815

KV Cache Transform Coding for Compact Storage in LLM Inference

Konrad Staniszewski, Adrian Lancucki

A lightweight transform coder that compresses reusable KV caches using PCA-style decorrelation, adaptive quantization, and entropy coding.

Why it matters

Shared-prefix chat and coding workflows can accumulate stale caches that consume GPU memory or force recomputation.

Local inference takeaway

Persistent KV cache storage can become a product feature for coding agents and local chat apps, not just a runtime optimization.

arXiv / GPU Hunter summary / PDF / HF

KV Cache / Advanced
2025 / 2502.04420

KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference

Xing Li, Zeyu Xing, Yiming Li, Linping Qu, Hui-Ling Zhen, Wulong Liu

A mixed-precision KV-cache quantization framework that searches layer-wise key/value precision pairs under hardware-friendly constraints.

Why it matters

It shows why one KV precision setting rarely fits every layer or model, especially under long-context latency targets.

Local inference takeaway

KV cache quantization should be model-aware; uniform low-bit settings can waste quality or memory depending on the layer.

arXiv / GPU Hunter summary / PDF / HF

KV Cache / Intermediate
2025 / 2510.09665

LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference

Yuhan Liu, Yihua Cheng, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng

An open-source KV cache layer for extracting, storing, offloading, transferring, and reusing caches across vLLM and SGLang engines.

Why it matters

It reframes KV cache as a shared storage and communication layer rather than private state inside one inference engine.

Local inference takeaway

For hosted local-AI products, cache reuse and prefill-decode disaggregation can beat simply adding more GPUs.

arXiv / GPU Hunter summary / PDF / HF

Kernels / Intermediate
2025 / 2503.08311

Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference

Pol G. Recasens, Ferran Agullo, Yue Zhu, Chen Wang, Eun Kyung Lee, Olivier Tardieu

A GPU-level analysis showing large-batch LLM inference can remain DRAM-bandwidth bound even when conventional explanations call it compute-bound.

Why it matters

It reinforces GPU Hunter's central point: memory bandwidth and data movement often decide inference performance.

Local inference takeaway

Batch-size scaling should be benchmarked against memory bandwidth behavior, not inferred from TFLOPS alone.
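
A simple roofline-style estimate illustrates why batching does not automatically make decoding compute-bound. The sketch below is not the paper's analysis. It assumes weights are read once per decode step, each sequence reads its own KV cache, compute is about 2 FLOPs per parameter per token, and attention FLOPs are ignored. The GPU numbers are hypothetical round figures, not a specific product.

```python
def decode_step_bound(weight_bytes, kv_bytes_per_seq, n_params, batch,
                      mem_bw_bytes_s, peak_flops):
    """Classify one decode step as memory-bound or compute-bound.

    Simplified model: weights are streamed once per step, every sequence
    streams its own KV cache, and attention FLOPs are ignored.
    """
    bytes_moved = weight_bytes + batch * kv_bytes_per_seq
    flops = 2.0 * n_params * batch
    t_mem = bytes_moved / mem_bw_bytes_s
    t_compute = flops / peak_flops
    bound = "memory" if t_mem >= t_compute else "compute"
    return bound, t_mem, t_compute


# Hypothetical 8B-class model in FP16 on a hypothetical GPU.
MODEL = dict(weight_bytes=16e9, n_params=8e9)
GPU = dict(mem_bw_bytes_s=1e12, peak_flops=2e14)

single = decode_step_bound(kv_bytes_per_seq=2 ** 30, batch=1, **MODEL, **GPU)
large_long_ctx = decode_step_bound(kv_bytes_per_seq=2 ** 30, batch=256, **MODEL, **GPU)
large_short_ctx = decode_step_bound(kv_bytes_per_seq=0, batch=256, **MODEL, **GPU)
```

In this model a large batch amortizes weight reads, but KV traffic grows with batch size, so the long-context case stays memory-bound while the short-context case tips toward compute.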

arXiv / GPU Hunter summary / PDF / HF

Local Inference / Advanced
2025 / 2506.03296

Parallel CPU-GPU Execution for LLM Inference on Constrained GPUs

Jiakun Fan, Yanglin Zhang, Xiangchen Li, Dimitrios S. Nikolopoulos

A hybrid CPU-GPU execution approach that overlaps CPU-offloaded KV and attention work with GPU execution during memory-bound decoding.

Why it matters

Constrained GPUs are the norm for local users, and naive offload often loses to PCIe and scheduling overhead.

Local inference takeaway

Offload is only useful when CPU work, GPU kernels, and transfers overlap; otherwise it just makes a barely fitting model slow.
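
The overlap argument reduces to comparing a sum with a maximum. The sketch below is an idealized model, not the paper's scheduler. It assumes the three stages of a decode step are either fully serialized or perfectly overlapped, and the timings are hypothetical.

```python
def step_time_serial(t_gpu_ms, t_cpu_ms, t_transfer_ms):
    """Step time when GPU work, CPU work, and transfers run back to back."""
    return t_gpu_ms + t_cpu_ms + t_transfer_ms


def step_time_overlapped(t_gpu_ms, t_cpu_ms, t_transfer_ms):
    """Step time under ideal overlap: the slowest stage sets the pace."""
    return max(t_gpu_ms, t_cpu_ms, t_transfer_ms)


# Hypothetical per-step timings in milliseconds.
serial_ms = step_time_serial(10.0, 8.0, 6.0)
overlapped_ms = step_time_overlapped(10.0, 8.0, 6.0)
```

Real runtimes land somewhere between the two numbers, and how close they get to the overlapped figure is what separates useful offload from a slow one.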

arXiv / GPU Hunter summary / PDF / HF

Local Inference / Starter
2025 / 2508.08531

Profiling Large Language Model Inference on Apple Silicon: A Quantization Perspective

Afsara Benazir, Felix Xiaozhu Lin

A profiling study of Apple Silicon's unified memory architecture for on-device LLM inference under different quantization choices.

Why it matters

Apple Silicon competes on unified memory capacity rather than discrete-GPU VRAM, so it needs a different inference mental model.

Local inference takeaway

Mac recommendations should compare quantized throughput, memory pressure, and unified-memory behavior against CUDA GPUs.

arXiv / GPU Hunter summary / PDF / HF

Serving / Intermediate
2025 / 2510.18672

Reasoning Language Model Inference Serving Unveiled: An Empirical Study

Qi Li, Junpan Wu, Xiang Liu, Yuxin Wang, Zeyu Li, Zhenheng Tang

An empirical study of reasoning model serving behavior, including memory fluctuations, stragglers, adaptive runtime, and optimization tradeoffs.

Why it matters

Reasoning models change the cost profile of inference because long outputs and variable thinking time stress serving systems.

Local inference takeaway

Quantization and speculative decoding can help reasoning workloads, but prefix caching and KV quantization may not always pay off.

arXiv / GPU Hunter summary / PDF / HF

Serving / Starter
2025 / 2504.19720

Taming the Titans: A Survey of Efficient LLM Inference Serving

Ranran Zhen, Juntao Li, Yixin Ji, Zhenlin Yang, Tong Liu, Qingrong Xia

A broad survey of efficient LLM serving methods covering memory overhead, attention costs, batching, quantization, and system design.

Why it matters

It gives readers a current map before diving into specialized serving, kernel, and KV-cache papers.

Local inference takeaway

Use this as a 2025 baseline for the serving stack before comparing vLLM, SGLang, TensorRT-LLM, and local runtimes.

arXiv / GPU Hunter summary / PDF / HF

KV Cache / Advanced
2025 / 2504.19874

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni

A near-optimal online vector quantization method used for KV-cache compression and inner-product-preserving low-bit representations.

Why it matters

TurboQuant became a major 2026 discussion point because KV cache memory is now the limiting factor for long-context serving.

Local inference takeaway

KV cache compression can expand usable context on the same GPU, but implementation quality determines whether the promise reaches local runtimes.

arXiv / GPU Hunter summary / PDF / HF

KV Cache / Advanced
2025 / 2508.10395

XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization

Aditya Tomar, Coleman Hooper, Minjae Lee, Haocheng Xi, Rishabh Tiwari, Wonjun Kang

A cache-rematerialization approach that trades extra computation for lower KV cache memory traffic and storage pressure.

Why it matters

GPU compute throughput has been growing faster than memory bandwidth, which makes recomputation attractive in the right regime.

Local inference takeaway

A faster GPU is not always the answer; sometimes spending compute to reduce memory movement is the better trade.

arXiv / GPU Hunter summary / PDF / HF

Quantization / Advanced
2024 / 2411.04965

BitNet a4.8: 4-bit Activations for 1-bit LLMs

Hongyu Wang, Shuming Ma, Furu Wei

Extends the 1-bit LLM line with 4-bit activations, sparsified intermediate states, and low-bit KV cache support.

Why it matters

It ties model architecture, activation precision, kernels, and KV cache into one inference-efficiency story.

Local inference takeaway

Useful when evaluating vendor claims around FP4/INT4 acceleration versus actual model compatibility.

arXiv / PDF / HF

Serving / Advanced
2024 / 2401.08671

DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference

Connor Holmes, Masahiro Tanaka, Michael Wyatt, Yuxiong He

A serving system centered on Dynamic SplitFuse for better long-prompt throughput and lower token-level tail latency.

Why it matters

It continues the trend of treating prompt processing and token generation as separate resources to schedule.

Local inference takeaway

A useful reference for agent workloads with long prompts, repeated context, and latency-sensitive generation.

arXiv / PDF / HF

Serving / Advanced
2024 / 2401.15077

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang

A speculative sampling framework that predicts at the feature level rather than only at the token level.

Why it matters

It shows how decoding acceleration is becoming model-aware instead of purely runtime-side.

Local inference takeaway

Useful when comparing native speculative support in inference engines and model releases.

arXiv / PDF / HF

Quantization / Advanced
2024 / 2401.06118

Extreme Compression of Large Language Models via Additive Quantization

Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Dan Alistarh

AQLM applies learned additive quantization to push LLM compression into the 2-3 bit range with practical CPU and GPU implementations.

Why it matters

It maps the frontier where smaller model files start competing with higher-bit quantization on the same hardware budget.

Local inference takeaway

Useful for builders choosing between a smaller dense model and an aggressively compressed larger model.

arXiv / PDF / HF

KV Cache / Advanced
2024 / 2402.05099

Hydragen: High-Throughput LLM Inference with Shared Prefixes

Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Azalia Mirhoseini

A hardware-aware exact attention implementation that separates shared prefixes from unique suffixes.

Why it matters

It targets the common serving pattern where many requests share the same instruction or retrieved context.

Local inference takeaway

Shared-prefix workloads need different benchmarks than one-off chat prompts.

arXiv / PDF / HF

KV Cache / Advanced
2024 / 2402.02750

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Zirui Liu, Jiayi Yuan, Hongye Jin, Xia Hu

A 2-bit KV cache quantization method that quantizes the key cache per channel and the value cache per token.

Why it matters

It offers a concrete path for increasing batch size and context length without changing model weights.

Local inference takeaway

KV cache precision should be part of any serious local inference benchmark at long context.
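
The choice of grouping axis matters a great deal at 2 bits. The sketch below is a toy comparison, not KIVI's implementation. It applies plain asymmetric min/max quantization to a small hypothetical key matrix (rows are tokens, columns are channels) in which one channel has a much larger magnitude, once grouping per token and once per channel. All names and values are made up for illustration.

```python
def quantize_group(xs, bits=2):
    """Asymmetric min/max quantization of one group; returns dequantized values."""
    lo, hi = min(xs), max(xs)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [round((x - lo) / scale) * scale + lo for x in xs]


def quantize_per_token(matrix, bits=2):
    """One scale and zero point per row (token)."""
    return [quantize_group(row, bits) for row in matrix]


def quantize_per_channel(matrix, bits=2):
    """One scale and zero point per column (channel)."""
    columns = [quantize_group(list(col), bits) for col in zip(*matrix)]
    return [list(row) for row in zip(*columns)]


def max_abs_error(a, b):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Hypothetical keys: 4 tokens x 4 channels, channel 2 is an outlier channel.
keys = [
    [0.1, 0.2, 10.0, -0.1],
    [0.2, 0.1, 12.0, 0.0],
    [0.0, 0.3, 11.0, 0.1],
    [0.3, 0.0, 13.0, -0.2],
]

per_channel_err = max_abs_error(keys, quantize_per_channel(keys))
per_token_err = max_abs_error(keys, quantize_per_token(keys))
```

Grouping per channel keeps the outlier channel from stretching the scale of its neighbours, so the error on this toy matrix is far smaller than with per-token grouping.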

arXiv / PDF / HF

KV Cache / Advanced
2024 / 2401.18079

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Amir Gholami

A KV cache quantization approach with per-channel key quantization, pre-RoPE quantization, and non-uniform datatypes.

Why it matters

It targets one of the largest costs in long-context inference: storing attention keys and values.

Local inference takeaway

Context length claims are only credible if the KV cache footprint and decode speed are accounted for.
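
The KV cache footprint follows directly from the model shape. The sketch below is a plain size estimate, not KVQuant's method. It assumes a grouped-query model with 32 layers, 8 KV heads, and head dimension 128 (an 8B-class shape used here as an assumption), and it ignores the extra bytes that quantization scales and zero points would add.

```python
GIB = 2 ** 30


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bits_per_value, batch=1):
    """Bytes needed to store keys and values for the given context.

    The factor 2 covers keys plus values. Quantization metadata
    (scales, zero points) is ignored in this estimate.
    """
    values = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch
    return values * bits_per_value / 8


# Assumed 8B-class GQA shape at a 128K-token context.
SHAPE = dict(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=131072)

fp16_gib = kv_cache_bytes(bits_per_value=16, **SHAPE) / GIB
int4_gib = kv_cache_bytes(bits_per_value=4, **SHAPE) / GIB
int2_gib = kv_cache_bytes(bits_per_value=2, **SHAPE) / GIB
```

For this shape the FP16 cache alone is as large as the FP16 weights of an 8B model, which is why long-context claims need the cache precision stated alongside the weight format.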

arXiv / PDF / HF

Kernels / Advanced
2024 / 2408.11743

MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models

Elias Frantar, Roberto L. Castro, Jiale Chen, Dan Alistarh

A mixed-precision kernel design for keeping 4-bit weight inference fast across useful batch sizes.

Why it matters

It bridges quantized model files and actual GPU speed, which is where many local benchmarks diverge.

Local inference takeaway

A quantized model is only as fast as the kernels that can consume its packed weights efficiently.

arXiv / PDF / HF

Serving / Intermediate
2024 / 2401.10774

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Tianle Cai, Yuhong Li, Zhengyang Geng, Tri Dao

Adds auxiliary decoding heads to a language model so it can propose multiple future tokens in parallel.

Why it matters

Medusa is a practical alternative when maintaining a separate draft model is inconvenient.

Local inference takeaway

Great for understanding why some accelerated models require modified checkpoints, not just a faster server.

arXiv / PDF / HF

Quantization / Advanced
2024 / 2404.00456

QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs

Saleh Ashkboos, Amirkeivan Mohtashami, Torsten Hoefler, James Hensman

A rotation-based quantization method that removes hidden-state outliers and quantizes weights, activations, and KV cache.

Why it matters

End-to-end 4-bit inference is the direction hardware vendors are pushing with FP4 and INT4 kernels.

Local inference takeaway

Good background for why next-gen GPUs may make activation and KV-cache precision just as important as weight precision.

arXiv / PDF / HF

Serving / Advanced
2024 / 2403.02310

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

Amey Agrawal, Nitin Kedia, Ashish Panwar, Ramachandran Ramjee

A serving system that improves tail latency and throughput with chunked prefills and stall-free scheduling.

Why it matters

It shows why benchmark numbers need workload context: single-user speed and serving capacity are not the same metric.

Local inference takeaway

Relevant when moving from personal local inference to a small shared GPU server.
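
Chunked prefill can be pictured as a per-iteration token budget shared between running decodes and a slice of a new prompt. The sketch below is a toy planner, not Sarathi-Serve's scheduler. It assumes one pending prompt, a fixed number of running decode requests that each need one token per iteration, and a hypothetical budget.

```python
def plan_iterations(prefill_tokens, n_decodes, token_budget):
    """Split one long prefill into chunks that ride along with decodes.

    Each iteration first reserves one token per running decode, then
    fills the rest of the budget with the next prefill chunk.
    """
    if token_budget <= n_decodes:
        raise ValueError("budget leaves no room for prefill work")
    chunk = token_budget - n_decodes
    plan = []
    remaining = prefill_tokens
    while remaining > 0:
        take = min(chunk, remaining)
        plan.append({"decode_tokens": n_decodes, "prefill_tokens": take})
        remaining -= take
    return plan


# Hypothetical load: a 1000-token prompt arrives while 6 chats are decoding.
plan = plan_iterations(prefill_tokens=1000, n_decodes=6, token_budget=256)
```

The six running chats receive a token in every iteration of the plan instead of stalling until the whole 1000-token prompt has been processed.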

arXiv / PDF / HF

Quantization / Intermediate
2024 / 2402.17764

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

Shuming Ma, Hongyu Wang, Lingxiao Ma, Furu Wei

Introduces BitNet b1.58, a ternary-weight LLM direction targeting lower memory, energy, and latency.

Why it matters

This is a hardware-relevant signal for where ultra-low-bit inference and specialized accelerators may go.

Local inference takeaway

Not a drop-in GGUF replacement today, but important for understanding future FP4, INT4, and 1-bit hardware claims.

arXiv / PDF / HF

Kernels / Advanced
2023 / 2312.11918

A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library

Ganesh Bikshandi, Jay Shah

A detailed look at implementing FlashAttention-2 on Hopper with CUTLASS, TMA, WGMMA, and fused CUDA kernels.

Why it matters

It makes Hopper-specific performance work tangible instead of treating H100-class speedups as magic.

Local inference takeaway

Architecture-specific kernel support is a real buying consideration for workstation and datacenter GPUs.

arXiv / PDF / HF

Serving / Intermediate
2023 / 2302.01318

Accelerating Large Language Model Decoding with Speculative Sampling

Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, John Jumper

A speculative sampling method that accelerates autoregressive decoding while preserving sample quality.

Why it matters

It helps distinguish exact acceleration from shortcuts that alter model behavior.

Local inference takeaway

Worth reading before assuming a faster runtime is doing the same work as a slower one.
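
The sense in which speculative sampling does "the same work" can be checked on a toy vocabulary. The sketch below shows the standard accept-or-resample rule for a single token, with hypothetical target probabilities `p` and draft probabilities `q`. It is a minimal illustration and leaves out multi-token drafts and batching.

```python
import random


def accept_probability(p, q, token):
    """Probability of keeping a drafted token: min(1, p/q)."""
    return min(1.0, p[token] / q[token])


def residual_distribution(p, q):
    """Distribution to resample from after a rejection: normalized max(0, p - q)."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    return [r / total for r in raw] if total > 0 else list(p)


def speculative_step(p, q, rng=random):
    """Sample one token distributed according to p, using a draft from q."""
    draft = rng.choices(range(len(q)), weights=q)[0]
    if rng.random() < accept_probability(p, q, draft):
        return draft
    residual = residual_distribution(p, q)
    return rng.choices(range(len(p)), weights=residual)[0]


# Hypothetical 3-token vocabulary.
p = [0.5, 0.3, 0.2]  # target model
q = [0.2, 0.5, 0.3]  # draft model

accept = [accept_probability(p, q, t) for t in range(len(p))]
residual = residual_distribution(p, q)
reject_mass = 1.0 - sum(qi * a for qi, a in zip(q, accept))
# Probability of emitting each token: accepted draft plus resampled residual.
emitted = [q[t] * accept[t] + reject_mass * residual[t] for t in range(len(p))]
```

The `emitted` probabilities work out to the target distribution `p`, which is what distinguishes this rule from shortcuts that change what the model outputs.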

arXiv / PDF / HF

Quantization / Starter
2023 / 2306.00978

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Song Han

A hardware-friendly weight-only quantization method that protects salient channels based on activation statistics.

Why it matters

AWQ is widely used in serving stacks because it balances quality, speed, and simple deployment.

Local inference takeaway

A practical paper for understanding why some 4-bit models are much faster than others on the same GPU.

arXiv / PDF / HF

Serving / Starter
2023 / 2309.06180

Efficient Memory Management for Large Language Model Serving with PagedAttention

Woosuk Kwon, Zhuohan Li, Ying Sheng, Ion Stoica

The vLLM paper introducing PagedAttention, a virtual-memory-style approach for managing dynamic KV cache memory.

Why it matters

It explains why serving throughput can improve without changing the model weights or the GPU.

Local inference takeaway

If you are running many concurrent chats, memory management can matter as much as raw tok/s.
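
The virtual-memory analogy can be made concrete with a toy block table. The sketch below is not vLLM's API or implementation. It assumes a fixed pool of equally sized KV blocks and hands them to sequences on demand, so a sequence wastes at most one partly filled block. The class and method names are hypothetical.

```python
class PagedKVAllocator:
    """Toy block-table allocator: KV slots are handed out one block at a time."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        n = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % self.block_size == 0:  # current block is full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=16)
slots = [alloc.append_token("chat-a") for _ in range(17)]
```

Seventeen tokens occupy two blocks rather than a contiguous region sized for the maximum context, and the blocks go back to the pool as soon as the chat ends.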

arXiv / PDF / HF

KV Cache / Intermediate
2023 / 2309.17453

Efficient Streaming Language Models with Attention Sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han

Introduces StreamingLLM, which keeps attention sink tokens plus a rolling window to support long streaming contexts.

Why it matters

It explains a simple but powerful way to make finite-context models behave better in long conversations.

Local inference takeaway

For chat workloads, retaining the right cache entries can beat blindly growing context until VRAM runs out.
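
The retention rule described above, a few initial sink tokens plus a rolling window of recent tokens, fits in one function. The sketch below only computes which cache positions to keep. The sink count and window size are hypothetical parameters, and details such as position re-indexing are left out.

```python
def streaming_keep_positions(seq_len, n_sink, window):
    """Positions kept in the KV cache: the first n_sink plus the last `window`."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))  # everything still fits
    sinks = list(range(n_sink))
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent


short_chat = streaming_keep_positions(seq_len=4, n_sink=2, window=3)
long_chat = streaming_keep_positions(seq_len=10, n_sink=2, window=3)
```

The cache size is capped at `n_sink + window` entries no matter how long the conversation runs.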

arXiv / PDF / HF

Kernels / Intermediate
2023 / 2307.08691

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Tri Dao

An improved FlashAttention implementation with better work partitioning, occupancy, and warp-level scheduling.

Why it matters

It shows how much performance can come from kernel implementation details rather than model architecture.

Local inference takeaway

Two GPUs with similar specs can feel different when the runtime exposes better attention kernels.

arXiv / PDF / HF

Kernels / Advanced
2023 / 2311.01282

FlashDecoding++: Faster Large Language Model Inference on GPUs

Ke Hong, Guohao Dai, Jiaming Xu, Yu Wang

A decoding-focused inference engine using asynchronous softmax, flat GEMM optimization, and hardware-adaptive dataflow.

Why it matters

Decode is the phase users feel as tokens per second, and this paper attacks decode-specific GPU underutilization.

Local inference takeaway

Good background for why tok/s rankings depend on decode kernels, batch size, and backend maturity.

arXiv / PDF / HF

Local Inference / Intermediate
2023 / 2303.06865

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

Ying Sheng, Lianmin Zheng, Binhang Yuan, Ion Stoica

A system for running very large models on limited hardware by coordinating GPU, CPU, and disk memory.

Why it matters

It is one of the clearest papers on offloading tradeoffs for memory-constrained inference.

Local inference takeaway

Offloading can make a model fit, but it should be treated as a throughput compromise, not free VRAM.

arXiv / PDF / HF

KV Cache / Advanced
2023 / 2306.14048

H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Beidi Chen

A KV cache eviction policy that keeps recent tokens and heavy hitters that contribute most to attention.

Why it matters

KV cache is often the hidden VRAM bill behind long-context local inference.

Local inference takeaway

Eviction policy can be a better answer than buying more VRAM for every long-context workload.
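
An eviction policy of this kind can be sketched as a selection over cache positions. The code below is a simplified one-shot version, not the paper's algorithm. It assumes an accumulated attention score per cached token is already available, keeps the most recent tokens unconditionally, and fills the rest of the budget with the highest-scoring older tokens. The scores are made up.

```python
def h2o_keep_positions(acc_scores, n_recent, n_heavy):
    """Keep the last n_recent positions plus the n_heavy top-scoring older ones.

    acc_scores[i] is the accumulated attention received by token i
    (assumed to be tracked elsewhere during decoding).
    """
    n = len(acc_scores)
    recent_start = max(0, n - n_recent)
    recent = list(range(recent_start, n))
    older = range(recent_start)
    heavy = sorted(older, key=lambda i: acc_scores[i], reverse=True)[:n_heavy]
    return sorted(heavy + recent)


# Hypothetical accumulated attention for a 6-token cache.
scores = [0.9, 0.1, 0.5, 0.05, 0.2, 0.3]
kept = h2o_keep_positions(scores, n_recent=2, n_heavy=2)
```

The cache budget is fixed at `n_recent + n_heavy` entries, so memory stays flat while the tokens that attract the most attention survive.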

arXiv / PDF / HF

Local Inference / Intermediate
2023 / 2312.11514

LLM in a flash: Efficient Large Language Model Inference with Limited Memory

Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Mehrdad Farajtabar

A hardware-aware method for running models larger than available DRAM by optimizing flash-memory transfers.

Why it matters

It addresses the same pain as local Mac and laptop inference: large models, constrained memory, and slow storage paths.

Local inference takeaway

Model fit is not binary; memory hierarchy determines whether a barely fitting setup is usable or painful.

arXiv / PDF / HF

Compression / Intermediate
2023 / 2305.11627

LLM-Pruner: On the Structural Pruning of Large Language Models

Xinyin Ma, Gongfan Fang, Xinchao Wang

A structural pruning method that removes coupled components and recovers performance with lightweight tuning.

Why it matters

Structural pruning is more deployment-friendly than arbitrary sparsity because hardware can exploit smaller dense shapes.

Local inference takeaway

Useful when evaluating smaller derivative models versus quantized originals.

arXiv / PDF / HF

Long Context / Intermediate
2023 / 2309.12307

LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models

Yukang Chen, Shengju Qian, Haotian Tang, Jiaya Jia

A parameter-efficient fine-tuning method for extending context length while keeping dense global attention at inference time.

Why it matters

It connects training-time context extension tricks to inference-time memory and attention costs.

Local inference takeaway

A model advertising longer context still needs GPU memory and kernels that make that context usable.

arXiv / PDF / HF

Local Inference / Starter
2023 / 2312.12456

PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen

A consumer-GPU inference engine that exploits activation locality and CPU-GPU hybrid execution.

Why it matters

It is directly relevant to GPU Hunter visitors trying to make one desktop GPU behave like a serious inference box.

Local inference takeaway

A 4090-class card can punch above its VRAM limit when the runtime is designed around model sparsity and locality.

arXiv / PDF / HF

KV Cache / Intermediate
2023 / 2311.04934

Prompt Cache: Modular Attention Reuse for Low-Latency Inference

In Gim, Guojun Chen, Seung-seob Lee, Lin Zhong

A method for reusing attention states across prompts that share system messages, templates, or retrieved documents.

Why it matters

Many real apps repeatedly send the same context, making prompt caching a revenue-relevant latency optimization.

Local inference takeaway

If your app has fixed prompts or repeated documents, caching may beat buying a faster GPU.
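
The reuse idea can be sketched with a toy cache keyed by prompt segment. Everything here is hypothetical: `PrefixStateCache` and `fake_encode` are illustrative names, the "state" is a stand-in list rather than real KV tensors, and the position handling that the paper addresses is ignored.

```python
import hashlib


class PrefixStateCache:
    """Toy cache: maps a reusable prompt segment to its (fake) attention state."""

    def __init__(self, encode_fn):
        self._encode_fn = encode_fn
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_state(self, segment: str):
        key = hashlib.sha256(segment.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._encode_fn(segment)
        return self._store[key]


def fake_encode(segment: str):
    # Stand-in for the expensive prefill pass; a real engine would return KV tensors.
    return [len(word) for word in segment.split()]


cache = PrefixStateCache(fake_encode)
system_prompt = "You are a helpful assistant for GPU buyers."
for question in ("Is 16GB enough?", "Which card for 70B?", "Is 16GB enough?"):
    cache.get_state(system_prompt)  # shared segment: computed once, then reused
    cache.get_state(question)       # per-request segment
print(f"hits={cache.hits} misses={cache.misses}")
```

The shared system prompt is encoded once and served from the cache on every later request, which is the saving the takeaway refers to.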

arXiv PDF HF
Quantization / Starter
2023 / 2305.14314

QLoRA: Efficient Finetuning of Quantized LLMs

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer

A finetuning recipe that backpropagates through a frozen 4-bit model into LoRA adapters using NF4 and paged optimizers.

Why it matters

It connects quantization to practical fine-tuning, which is where many local AI builders hit VRAM limits first.

Local inference takeaway

If a GPU can barely fit a base model for inference, QLoRA explains how adapter training can still be possible.
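
A quick count shows why adapter training is so much lighter than full fine-tuning. The layer size and rank below are hypothetical values picked for illustration; only the low-rank parameter formula itself comes from how LoRA adapters are built.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


d_in = d_out = 4096  # hypothetical projection size
rank = 16            # hypothetical adapter rank
full = d_in * d_out
adapter = lora_trainable_params(d_in, d_out, rank)
print(f"frozen weights: {full:,}  trainable adapter: {adapter:,} ({adapter / full:.2%})")
```

Only the adapter needs gradients and optimizer state, while the frozen base weights can stay in 4-bit form.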

arXiv PDF HF
Compression / Advanced
2023 / 2310.04564

ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models

Iman Mirzadeh, Keivan Alizadeh, Sachin Mehta, Mehrdad Farajtabar

A study arguing that ReLU-style activation sparsity can reduce inference computation with limited performance tradeoff.

Why it matters

It connects architecture choice to runtime efficiency, not just model quality.

Local inference takeaway

Future local-friendly models may be designed for sparse inference from the start, not compressed after training.

arXiv PDF HF
Long Context / Advanced
2023 / 2310.01889

Ring Attention with Blockwise Transformers for Near-Infinite Context

Hao Liu, Matei Zaharia, Pieter Abbeel

A blockwise attention method that distributes long sequences across devices while overlapping communication and compute.

Why it matters

It explains the multi-device side of long-context inference and training.

Local inference takeaway

Very long context eventually becomes a distributed systems problem, not just a bigger-GPU problem.

arXiv PDF HF
Serving / Intermediate
2023 / 2308.16369

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

Amey Agrawal, Ashish Panwar, Jayashree Mohan, Ramachandran Ramjee

A scheduling approach that chunks prefills and piggybacks decode requests to improve GPU utilization.

Why it matters

It treats prefill and decode as distinct hardware workloads (prefill is typically compute-bound, decode memory-bound), which helps explain real-world benchmark variance.

Local inference takeaway

Prompt length and batching policy can change throughput even when GPU, model, and quantization stay fixed.
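
The scheduling idea can be sketched as a toy batch builder. This is a simplified illustration rather than the SARATHI scheduler itself: `hybrid_batches` is a hypothetical helper, and the prompt length, chunk size, and decode count are made-up numbers.

```python
def hybrid_batches(prompt_len, chunk_size, num_decodes):
    """Return (prefill_tokens, decode_tokens) per iteration for one long prompt.

    Each iteration carries one prefill chunk plus one token for every
    in-flight decode request, so decodes are not stalled by the prefill.
    """
    batches = []
    remaining = prompt_len
    while remaining > 0:
        chunk = min(chunk_size, remaining)
        batches.append((chunk, num_decodes))
        remaining -= chunk
    return batches


# Hypothetical numbers: a 1,000-token prompt, 256-token chunks, 4 active decodes.
for prefill, decode in hybrid_batches(1000, 256, 4):
    print(f"prefill chunk: {prefill:>3} tokens, piggybacked decodes: {decode}")
```

Changing the chunk size changes how many iterations the prompt occupies, which is one reason the same GPU and model can post different throughput numbers.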

arXiv PDF HF
Compression / Intermediate
2023 / 2301.00774

SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot

Elias Frantar, Dan Alistarh

A one-shot pruning method for large GPT-family models that can remove many weights with limited accuracy loss.

Why it matters

Pruning is a separate compression axis from quantization and can combine with low-bit weights.

Local inference takeaway

Sparse weights only help local users when the runtime and GPU kernels can exploit the sparsity pattern.
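
One pattern that GPU kernels can exploit is 2:4 structured sparsity, where two of every four consecutive weights are zero. The sketch below uses plain magnitude pruning to produce that pattern; it is not SparseGPT's second-order method, and the weights are made up.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude values in every group of four weights."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Keep the indices of the two largest magnitudes in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
print(prune_2_of_4(weights))
```

An unstructured mask with the same number of zeros would save storage on paper but would not map onto the same hardware fast path.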

arXiv PDF HF
Serving / Advanced
2023 / 2305.09781

SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification

Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Zhihao Jia

A serving system that organizes speculative predictions into token trees and verifies candidates in parallel.

Why it matters

It moves speculative decoding from a simple algorithm to a serving-system design.

Local inference takeaway

Tree verification tends to matter more for batched or distributed inference than for single-user desktop chat.

arXiv PDF HF
Quantization / Advanced
2023 / 2306.03078

SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression

Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Dan Alistarh

A compressed representation that isolates outlier weights while quantizing the rest to low bit widths.

Why it matters

It shows the accuracy tradeoffs behind near-lossless 3-4 bit compression, especially for smaller local models.

Local inference takeaway

Helpful when comparing memory savings against runtime complexity in consumer-GPU inference engines.

arXiv PDF HF
Quantization / Advanced
2023 / 2306.07629

SqueezeLLM: Dense-and-Sparse Quantization

Sehoon Kim, Coleman Hooper, Amir Gholami, Kurt Keutzer

A post-training framework that combines non-uniform quantization with dense-and-sparse decomposition for low-bit LLMs.

Why it matters

It directly frames single-batch generation as memory-bandwidth bound, which is central to GPU Hunter's rankings.

Local inference takeaway

Read this to understand why reducing bytes moved can matter more than headline TFLOPS.
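
A back-of-envelope ceiling makes the point. The sketch assumes every weight is read from memory once per generated token and ignores KV cache traffic and kernel overhead; the 1000 GB/s bandwidth and 7B model size are hypothetical round numbers.

```python
def decode_tokens_per_sec_ceiling(bandwidth_gb_s, params_billions, bits_per_weight):
    """Upper bound on batch-1 decode speed if every weight is read once per token."""
    weight_gb = params_billions * bits_per_weight / 8  # billions of params x bytes each
    return bandwidth_gb_s / weight_gb


# Hypothetical card with 1000 GB/s of memory bandwidth running a 7B-parameter model.
for bits in (16, 4):
    ceiling = decode_tokens_per_sec_ceiling(1000, 7, bits)
    print(f"{bits:>2}-bit weights: at most ~{ceiling:.0f} tokens/s")
```

Nothing about the card's compute throughput appears in this bound, which is why cutting bytes per weight moves the ceiling directly.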

arXiv PDF HF
Serving / Advanced
2022 / 2207.00032

DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale

Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Yuxiong He

A large-scale inference system covering dense, sparse, multi-GPU, and heterogeneous CPU/NVMe offload scenarios.

Why it matters

It frames inference as a systems problem spanning model parallelism, memory tiers, and workload shape.

Local inference takeaway

Good context for why single-GPU inference is comparatively simple, while multi-GPU or offload stacks need serious scheduling work.

arXiv PDF HF
Serving / Advanced
2022 / 2211.05102

Efficiently Scaling Transformer Inference

Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jeff Dean

A systems paper on latency, model FLOPS utilization, partitioning, and inference efficiency for very large transformer models.

Why it matters

It gives a rigorous mental model for latency versus throughput decisions in production inference.

Local inference takeaway

Useful when local hardware choices start moving from one workstation to multi-accelerator setups.

arXiv PDF HF
Serving / Starter
2022 / 2211.17192

Fast Inference from Transformers via Speculative Decoding

Yaniv Leviathan, Matan Kalman, Yossi Matias

A decoding algorithm that uses a smaller approximation model to propose tokens and a larger model to verify them.

Why it matters

It is the simplest entry point into a major family of latency-reduction techniques.

Local inference takeaway

Speculative decoding can cut generation latency without changing the main model's output distribution.
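
The distribution-preserving claim can be checked on a toy vocabulary. The sketch computes, in closed form, the distribution of one token produced by the accept/reject rule (accept a draft token with probability min(1, p/q), otherwise resample from the normalized positive part of p - q). The two distributions are hypothetical.

```python
def speculative_output_distribution(target_p, draft_q):
    """Exact distribution of one token produced by the accept/reject rule.

    A draft token x ~ q is accepted with probability min(1, p(x) / q(x));
    on rejection a token is resampled from normalize(max(0, p - q)).
    """
    # Probability of drafting x and accepting it: q(x) * min(1, p(x)/q(x)) = min(p, q).
    accepted = [min(p, q) for p, q in zip(target_p, draft_q)]
    reject_prob = 1.0 - sum(accepted)
    residual = [max(0.0, p - q) for p, q in zip(target_p, draft_q)]
    residual_total = sum(residual)
    if residual_total == 0.0:  # draft already matches target; nothing is rejected
        return accepted
    return [a + reject_prob * r / residual_total for a, r in zip(accepted, residual)]


target = [0.6, 0.3, 0.1]  # hypothetical large-model distribution
draft = [0.3, 0.5, 0.2]   # hypothetical small-model distribution
result = speculative_output_distribution(target, draft)
print(all(abs(r - t) < 1e-12 for r, t in zip(result, target)))
```

The result matches the target distribution up to floating-point error even though the draft distribution is quite different, so a weaker draft model costs acceptance rate rather than output quality.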

arXiv PDF HF
Kernels / Starter
2022 / 2205.14135

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao, Daniel Y. Fu, Stefano Ermon, Christopher Ré

An IO-aware exact attention algorithm that reduces HBM traffic by tiling attention through SRAM.

Why it matters

It is the canonical paper for understanding why GPU memory movement dominates many transformer workloads.

Local inference takeaway

Long context needs memory-efficient kernels; VRAM capacity alone does not guarantee usable speed.
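
The core trick behind tiled exact attention is an online softmax that keeps a running maximum and running sum, so the full score row never has to be materialized. Below is a minimal single-query sketch in plain Python; it omits the usual 1/sqrt(d) scaling and all GPU specifics, and the function names and data are illustrative.

```python
import math


def naive_attention(q, keys, values):
    """Reference: materialize all scores, then softmax and weight the values."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def tiled_attention(q, keys, values, block_size):
    """Process keys/values block by block, keeping only running statistics."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5], [0.5, -0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values, block_size=2))
```

Both functions compute the same exact attention output; the tiled version only ever holds one block of scores at a time, which is what lets a real kernel keep the working set in on-chip SRAM.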

arXiv PDF HF
Quantization / Starter
2022 / 2210.17323

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh

A one-shot weight quantization method that uses approximate second-order information to compress large GPT-style models to 3-4 bits.

Why it matters

GPTQ made consumer-GPU LLM inference practical for many local users before newer GGUF and AWQ workflows became common.

Local inference takeaway

A 24GB card becomes far more useful when the model can be loaded at 4-bit without a major quality collapse.
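
A rough fit check shows the effect. The sketch counts weights only (no KV cache or activations) and assumes a grouped layout in which each group of 128 weights stores a 16-bit scale plus a 4-bit zero-point; that layout and the model sizes are illustrative assumptions, not figures from the GPTQ paper.

```python
def effective_bits(bits, group_size, scale_bits=16):
    """Bits per weight once each group's scale and zero-point are counted (assumed layout)."""
    return bits + (scale_bits + bits) / group_size


def weight_footprint_gb(params_billions, bits_per_weight):
    """Approximate weight storage in GB; KV cache and activations are not included."""
    return params_billions * bits_per_weight / 8


VRAM_GB = 24  # the card size named in the takeaway
for params in (13, 33):  # hypothetical model sizes, in billions of parameters
    fp16 = weight_footprint_gb(params, 16)
    q4 = weight_footprint_gb(params, effective_bits(4, 128))
    for label, gb in (("16-bit", fp16), ("4-bit grouped", q4)):
        verdict = "fits" if gb < VRAM_GB else "does not fit"
        print(f"{params}B {label}: {gb:.1f} GB of weights, {verdict} in {VRAM_GB} GB")
```

Weights that fit still leave the KV cache and runtime overhead to budget for, so treat a narrow margin as a warning rather than a green light.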

arXiv PDF HF
Quantization / Starter
2022 / 2208.07339

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer

A practical INT8 inference method that keeps outlier channels in higher precision while quantizing the bulk of transformer matrix multiplies.

Why it matters

It explains why naive 8-bit inference fails on large transformers and why outlier features need separate higher-precision handling.

Local inference takeaway

Useful background for bitsandbytes-style loading and for deciding when INT8 is a safe default over heavier 4-bit compression.
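
The outlier idea can be sketched on a single dot product: features whose magnitude crosses a threshold stay in floating point, and the rest go through absmax INT8. This is a simplified illustration, not the bitsandbytes implementation; the helper names and data are hypothetical, and the 6.0 threshold follows the value used in the paper.

```python
import math


def absmax_quantize(vec):
    """Symmetric INT8 quantization of one vector; returns (int values, scale)."""
    scale = max(abs(v) for v in vec) / 127 or 1.0  # fall back to 1.0 for an all-zero vector
    return [round(v / scale) for v in vec], scale


def mixed_precision_dot(x, w, threshold=6.0):
    """Dot product with outlier features kept in float and the rest in INT8."""
    outlier_idx = [i for i, v in enumerate(x) if abs(v) >= threshold]
    regular_idx = [i for i, v in enumerate(x) if abs(v) < threshold]
    high_precision = sum(x[i] * w[i] for i in outlier_idx)
    if not regular_idx:
        return high_precision
    xq, xs = absmax_quantize([x[i] for i in regular_idx])
    wq, ws = absmax_quantize([w[i] for i in regular_idx])
    int_acc = sum(a * b for a, b in zip(xq, wq))  # integer accumulation
    return high_precision + int_acc * xs * ws     # dequantize with both scales


x = [0.4, -1.2, 25.0, 0.9]  # one activation row with a single large outlier
w = [0.5, -0.3, 0.1, 0.7]
exact = sum(a * b for a, b in zip(x, w))
naive = mixed_precision_dot(x, w, threshold=math.inf)  # everything quantized to INT8
mixed = mixed_precision_dot(x, w)                      # outlier handled in float
print(f"naive INT8 error: {abs(naive - exact):.4f}")
print(f"mixed error:      {abs(mixed - exact):.4f}")
```

With the outlier included, the absmax scale is stretched so far that the small activations lose most of their resolution; splitting it out restores it.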

arXiv PDF HF
Quantization / Intermediate
2022 / 2211.10438

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Guangxuan Xiao, Ji Lin, Mickael Seznec, Song Han

A post-training quantization approach that moves activation outlier difficulty into weights so W8A8 inference can stay accurate.

Why it matters

It is a clean explanation of activation outliers, one of the core reasons LLM quantization is harder than ordinary model compression.

Local inference takeaway

Read this when comparing INT8, W8A8, and FP8 serving paths on NVIDIA workstation or datacenter GPUs.
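
The smoothing step can be shown on a toy matrix. The sketch applies the per-channel scale s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), divides the activations by it, and multiplies the weights by it, so the product is unchanged while the activation outlier shrinks. The data and alpha = 0.5 are illustrative, and no quantization is performed here.

```python
def smoothing_scales(X, W, alpha=0.5):
    """Per-input-channel scale s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha)."""
    scales = []
    for j in range(len(W)):
        act_max = max(abs(row[j]) for row in X)
        weight_max = max(abs(w) for w in W[j])
        scales.append(act_max ** alpha / weight_max ** (1 - alpha))
    return scales


def matmul(A, B):
    """Plain nested-list matrix multiply."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Toy activations (tokens x channels) with an outlier in channel 1,
# and weights (channels x outputs).
X = [[0.5, 40.0, -1.0], [-0.2, -35.0, 0.8]]
W = [[0.3, -0.2], [0.01, 0.02], [-0.5, 0.4]]

s = smoothing_scales(X, W)
X_smooth = [[x / s[j] for j, x in enumerate(row)] for row in X]
W_smooth = [[w * s[j] for w in W[j]] for j in range(len(W))]

print("largest activation before:", max(abs(x) for row in X for x in row))
print("largest activation after: ", max(abs(x) for row in X_smooth for x in row))
```

The smoothed activations have a much narrower range, which is what makes 8-bit activation quantization workable, while the weights absorb the difficulty.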

arXiv PDF HF