Serving / Intermediate
2026 / 2603.10031 Athos Georgiou
A production inference benchmark on AMD Instinct MI325X GPUs across large dense, MoE, MLA, and GQA model families using vLLM.
Why it matters
GPU Hunter needs more than NVIDIA-only assumptions; AMD serving performance depends heavily on architecture-aware configuration.
Local inference takeaway
For AMD datacenter GPUs, model architecture, KV offload support, runtime kernels, and block size choices can matter more than headline hardware specs.
Quantization / Advanced
2026 / 2601.07475 ARCQuant: Boosting NVFP4 Quantization with Augmented Residual Channels for LLMs
Haoqian Meng, Yilun Luo, Yafei Zhao, Wenyuan Liu, Peng Zhang, Xindian Ma
A post-training quantization method for NVFP4 that augments residual channels to reduce low-bit quantization error under hardware constraints.
Why it matters
It is directly tied to NVIDIA Blackwell's FP4 path, where format constraints shape what quantization methods can run fast.
Local inference takeaway
Blackwell-class FP4 gains will depend on quantization methods designed around NVFP4's real block and precision rules.
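As a rough illustration of why format rules matter, the sketch below rounds one block of values to the FP4 (E2M1) magnitude grid with a single shared scale. It assumes the commonly described NVFP4 layout of 16-element blocks and keeps the scale as a plain Python float instead of an FP8 value, so it is a simplification and not ARCQuant's method; `E2M1_GRID` and `fp4_block_quantize` are illustrative names.

```python
import math

# Magnitudes representable by a 4-bit E2M1 value (the sign is a separate bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def fp4_block_quantize(block):
    """Round a block to the E2M1 grid using one shared scale, then dequantize."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest grid value (6.0).
    scale = amax / 6.0 if amax > 0 else 1.0
    out = []
    for x in block:
        target = abs(x) / scale
        mag = min(E2M1_GRID, key=lambda g: abs(target - g))  # nearest grid point
        out.append(math.copysign(mag, x) * scale)
    return out

block = [6.0, 2.4, -1.4, 0.1] + [0.0] * 12  # one 16-element block
print(fp4_block_quantize(block))
```

The small entries collapse toward the few grid points near zero whenever one large value sets the block scale, which is the kind of error that format-aware quantization methods try to reduce.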
KV Cache / Starter
2026 / 2604.05012 Oteo Mamo, Olga Kogiou, Hyunjin Yi, Weikuan Yu
An empirical comparison of KV cache management frameworks including vLLM, InfiniGen, and H2O across latency, throughput, memory, request rates, and model sizes.
Why it matters
It gives buyers and builders a side-by-side view of when paging, offload, eviction, or sparse strategies actually help.
Local inference takeaway
The right KV cache strategy depends on workload shape; no one approach wins across every GPU, context length, and batch size.
KV Cache / Advanced
2026 / 2602.08005 Jitai Hao, Qiang Huang, Yaowei Wang, Min Zhang, Jun Yu
A residual KV cache compression framework that exploits long-range inter-token similarity and shared latent components.
Why it matters
Long context is increasingly bottlenecked by cache growth, and residual structure is another way to reduce memory movement.
Local inference takeaway
For agent and long-document workloads, cache compression quality may matter more than raw single-token decode speed.
Quantization / Intermediate
2026 / 2603.08747 Musa Cim, Burak Topcu, Mahmut Taylan Kandemir
A component-wise sensitivity study of NVFP4 and MXFP4 quantization across Qwen2.5 model scales and transformer blocks.
Why it matters
FP4 support is a major Blackwell and next-gen accelerator selling point, but not every layer tolerates the same low precision.
Local inference takeaway
Treat FP4 as a hardware capability that still needs model-aware quantization policy, not a universal speed switch.
KV Cache / Intermediate
2026 / 2604.04722 Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Patrick Woods, Gabriel Hillesheim, Abolfazl Razi
An adaptive KV-cache quantization approach for mobile, embedded, and edge LLM inference where memory bandwidth and cache growth dominate.
Why it matters
On-device inference cannot afford fixed precision everywhere; wasting bits directly reduces usable context and throughput.
Local inference takeaway
Small GPUs, laptops, and edge boxes need cache precision policies that spend memory only where the model is sensitive.
Kernels / Advanced
2026 / 2604.02556 Xiangbo Qi, Chaoyi Jiang, Murali Annavaram
A shared-memory kernel optimization for accelerating NF4 dequantization on NVIDIA GPUs during quantized LLM inference.
Why it matters
NF4 reduces model memory, but dequantization overhead can erase the win if the GPU kernel path is weak.
Local inference takeaway
Quantized model size is not enough; GPU Hunter benchmarks should care whether the backend has fast unpacking and dequant kernels.
Kernels / Intermediate
2026 / 2601.00227 FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems
Shanli Xing, Yiyan Zhai, Alexander Jiang, Yixin Dong, Yong Wu, Zihao Ye
A benchmark framework connecting GPU kernel definitions, workloads, implementations, and evaluations for real inference systems.
Why it matters
The next wave of inference optimization depends on faster iteration between kernels, benchmarking, and deployment.
Local inference takeaway
Kernel maturity is a hardware feature in practice; the same GPU can behave very differently across inference backends.
Serving / Intermediate
2026 / 2602.00328 Nikhil Gopal, Kostis Kaffes
A peer-to-peer GPU cache management framework that uses high-bandwidth GPU interconnects to reduce host-memory offload latency.
Why it matters
Multi-GPU inference is often limited by where model state and KV tensors live, not just aggregate VRAM on paper.
Local inference takeaway
For dual-GPU and workstation setups, interconnect bandwidth and cache placement can change whether extra GPUs actually help.
KV Cache / Starter
2026 / 2603.20397 Yichun Xu, Navjot K. Khaira, Tejinder Singh
A survey that organizes KV cache optimization into eviction, compression, hybrid memory, novel attention, and combined strategies.
Why it matters
It is a useful map of the current KV-cache field, which has become the core bottleneck for long-context inference.
Local inference takeaway
Use this to decide whether a workload needs cache compression, offload, eviction, or a hybrid policy before buying more VRAM.
Local Inference / Intermediate
2026 / 2605.23057 Aman Sunesh, Ali Alshehhi, Hivansh Dhakne
A single-GPU controller that routes each request across FP16, quantized, speculative, prefix-cached, and batched inference modes using cheap workload features.
Why it matters
It treats local inference as a dynamic operating problem instead of a one-time model loading choice.
Local inference takeaway
A desktop GPU can serve different prompts better when the runtime switches strategy per request instead of locking into one mode.
KV Cache / Advanced
2026 / 2605.17757 Zhongzhu Zhou, Donglin Zhuang, Jisen Li, Ziyan Chen
A 2-bit KV cache quantization method that derives offline rotations and clipping thresholds from attention-aware covariance structure.
Why it matters
INT2 KV cache is one of the most aggressive ways to stretch long context on limited VRAM, but only if accuracy holds.
Local inference takeaway
Long-context claims on consumer GPUs will increasingly depend on KV-cache-specific quantization, not just weight quantization.
KV Cache / Advanced
2026 / 2604.19157 Jinda Jia, Jisen Li, Zhongzhu Zhou, Jung Hwan Heo, Jue Wang, Tri Dao
A practical INT4 KV-cache method designed around paged layouts, regular memory access, and fused attention execution.
Why it matters
Many KV compression papers look good offline but fail when the serving engine needs predictable memory access and fast kernels.
Local inference takeaway
For vLLM-style serving, deployability matters as much as compression ratio; the cache format has to fit the kernel path.
Serving / Intermediate
2026 / 2601.11580 Xiaoxuan Liu, Jiaxiang Yu, Jongseok Park, Ion Stoica, Alvin Cheung
A production-grade vLLM study of speculative decoding variants across workloads, model scales, and batch sizes.
Why it matters
Speculative decoding can look excellent in small research demos while underperforming under realistic batching and serving pressure.
Local inference takeaway
Use speculative decoding carefully: the speedup depends on workload, draft method, batch size, and engine implementation.
Local Inference / Starter
2026 / 2601.14277 Uygar Kurt
A unified empirical evaluation of llama.cpp quantization formats for Llama-3.1-8B-Instruct on commodity hardware.
Why it matters
Most local users choose between GGUF quantization formats before they understand the real quality and runtime tradeoffs.
Local inference takeaway
A GPU recommendation is incomplete without naming the quantization formats that are realistic for that VRAM tier.
Local Inference / Intermediate
2025 / 2506.20187 He Sun, Li Li, Mingjun Xiao, Chengzhong Xu
LeoAM uses adaptive hierarchical GPU-CPU-disk KV management, lightweight KV abstracts, compression, and pipelining for long context on one commodity GPU.
Why it matters
It targets the exact GPU Hunter user who wants long-context local inference without a datacenter card.
Local inference takeaway
A single desktop GPU can handle longer contexts when the runtime manages GPU, CPU, and disk tiers deliberately.
Quantization / Advanced
2025 / 2509.23202 Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit
A study of MXFP4 and NVFP4 post-training quantization that introduces Micro-Rotated-GPTQ and reports GPU kernel-backed speedups.
Why it matters
FP4 is marketed as the next inference leap, but this paper shows why format-specific quantization is required to unlock it.
Local inference takeaway
RTX 5090 and Blackwell FP4 claims should be judged against real FP4 kernels and accuracy, not just advertised tensor formats.
KV Cache / Advanced
2025 / 2511.01815 Konrad Staniszewski, Adrian Lancucki
A lightweight transform coder that compresses reusable KV caches using PCA-style decorrelation, adaptive quantization, and entropy coding.
Why it matters
Shared-prefix chat and coding workflows can accumulate stale caches that consume GPU memory or force recomputation.
Local inference takeaway
Persistent KV cache storage can become a product feature for coding agents and local chat apps, not just a runtime optimization.
KV Cache / Advanced
2025 / 2502.04420 Xing Li, Zeyu Xing, Yiming Li, Linping Qu, Hui-Ling Zhen, Wulong Liu
A mixed-precision KV-cache quantization framework that searches layer-wise key/value precision pairs under hardware-friendly constraints.
Why it matters
It shows why one KV precision setting rarely fits every layer or model, especially under long-context latency targets.
Local inference takeaway
KV cache quantization should be model-aware; uniform low-bit settings can waste quality or memory depending on the layer.
KV Cache / Intermediate
2025 / 2510.09665 Yuhan Liu, Yihua Cheng, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng
An open-source KV cache layer for extracting, storing, offloading, transferring, and reusing caches across vLLM and SGLang engines.
Why it matters
It reframes KV cache as a shared storage and communication layer rather than private state inside one inference engine.
Local inference takeaway
For hosted local-AI products, cache reuse and prefill-decode disaggregation can beat simply adding more GPUs.
Kernels / Intermediate
2025 / 2503.08311 Pol G. Recasens, Ferran Agullo, Yue Zhu, Chen Wang, Eun Kyung Lee, Olivier Tardieu
A GPU-level analysis showing large-batch LLM inference can remain DRAM-bandwidth bound even when conventional explanations call it compute-bound.
Why it matters
It reinforces GPU Hunter's central point: memory bandwidth and data movement often decide inference performance.
Local inference takeaway
Batch-size scaling should be benchmarked against memory bandwidth behavior, not inferred from TFLOPS alone.
Local Inference / Advanced
2025 / 2506.03296 Jiakun Fan, Yanglin Zhang, Xiangchen Li, Dimitrios S. Nikolopoulos
A hybrid CPU-GPU execution approach that overlaps CPU-offloaded KV and attention work with GPU execution during memory-bound decoding.
Why it matters
Constrained GPUs are the norm for local users, and naive offload often loses to PCIe and scheduling overhead.
Local inference takeaway
Offload is only useful when CPU work, GPU kernels, and transfers overlap; otherwise it just makes a barely fitting model slow.
Local Inference / Starter
2025 / 2508.08531 Afsara Benazir, Felix Xiaozhu Lin
A profiling study of Apple Silicon's unified memory architecture for on-device LLM inference under different quantization choices.
Why it matters
Apple Silicon competes on unified memory capacity rather than discrete-GPU VRAM, so it needs a different inference mental model.
Local inference takeaway
Mac recommendations should compare quantized throughput, memory pressure, and unified-memory behavior against CUDA GPUs.
Serving / Intermediate
2025 / 2510.18672 Qi Li, Junpan Wu, Xiang Liu, Yuxin Wang, Zeyu Li, Zhenheng Tang
An empirical study of reasoning model serving behavior, including memory fluctuations, stragglers, adaptive runtime, and optimization tradeoffs.
Why it matters
Reasoning models change the cost profile of inference because long outputs and variable thinking time stress serving systems.
Local inference takeaway
Quantization and speculative decoding can help reasoning workloads, but prefix caching and KV quantization may not always pay off.
Serving / Starter
2025 / 2504.19720 Ranran Zhen, Juntao Li, Yixin Ji, Zhenlin Yang, Tong Liu, Qingrong Xia
A broad survey of efficient LLM serving methods covering memory overhead, attention costs, batching, quantization, and system design.
Why it matters
It gives readers a current map before diving into specialized serving, kernel, and KV-cache papers.
Local inference takeaway
Use this as a 2025 baseline for the serving stack before comparing vLLM, SGLang, TensorRT-LLM, and local runtimes.
KV Cache / Advanced
2025 / 2504.19874 Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni
A near-optimal online vector quantization method used for KV-cache compression and inner-product-preserving low-bit representations.
Why it matters
TurboQuant became a major 2026 discussion point because KV cache memory is now the limiting factor for long-context serving.
Local inference takeaway
KV cache compression can expand usable context on the same GPU, but implementation quality determines whether the promise reaches local runtimes.
KV Cache / Advanced
2025 / 2508.10395 Aditya Tomar, Coleman Hooper, Minjae Lee, Haocheng Xi, Rishabh Tiwari, Wonjun Kang
A cache-rematerialization approach that trades extra computation for lower KV cache memory traffic and storage pressure.
Why it matters
Modern GPUs often have more compute growth than memory bandwidth growth, making recomputation attractive in the right regime.
Local inference takeaway
A faster GPU is not always the answer; sometimes spending compute to reduce memory movement is the better trade.
Quantization / Advanced
2024 / 2411.04965 BitNet a4.8: 4-bit Activations for 1-bit LLMs
Hongyu Wang, Shuming Ma, Furu Wei
Extends the 1-bit LLM line with 4-bit activations, sparsified intermediate states, and low-bit KV cache support.
Why it matters
It ties model architecture, activation precision, kernels, and KV cache into one inference-efficiency story.
Local inference takeaway
Useful when evaluating vendor claims around FP4/INT4 acceleration versus actual model compatibility.
Serving / Advanced
2024 / 2401.08671 DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
Connor Holmes, Masahiro Tanaka, Michael Wyatt, Yuxiong He
A serving system centered on Dynamic SplitFuse for better long-prompt throughput and lower token-level tail latency.
Why it matters
It continues the trend of treating prompt processing and token generation as separate resources to schedule.
Local inference takeaway
A useful reference for agent workloads with long prompts, repeated context, and latency-sensitive generation.
Serving / Advanced
2024 / 2401.15077 EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang
A speculative sampling framework that predicts at the feature level rather than only at the token level.
Why it matters
It shows how decoding acceleration is becoming model-aware instead of purely runtime-side.
Local inference takeaway
Useful when comparing native speculative support in inference engines and model releases.
Quantization / Advanced
2024 / 2401.06118 Extreme Compression of Large Language Models via Additive Quantization
Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Dan Alistarh
AQLM applies learned additive quantization to push LLM compression into the 2-3 bit range with practical CPU and GPU implementations.
Why it matters
It maps the frontier where a larger model compressed to 2-3 bits starts competing with a smaller model kept at higher precision on the same hardware budget.
Local inference takeaway
Useful for builders choosing between a smaller dense model and an aggressively compressed larger model.
KV Cache / Advanced
2024 / 2402.05099 Hydragen: High-Throughput LLM Inference with Shared Prefixes
Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Azalia Mirhoseini
A hardware-aware exact attention implementation that separates shared prefixes from unique suffixes.
Why it matters
It targets the common serving pattern where many requests share the same instruction or retrieved context.
Local inference takeaway
Shared-prefix workloads need different benchmarks than one-off chat prompts.
KV Cache / Advanced
2024 / 2402.02750 KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
Zirui Liu, Jiayi Yuan, Hongye Jin, Xia Hu
A 2-bit KV cache quantization method using different quantization layouts for key and value cache tensors.
Why it matters
It offers a concrete path for increasing batch size and context length without changing model weights.
Local inference takeaway
KV cache precision should be part of any serious local inference benchmark at long context.
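The "different layouts" idea can be sketched in a few lines: keys are grouped per channel (down the token axis) and values per token (across the channel axis), each group getting its own asymmetric 2-bit scale and zero point. This is a minimal pure-Python sketch, not the paper's kernels; the function names and the toy tensors are illustrative, and real implementations also keep a small full-precision window of recent tokens.

```python
def quantize_group(xs, bits=2):
    """Asymmetric uniform quantization of one group to integer codes."""
    lo, hi = min(xs), max(xs)
    levels = (1 << bits) - 1                 # 3 steps for 2-bit codes 0..3
    scale = (hi - lo) / levels or 1.0        # avoid a zero scale for constant groups
    codes = [round((x - lo) / scale) for x in xs]
    return codes, scale, lo

def dequantize_group(codes, scale, lo):
    return [c * scale + lo for c in codes]

def quantize_keys_per_channel(keys):
    """keys[token][channel]: one group per channel, spanning all tokens."""
    return [quantize_group(list(col)) for col in zip(*keys)]

def quantize_values_per_token(values):
    """values[token][channel]: one group per token, spanning all channels."""
    return [quantize_group(row) for row in values]

# Toy key cache with one large-magnitude channel (channel 1).
keys = [[0.0, 100.0], [1.0, 130.0], [2.0, 160.0], [3.0, 190.0]]
for codes, scale, lo in quantize_keys_per_channel(keys):
    print(codes, dequantize_group(codes, scale, lo))
```

Grouping keys by channel keeps the large-magnitude channel from setting the scale for every other channel, which is the motivation usually given for treating keys and values differently.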
KV Cache / Advanced
2024 / 2401.18079 KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Amir Gholami
A KV cache quantization approach with per-channel key quantization, pre-RoPE quantization, and non-uniform datatypes.
Why it matters
It targets one of the largest costs in long-context inference: storing attention keys and values.
Local inference takeaway
Context length claims are only credible if the KV cache footprint and decode speed are accounted for.
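To make the footprint concrete, here is a back-of-the-envelope calculation. The model shape (32 layers, 8 KV heads, head dimension 128) and the 131072-token context are assumptions chosen for illustration, and the function ignores the scales and zero points a real quantized cache also stores.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bits=16):
    """Bytes for keys plus values across all layers (no quantization metadata)."""
    elements = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch  # 2 = K and V
    return elements * bits // 8

# Assumed shape: 32 layers, 8 KV heads (GQA), head_dim 128, 131072-token context.
for bits in (16, 4, 2):
    gib = kv_cache_bytes(32, 8, 128, 131072, bits=bits) / 2**30
    print(f"{bits}-bit KV cache: {gib:.1f} GiB")
```

Under these assumptions a single sequence needs 16 GiB of cache at 16-bit and 4 GiB at 4-bit, before any model weights are loaded.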
Kernels / Advanced
2024 / 2408.11743 MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models
Elias Frantar, Roberto L. Castro, Jiale Chen, Dan Alistarh
A mixed-precision kernel design for keeping 4-bit weight inference fast across useful batch sizes.
Why it matters
It bridges quantized model files and actual GPU speed, which is where many local benchmarks diverge.
Local inference takeaway
A quantized model is only as fast as the kernels that can consume its packed weights efficiently.
Serving / Intermediate
2024 / 2401.10774 Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Tianle Cai, Yuhong Li, Zhengyang Geng, Tri Dao
Adds auxiliary decoding heads to a language model so it can propose multiple future tokens in parallel.
Why it matters
Medusa is a practical alternative when maintaining a separate draft model is inconvenient.
Local inference takeaway
Great for understanding why some accelerated models require modified checkpoints, not just a faster server.
Quantization / Advanced
2024 / 2404.00456 QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs
Saleh Ashkboos, Amirkeivan Mohtashami, Torsten Hoefler, James Hensman
A rotation-based quantization method that removes hidden-state outliers and quantizes weights, activations, and KV cache.
Why it matters
End-to-end 4-bit inference is the direction hardware vendors are pushing with FP4 and INT4 kernels.
Local inference takeaway
Good background for why next-gen GPUs may make activation and KV-cache precision just as important as weight precision.
Serving / Advanced
2024 / 2403.02310 Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
Amey Agrawal, Nitin Kedia, Ashish Panwar, Ramachandran Ramjee
A serving system that improves tail latency and throughput with chunked prefills and stall-free scheduling.
Why it matters
It shows why benchmark numbers need workload context: single-user speed and serving capacity are not the same metric.
Local inference takeaway
Relevant when moving from personal local inference to a small shared GPU server.
Quantization / Intermediate
2024 / 2402.17764 The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
Shuming Ma, Hongyu Wang, Lingxiao Ma, Furu Wei
Introduces BitNet b1.58, a ternary-weight LLM direction targeting lower memory, energy, and latency.
Why it matters
This is a hardware-relevant signal for where ultra-low-bit inference and specialized accelerators may go.
Local inference takeaway
Not a drop-in GGUF replacement today, but important for understanding future FP4, INT4, and 1-bit hardware claims.
Kernels / Advanced
2023 / 2312.11918 A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library
Ganesh Bikshandi, Jay Shah
A detailed look at implementing FlashAttention-2 on Hopper with CUTLASS, TMA, WGMMA, and fused CUDA kernels.
Why it matters
It makes Hopper-specific performance work tangible instead of treating H100-class speedups as magic.
Local inference takeaway
Architecture-specific kernel support is a real buying consideration for workstation and datacenter GPUs.
Serving / Intermediate
2023 / 2302.01318 Accelerating Large Language Model Decoding with Speculative Sampling
Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, John Jumper
A speculative sampling method that accelerates autoregressive decoding while preserving sample quality.
Why it matters
It helps distinguish exact acceleration from shortcuts that alter model behavior.
Local inference takeaway
Worth reading before assuming a faster runtime is doing the same work as a slower one.
Quantization / Starter
2023 / 2306.00978 AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Ji Lin, Jiaming Tang, Haotian Tang, Song Han
A hardware-friendly weight-only quantization method that protects salient channels based on activation statistics.
Why it matters
AWQ is widely used in serving stacks because it balances quality, speed, and simple deployment.
Local inference takeaway
A practical paper for understanding why some 4-bit models are much faster than others on the same GPU.
Serving / Starter
2023 / 2309.06180 Efficient Memory Management for Large Language Model Serving with PagedAttention
Woosuk Kwon, Zhuohan Li, Ying Sheng, Ion Stoica
The vLLM paper introducing PagedAttention, a virtual-memory-style approach for managing dynamic KV cache memory.
Why it matters
It explains why serving throughput can improve without changing the model weights or the GPU.
Local inference takeaway
If you are running many concurrent chats, memory management can matter as much as raw tok/s.
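The virtual-memory analogy can be sketched as a block table: each sequence maps logical token positions to fixed-size physical blocks that are allocated on demand and returned when the request finishes. This is a toy allocator written for illustration, not vLLM's implementation; the class and method names are hypothetical.

```python
class PagedKVAllocator:
    """Toy block-table allocator for KV cache slots."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # physical block ids not in use
        self.tables = {}                     # seq_id -> list of physical block ids
        self.lengths = {}                    # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token; returns (physical_block, offset)."""
        table = self.tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:  # no block yet, or last one is full
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free list."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedKVAllocator(num_blocks=4, block_size=2)
print([alloc.append_token("chat-1") for _ in range(3)])
```

Because a sequence only holds the blocks it has actually filled, memory that a contiguous preallocation would reserve for the maximum length stays available to other requests.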
KV Cache / Intermediate
2023 / 2309.17453 Efficient Streaming Language Models with Attention Sinks
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han
Introduces StreamingLLM, which keeps attention sink tokens plus a rolling window to support long streaming contexts.
Why it matters
It explains a simple but powerful way to make finite-context models behave better in long conversations.
Local inference takeaway
For chat workloads, retaining the right cache entries can beat blindly growing context until VRAM runs out.
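A minimal sketch of the retention rule described above: keep the first few "sink" positions plus a rolling window of the most recent ones. The function name and the default sizes are illustrative assumptions, and the sketch covers only which cache entries survive, not how the runtime handles positions or attention.

```python
def streaming_keep_indices(cache_len, n_sink=4, window=1020):
    """Indices of cache entries to keep: initial sink tokens plus a recent window."""
    if cache_len <= n_sink + window:
        return list(range(cache_len))        # everything still fits
    sinks = list(range(n_sink))
    recent = list(range(cache_len - window, cache_len))
    return sinks + recent

print(streaming_keep_indices(10, n_sink=2, window=3))
```

With these toy settings a 10-token cache keeps positions 0, 1, 7, 8, and 9, so the cache size stays bounded no matter how long the conversation runs.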
Kernels / Intermediate
2023 / 2307.08691 FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Tri Dao
An improved FlashAttention implementation with better work partitioning, occupancy, and warp-level scheduling.
Why it matters
It shows how much performance can come from kernel implementation details rather than model architecture.
Local inference takeaway
Two GPUs with similar specs can feel different when the runtime exposes better attention kernels.
Kernels / Advanced
2023 / 2311.01282 FlashDecoding++: Faster Large Language Model Inference on GPUs
Ke Hong, Guohao Dai, Jiaming Xu, Yu Wang
A decoding-focused inference engine using asynchronous softmax, flat GEMM optimization, and hardware-adaptive dataflow.
Why it matters
Decode is the phase users feel as tokens per second, and this paper attacks decode-specific GPU underutilization.
Local inference takeaway
Good background for why tok/s rankings depend on decode kernels, batch size, and backend maturity.
Local Inference / Intermediate
2023 / 2303.06865 FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
Ying Sheng, Lianmin Zheng, Binhang Yuan, Ion Stoica
A system for running very large models on limited hardware by coordinating GPU, CPU, and disk memory.
Why it matters
It is one of the clearest papers on offloading tradeoffs for memory-constrained inference.
Local inference takeaway
Offloading can make a model fit, but it should be treated as a throughput compromise, not free VRAM.
KV Cache / Advanced
2023 / 2306.14048 H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Beidi Chen
A KV cache eviction policy that keeps recent tokens and heavy hitters that contribute most to attention.
Why it matters
KV cache is often the hidden VRAM bill behind long-context local inference.
Local inference takeaway
Eviction policy can be a better answer than buying more VRAM for every long-context workload.
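The policy described above (recent tokens plus heavy hitters) can be sketched as a selection over accumulated attention scores. This is a simplified illustration with hypothetical names; it assumes the runtime already tracks how much attention each cached token has received and that `budget` is at least `n_recent`.

```python
def h2o_keep(acc_scores, budget, n_recent):
    """Pick cache entries to keep: the newest n_recent plus the top-scoring older ones.

    acc_scores[i] is the accumulated attention received by cached token i.
    """
    n = len(acc_scores)
    if n <= budget:
        return list(range(n))                # under budget, nothing to evict
    recent = list(range(n - n_recent, n))
    older = range(n - n_recent)
    n_heavy = budget - n_recent
    heavy = sorted(older, key=lambda i: acc_scores[i], reverse=True)[:n_heavy]
    return sorted(heavy) + recent

scores = [0.9, 0.1, 0.8, 0.2, 0.3, 0.1]
print(h2o_keep(scores, budget=4, n_recent=2))
```

Here tokens 0 and 2 survive as heavy hitters alongside the two most recent tokens, while low-scoring tokens 1 and 3 are evicted.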
Local Inference / Intermediate
2023 / 2312.11514 LLM in a flash: Efficient Large Language Model Inference with Limited Memory
Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Mehrdad Farajtabar
A hardware-aware method for running models larger than available DRAM by optimizing flash-memory transfers.
Why it matters
It addresses the same pain as local Mac and laptop inference: large models, constrained memory, and slow storage paths.
Local inference takeaway
Model fit is not binary; memory hierarchy determines whether a barely fitting setup is usable or painful.
Compression / Intermediate
2023 / 2305.11627 LLM-Pruner: On the Structural Pruning of Large Language Models
Xinyin Ma, Gongfan Fang, Xinchao Wang
A structural pruning method that removes coupled components and recovers performance with lightweight tuning.
Why it matters
Structural pruning is more deployment-friendly than arbitrary sparsity because hardware can exploit smaller dense shapes.
Local inference takeaway
Useful when evaluating smaller derivative models versus quantized originals.
Long Context / Intermediate
2023 / 2309.12307 LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models
Yukang Chen, Shengju Qian, Haotian Tang, Jiaya Jia
A parameter-efficient fine-tuning method for extending context length while keeping dense global attention at inference time.
Why it matters
It connects training-time context extension tricks to inference-time memory and attention costs.
Local inference takeaway
A model advertising longer context still needs GPU memory and kernels that make that context usable.
Local Inference / Starter
2023 / 2312.12456 PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen
A consumer-GPU inference engine that exploits activation locality and CPU-GPU hybrid execution.
Why it matters
It is directly relevant to GPU Hunter visitors trying to make one desktop GPU behave like a serious inference box.
Local inference takeaway
A 4090-class card can punch above its VRAM limit when the runtime is designed around model sparsity and locality.
KV Cache / Intermediate
2023 / 2311.04934 Prompt Cache: Modular Attention Reuse for Low-Latency Inference
In Gim, Guojun Chen, Seung-seob Lee, Lin Zhong
A method for reusing attention states across prompts that share system messages, templates, or retrieved documents.
Why it matters
Many real apps repeatedly send the same context, making prompt caching a revenue-relevant latency optimization.
Local inference takeaway
If your app has fixed prompts or repeated documents, caching may beat buying a faster GPU.
Quantization / Starter
2023 / 2305.14314 QLoRA: Efficient Finetuning of Quantized LLMs
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer
A finetuning recipe that backpropagates through a frozen 4-bit model into LoRA adapters using NF4 and paged optimizers.
Why it matters
It connects quantization to practical fine-tuning, which is where many local AI builders hit VRAM limits first.
Local inference takeaway
If a GPU can barely fit a base model for inference, QLoRA explains how adapter training can still be possible.
Compression / Advanced
2023 / 2310.04564 ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models
Iman Mirzadeh, Keivan Alizadeh, Sachin Mehta, Mehrdad Farajtabar
A study arguing that ReLU-style activation sparsity can reduce inference computation with limited performance tradeoff.
Why it matters
It connects architecture choice to runtime efficiency, not just model quality.
Local inference takeaway
Future local-friendly models may be designed for sparse inference from the start, not compressed after training.
Long Context / Advanced
2023 / 2310.01889 Ring Attention with Blockwise Transformers for Near-Infinite Context
Hao Liu, Matei Zaharia, Pieter Abbeel
A blockwise attention method that distributes long sequences across devices while overlapping communication and compute.
Why it matters
It explains the multi-device side of long-context inference and training.
Local inference takeaway
Very long context eventually becomes a distributed systems problem, not just a bigger-GPU problem.
Serving / Intermediate
2023 / 2308.16369 SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
Amey Agrawal, Ashish Panwar, Jayashree Mohan, Ramachandran Ramjee
A scheduling approach that chunks prefills and piggybacks decode requests to improve GPU utilization.
Why it matters
It separates prefill and decode as different hardware workloads, which helps explain real-world benchmark variance.
Local inference takeaway
Prompt length and batching policy can change throughput even when GPU, model, and quantization stay fixed.
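One way to picture chunked prefill with piggybacked decodes is a per-iteration token budget: running decodes each take one token, and whatever budget remains is filled with a slice of a pending prompt. This is a simplified sketch of that idea, not the paper's scheduler; `build_batch` and its arguments are hypothetical.

```python
def build_batch(decode_ids, prefill_remaining, token_budget):
    """Plan one iteration as a list of (sequence_id, tokens_to_process).

    decode_ids: sequences currently generating (one token each per iteration).
    prefill_remaining: dict of sequence_id -> prompt tokens not yet processed.
    """
    batch = [(sid, 1) for sid in decode_ids[:token_budget]]
    budget = token_budget - len(batch)
    for sid, remaining in prefill_remaining.items():
        if budget == 0:
            break
        take = min(remaining, budget)        # chunk the prompt to fit the budget
        batch.append((sid, take))
        budget -= take
    return batch

print(build_batch(["d1", "d2"], {"p1": 10}, token_budget=6))
```

In this toy run the two decodes proceed without waiting, and only 4 of the 10 prompt tokens are processed in this iteration, leaving the rest for later chunks.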
Compression / Intermediate
2023 / 2301.00774 SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
Elias Frantar, Dan Alistarh
A one-shot pruning method for large GPT-family models that can remove many weights with limited accuracy loss.
Why it matters
Pruning is a separate compression axis from quantization and can combine with low-bit weights.
Local inference takeaway
Sparse weights only help local users when the runtime and GPU kernels can exploit the sparsity pattern.
Serving / Advanced
2023 / 2305.09781 SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification
Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Zhihao Jia
A serving system that organizes speculative predictions into token trees and verifies candidates in parallel.
Why it matters
It moves speculative decoding from a simple algorithm to a serving-system design.
Local inference takeaway
Tree verification can matter for batched or distributed inference more than for one-user desktop chat.
Quantization / Advanced
2023 / 2306.03078 SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression
Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Dan Alistarh
A compressed representation that isolates outlier weights while quantizing the rest to low bit widths.
Why it matters
It shows the accuracy tradeoffs behind near-lossless 3-4 bit compression, especially for smaller local models.
Local inference takeaway
Helpful when comparing memory savings against runtime complexity in consumer-GPU inference engines.
Quantization / Advanced
2023 / 2306.07629 SqueezeLLM: Dense-and-Sparse Quantization
Sehoon Kim, Coleman Hooper, Amir Gholami, Kurt Keutzer
A post-training framework that combines non-uniform quantization with dense-and-sparse decomposition for low-bit LLMs.
Why it matters
It directly frames single-batch generation as memory-bandwidth bound, which is central to GPU Hunter's rankings.
Local inference takeaway
Read this to understand why reducing bytes moved can matter more than headline TFLOPS.
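The memory-bound argument reduces to simple arithmetic: if every generated token has to stream all weights from VRAM once, bandwidth divided by weight bytes gives a ceiling on single-stream decode speed. The parameter count and bandwidth below are hypothetical round numbers, and the estimate ignores KV cache reads, activations, and kernel overheads.

```python
def decode_tok_per_s_upper_bound(n_params, bits_per_weight, mem_bandwidth_gb_s):
    """Ceiling on batch-1 decode speed when each token reads all weights once."""
    weight_bytes = n_params * bits_per_weight / 8
    return mem_bandwidth_gb_s * 1e9 / weight_bytes

# Hypothetical 8B-parameter model on a GPU with 1000 GB/s of memory bandwidth.
for bits in (16, 8, 4):
    print(bits, "bit:", decode_tok_per_s_upper_bound(8e9, bits, 1000), "tok/s at most")
```

Under these assumptions the ceiling rises from 62.5 tok/s at 16-bit to 250 tok/s at 4-bit without any change in compute throughput.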
Serving / Advanced
2022 / 2207.00032 DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Yuxiong He
A large-scale inference system covering dense, sparse, multi-GPU, and heterogeneous CPU/NVMe offload scenarios.
Why it matters
It frames inference as a systems problem spanning model parallelism, memory tiers, and workload shape.
Local inference takeaway
Good context for why a single GPU is simple but multi-GPU or offload stacks need serious scheduling work.
Serving / Advanced
2022 / 2211.05102 Efficiently Scaling Transformer Inference
Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jeff Dean
A systems paper on latency, model FLOPS utilization, partitioning, and inference efficiency for very large transformer models.
Why it matters
It gives a rigorous mental model for latency versus throughput decisions in production inference.
Local inference takeaway
Useful when local hardware choices start moving from one workstation to multi-accelerator setups.
Serving / Starter
2022 / 2211.17192 Fast Inference from Transformers via Speculative Decoding
Yaniv Leviathan, Matan Kalman, Yossi Matias
A decoding algorithm that uses a smaller approximation model to propose tokens and a larger model to verify them.
Why it matters
It is the simplest entry point into a major family of latency-reduction techniques.
Local inference takeaway
Speculative decoding can increase perceived speed without changing the main model's output distribution.
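The reason the output distribution is preserved is the accept/reject rule applied to each drafted token: accept with probability min(1, p/q), and on rejection resample from the normalized positive part of p - q. The sketch below shows one such decision with toy probability tables; the function name and dictionary representation are illustrative, and a real engine works on logits for several drafted tokens at once.

```python
import random

def speculative_step(p_target, q_draft, draft_token, rng):
    """Accept or replace one drafted token.

    p_target, q_draft: dicts mapping token -> probability under the large
    (target) and small (draft) model. draft_token was sampled from q_draft.
    Returns (token, accepted).
    """
    p = p_target.get(draft_token, 0.0)
    q = q_draft[draft_token]
    if rng.random() < min(1.0, p / q):
        return draft_token, True
    # Rejected: resample from the residual distribution max(0, p - q), normalized.
    residual = {t: max(0.0, p_target[t] - q_draft.get(t, 0.0)) for t in p_target}
    total = sum(residual.values())
    tokens = list(residual)
    weights = [residual[t] / total for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0], False

rng = random.Random(0)
target = {"a": 0.7, "b": 0.3}
draft = {"a": 0.4, "b": 0.6}
print(speculative_step(target, draft, "b", rng))
```

When the draft model agrees closely with the target, most tokens are accepted and several can be verified in one forward pass of the large model; when it disagrees, the rule falls back to a corrected sample and the speedup shrinks.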
Kernels / Starter
2022 / 2205.14135 FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Tri Dao, Daniel Y. Fu, Stefano Ermon, Christopher Re
An IO-aware exact attention algorithm that reduces HBM traffic by tiling attention through SRAM.
Why it matters
It is the canonical paper for understanding why GPU memory movement dominates many transformer workloads.
Local inference takeaway
Long context needs memory-efficient kernels; VRAM capacity alone does not guarantee usable speed.
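The tiling trick rests on an online softmax: keys and values are consumed a tile at a time while a running maximum and running sum are rescaled, so the full score row never has to be materialized. The sketch below computes exact attention for a single query in pure Python; it illustrates the numerics only, not the GPU kernel, and the function name and tile size are illustrative.

```python
import math

def attention_row_tiled(q, keys, values, tile=2):
    """Exact softmax(q . K^T / sqrt(d)) @ V for one query, streaming K/V in tiles."""
    inv_sqrt_d = 1.0 / math.sqrt(len(q))
    m = -math.inf                    # running maximum of the scores seen so far
    l = 0.0                          # running sum of exp(score - m)
    acc = [0.0] * len(values[0])     # running weighted sum of value rows
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) * inv_sqrt_d for k in k_tile]
        m_new = max(m, max(scores))
        rescale = math.exp(m - m_new)          # 0.0 on the first tile (m = -inf)
        l *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - m_new)
            l += w
            acc = [a + w * x for a, x in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]

q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(attention_row_tiled(q, keys, values))
```

Only one tile of scores is alive at any time, which is what lets a GPU kernel keep the working set in on-chip SRAM instead of writing the full attention matrix to HBM.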
Quantization / Starter
2022 / 2210.17323 GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh
A one-shot weight quantization method that uses approximate second-order information to compress large GPT-style models to 3-4 bits.
Why it matters
GPTQ made consumer-GPU LLM inference practical for many local users before newer GGUF and AWQ workflows became common.
Local inference takeaway
A 24GB card becomes far more useful when the model can be loaded at 4-bit without a major quality collapse.
Quantization / Starter
2022 / 2208.07339 LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer
A practical INT8 inference method that keeps outlier channels in higher precision while quantizing the bulk of transformer matrix multiplies.
Why it matters
It explains why naive 8-bit inference fails on large transformers and why outliers matter for real hardware speedups.
Local inference takeaway
Useful background for bitsandbytes-style loading and for deciding when INT8 is a safe default over heavier 4-bit compression.
Quantization / Intermediate
2022 / 2211.10438 SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
Guangxuan Xiao, Ji Lin, Mickael Seznec, Song Han
A post-training quantization approach that moves activation outlier difficulty into weights so W8A8 inference can stay accurate.
Why it matters
It is a clean explanation of activation outliers, one of the core reasons LLM quantization is harder than ordinary model compression.
Local inference takeaway
Read this when comparing INT8, W8A8, and FP8 serving paths on NVIDIA workstation or datacenter GPUs.
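"Moving outlier difficulty into weights" can be shown with a per-channel rescaling that leaves the matrix product unchanged: each input channel of the activation is divided by a factor and the matching weight row is multiplied by it. The sketch below uses the smoothing factor from the paper, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha); the helper names and toy numbers are illustrative, and real calibration uses activation statistics gathered over sample data.

```python
def smooth_scales(act_absmax, weight_absmax, alpha=0.5):
    """Per-input-channel factor s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    return [a ** alpha / w ** (1 - alpha) for a, w in zip(act_absmax, weight_absmax)]

def apply_smoothing(x, weight, scales):
    """Divide activation channel j by s_j and multiply weight row j by s_j."""
    x_s = [xj / s for xj, s in zip(x, scales)]
    w_s = [[wjk * s for wjk in row] for row, s in zip(weight, scales)]
    return x_s, w_s

def matvec(x, weight):
    """y_k = sum_j x_j * weight[j][k]."""
    return [sum(xj * row[k] for xj, row in zip(x, weight)) for k in range(len(weight[0]))]

x = [16.0, 1.0]                        # channel 0 is an activation outlier
weight = [[1.0, -0.5], [0.25, 1.0]]    # weight[j][k]: input channel j, output k
scales = smooth_scales([16.0, 1.0], [1.0, 1.0])
x_s, w_s = apply_smoothing(x, weight, scales)
print(scales, x_s, matvec(x, weight), matvec(x_s, w_s))
```

With alpha = 0.5 the outlier channel drops from 16 to 4 while its weight row grows to a maximum of 4, so both tensors end up with a similar range that is easier to quantize to 8 bits.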