Local Inference | 2026 | Intermediate
Aman Sunesh, Ali Alshehhi, Hivansh Dhakne
A single-GPU controller that routes each request across FP16, quantized, speculative, prefix-cached, and batched inference modes using cheap workload features.
GPU Hunter takeaway: A desktop GPU can serve different prompts better when the runtime switches strategy per request instead of locking into one mode.
Local Inference | 2026 | Starter
Uygar Kurt
A unified empirical evaluation of llama.cpp quantization formats for Llama-3.1-8B-Instruct on commodity hardware.
GPU Hunter takeaway: A GPU recommendation is incomplete without naming the quantization formats that are realistic for that VRAM tier.
Local Inference | 2025 | Starter
Afsara Benazir, Felix Xiaozhu Lin
A profiling study of Apple Silicon's unified memory architecture for on-device LLM inference under different quantization choices.
GPU Hunter takeaway: Mac recommendations should compare quantized throughput, memory pressure, and unified-memory behavior against CUDA GPUs.
Local Inference | 2025 | Intermediate
He Sun, Li Li, Mingjun Xiao, Chengzhong Xu
LeoAM uses adaptive hierarchical GPU-CPU-disk KV management, lightweight KV abstracts, compression, and pipelining for long context on one commodity GPU.
GPU Hunter takeaway: A single desktop GPU can handle longer contexts when the runtime manages GPU, CPU, and disk tiers deliberately.
Local Inference | 2025 | Advanced
Jiakun Fan, Yanglin Zhang, Xiangchen Li, Dimitrios S. Nikolopoulos
A hybrid CPU-GPU execution approach that overlaps CPU-offloaded KV and attention work with GPU execution during memory-bound decoding.
GPU Hunter takeaway: Offload only helps when CPU work, GPU kernels, and transfers overlap; otherwise it just slows down a model that barely fits.
Local Inference | 2023 | Starter
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen
A consumer-GPU inference engine that exploits activation locality and CPU-GPU hybrid execution.
GPU Hunter takeaway: A 4090-class card can punch above its VRAM limit when the runtime is designed around model sparsity and locality.
Local Inference | 2023 | Intermediate
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
Ying Sheng, Lianmin Zheng, Binhang Yuan, Ion Stoica
A system for running very large models on limited hardware by coordinating GPU, CPU, and disk memory.
GPU Hunter takeaway: Offloading can make a model fit, but it should be treated as a throughput compromise, not free VRAM.
Local Inference | 2023 | Intermediate
LLM in a flash: Efficient Large Language Model Inference with Limited Memory
Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Mehrdad Farajtabar
A hardware-aware method for running models larger than available DRAM by optimizing flash-memory transfers.
GPU Hunter takeaway: Model fit is not binary; memory hierarchy determines whether a barely fitting setup is usable or painful.
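Several takeaways above come down to the same arithmetic: how many bytes the weights and KV cache need, and which memory tier they land in. The sketch below is a rough back-of-the-envelope estimator, not the method of any paper listed here. Everything in it is an assumption made for illustration: the `ModelShape` values approximate a Llama-3.1-8B-style model with a rounded parameter count, the bits-per-weight figures are the nominal block sizes of three llama.cpp formats (real GGUF files mix tensor types and will differ), and `overhead_gib` is a placeholder for runtime buffers.

```python
from dataclasses import dataclass

GIB = 1024 ** 3

# Nominal bits per weight for three llama.cpp formats (assumption:
# real GGUF files mix tensor types, so actual sizes will differ).
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q4_0": 4.5}


@dataclass
class ModelShape:
    """Hypothetical container for the numbers the estimate needs."""
    n_params: float   # total parameter count
    n_layers: int     # transformer layers
    n_kv_heads: int   # key/value heads (grouped-query attention)
    head_dim: int     # dimension per head


def weight_bytes(shape: ModelShape, fmt: str) -> float:
    """Approximate bytes needed to hold the weights in a given format."""
    return shape.n_params * BITS_PER_WEIGHT[fmt] / 8


def kv_cache_bytes(shape: ModelShape, context_len: int,
                   bytes_per_elem: int = 2) -> int:
    """Bytes for keys plus values across all layers at a context length."""
    return (2 * shape.n_layers * shape.n_kv_heads * shape.head_dim
            * context_len * bytes_per_elem)


def placement(shape: ModelShape, fmt: str, context_len: int,
              vram_gib: float, ram_gib: float,
              overhead_gib: float = 1.0) -> str:
    """Classify a setup by the slowest memory tier it has to touch."""
    need_gib = (weight_bytes(shape, fmt)
                + kv_cache_bytes(shape, context_len)) / GIB + overhead_gib
    if need_gib <= vram_gib:
        return "fits in VRAM"
    if need_gib <= vram_gib + ram_gib:
        return "needs CPU offload"
    return "needs disk offload"


# Rounded, Llama-3.1-8B-like shape (assumed values for illustration).
llama_8b = ModelShape(n_params=8.0e9, n_layers=32, n_kv_heads=8, head_dim=128)

for fmt in BITS_PER_WEIGHT:
    print(fmt, placement(llama_8b, fmt, context_len=8192,
                         vram_gib=8, ram_gib=32))
```

The three labels are deliberately coarse. A "needs CPU offload" result says the model can run, not that it will run well; as the offloading entries above note, usable speed depends on how well the runtime overlaps transfers with compute.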