Local AI Inference Papers for GPUs, llama.cpp and Apple Silicon

Curated local AI inference papers covering llama.cpp quantization, constrained GPUs, Apple Silicon, offload, and single-workstation LLM systems.

Updated May 27, 2026 | 8 curated papers | Primary keyword: local AI inference papers
01  //  editorial context

Local inference is not just cloud serving on smaller hardware. Desktop GPUs, used cards, laptops, and Apple Silicon machines have different memory hierarchies, runtimes, quantization formats, and offload limits.

This cluster focuses on the papers that help GPU Hunter answer the practical question: what can you run on one machine, and what compromises are required when the model barely fits?

02  //  how this changes GPU choice
  • Use the llama.cpp quantization work before choosing a budget card. The quant format determines whether 12GB, 16GB, or 24GB is realistic.
  • Apple Silicon papers explain why unified memory capacity is not the same as discrete GPU VRAM, especially for throughput and backend support.
  • Constrained-GPU papers help evaluate CPU offload and disk tiers without pretending they are free performance.
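
The fit question in the first bullet is mostly arithmetic. As a rough sketch, assuming weights dominate memory and a fixed reserve covers KV cache and runtime buffers, the check could look like the code below. The function names, the 2 GiB reserve, and the 13B example are illustrative assumptions, not figures from the papers. The Q4_0 and Q8_0 values follow from their block layout: 32 weights plus one 16-bit scale per block.

```python
GIB = 2**30

# Effective bits per weight, including the per-block scale.
# Q4_0: 16 bytes of 4-bit weights + 2-byte scale per 32 weights = 4.5 bits.
# Q8_0: 32 bytes of 8-bit weights + 2-byte scale per 32 weights = 8.5 bits.
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q4_0": 4.5}


def weight_gib(params_billion, fmt):
    """Approximate size of the model weights in GiB for a given format."""
    return params_billion * 1e9 * BITS_PER_WEIGHT[fmt] / 8 / GIB


def fits(params_billion, fmt, vram_gib, reserve_gib=2.0):
    """True if weights plus an assumed reserve fit in the given VRAM.

    reserve_gib is a placeholder for KV cache and runtime buffers;
    real usage depends on context length and backend.
    """
    return weight_gib(params_billion, fmt) + reserve_gib <= vram_gib


if __name__ == "__main__":
    for fmt in ("Q4_0", "Q8_0", "F16"):
        size = weight_gib(13, fmt)
        verdict = {v: fits(13, fmt, v) for v in (12, 16, 24)}
        print(f"13B {fmt}: {size:.1f} GiB weights, fits: {verdict}")
```

Under these assumptions a 13B model needs about 6.8 GiB of weights at Q4_0, 12.9 GiB at Q8_0, and 24.2 GiB at F16, so the same model moves from a 12GB card to a 16GB card to not fitting on a 24GB card as the format changes. Treating the advertised capacity as GiB is a simplification; usable VRAM is somewhat lower in practice.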
starter papers

Read these first if you want the fastest path from research to a hardware decision.

Local Inference | 2023 | Starter

PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen

A consumer-GPU inference engine that exploits activation locality and CPU-GPU hybrid execution.

GPU Hunter takeaway: A 4090-class card can punch above its VRAM limit when the runtime is designed around model sparsity and locality.

arXiv PDF HF
advanced papers

These papers go deeper into kernels, cache policy, low-bit formats, or serving tradeoffs.

Local Inference | 2025 | Advanced

Parallel CPU-GPU Execution for LLM Inference on Constrained GPUs

Jiakun Fan, Yanglin Zhang, Xiangchen Li, Dimitrios S. Nikolopoulos

A hybrid CPU-GPU execution approach that overlaps CPU-offloaded KV and attention work with GPU execution during memory-bound decoding.

GPU Hunter takeaway: Offload is only useful when CPU work, GPU kernels, and transfers overlap; otherwise it just makes a barely fitting model slow.

Summary arXiv PDF HF
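
The overlap point in the takeaway above can be illustrated with an idealized timing model. This is a sketch under stated assumptions, not the paper's scheduler: each decode step is split into three hypothetical stages (CPU work, host-to-device transfer, GPU kernels), and perfect overlap is assumed to hide everything except the slowest stage. The stage times are made-up numbers.

```python
def step_time_serial(cpu_ms, transfer_ms, gpu_ms):
    """No overlap: each stage waits for the previous one to finish."""
    return cpu_ms + transfer_ms + gpu_ms


def step_time_overlapped(cpu_ms, transfer_ms, gpu_ms):
    """Ideal overlap: stages run concurrently, so the slowest one sets the pace."""
    return max(cpu_ms, transfer_ms, gpu_ms)


if __name__ == "__main__":
    # Hypothetical per-step stage times in milliseconds.
    cpu_ms, transfer_ms, gpu_ms = 8.0, 3.0, 10.0
    serial = step_time_serial(cpu_ms, transfer_ms, gpu_ms)
    overlapped = step_time_overlapped(cpu_ms, transfer_ms, gpu_ms)
    print(f"serial: {serial} ms, overlapped: {overlapped} ms, "
          f"speedup: {serial / overlapped:.2f}x")
```

With these numbers the serial step takes 21 ms and the overlapped step 10 ms. Real systems land somewhere between the two, because synchronization points and shared memory bandwidth limit how much work can actually run concurrently.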
Local Inference | 2023 | Intermediate

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

Ying Sheng, Lianmin Zheng, Binhang Yuan, Ion Stoica

A system for running very large models on limited hardware by coordinating GPU, CPU, and disk memory.

GPU Hunter takeaway: Offloading can make a model fit, but it should be treated as a throughput compromise, not free VRAM.

arXiv PDF HF
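
To see why offload is a throughput compromise, a back-of-the-envelope bound helps. The sketch below assumes decoding is memory-bound, every weight is read once per decode step, and transfers do not overlap with GPU reads. It ignores KV cache traffic and compute entirely. The function name and all bandwidth figures are hypothetical, and this is not FlexGen's cost model.

```python
def decode_tokens_per_s(weights_gib, offload_fraction, gpu_bw_gib_s,
                        link_bw_gib_s, batch=1):
    """Rough upper bound on aggregate decode throughput.

    weights_gib:      total weight size
    offload_fraction: share of weights kept in host memory (0.0 to 1.0)
    gpu_bw_gib_s:     assumed GPU memory bandwidth
    link_bw_gib_s:    assumed host-to-GPU link bandwidth
    batch:            sequences decoded together; one pass over the
                      weights is shared by the whole batch
    """
    resident_gib = weights_gib * (1 - offload_fraction)
    streamed_gib = weights_gib * offload_fraction
    step_s = resident_gib / gpu_bw_gib_s + streamed_gib / link_bw_gib_s
    return batch / step_s


if __name__ == "__main__":
    # Hypothetical hardware: 500 GiB/s VRAM, 20 GiB/s host link, 10 GiB model.
    for frac, batch in ((0.0, 1), (0.5, 1), (0.5, 32)):
        rate = decode_tokens_per_s(10, frac, 500, 20, batch=batch)
        print(f"offload={frac:.0%} batch={batch}: {rate:.1f} tokens/s")
```

Under these assumptions a fully resident model reaches about 50 tokens/s, while offloading half of it drops a single stream to about 3.8 tokens/s. Raising the batch to 32 recovers aggregate throughput (about 123 tokens/s) because the slow transfer is shared across sequences, which is why throughput-oriented offloading systems favor large batches over low latency.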
full curated set

The complete paper set for this topic, with source links and GPU Hunter takeaways.

Local Inference | 2025 | Advanced

Parallel CPU-GPU Execution for LLM Inference on Constrained GPUs

Jiakun Fan, Yanglin Zhang, Xiangchen Li, Dimitrios S. Nikolopoulos

A hybrid CPU-GPU execution approach that overlaps CPU-offloaded KV and attention work with GPU execution during memory-bound decoding.

GPU Hunter takeaway: Offload is only useful when CPU work, GPU kernels, and transfers overlap; otherwise it just makes a barely fitting model slow.

Summary arXiv PDF HF
Local Inference | 2023 | Starter

PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen

A consumer-GPU inference engine that exploits activation locality and CPU-GPU hybrid execution.

GPU Hunter takeaway: A 4090-class card can punch above its VRAM limit when the runtime is designed around model sparsity and locality.

arXiv PDF HF
Local Inference | 2023 | Intermediate

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

Ying Sheng, Lianmin Zheng, Binhang Yuan, Ion Stoica

A system for running very large models on limited hardware by coordinating GPU, CPU, and disk memory.

GPU Hunter takeaway: Offloading can make a model fit, but it should be treated as a throughput compromise, not free VRAM.

arXiv PDF HF
Local Inference | 2023 | Intermediate

LLM in a flash: Efficient Large Language Model Inference with Limited Memory

Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Mehrdad Farajtabar

A hardware-aware method for running models larger than available DRAM by optimizing flash-memory transfers.

GPU Hunter takeaway: Model fit is not binary; memory hierarchy determines whether a barely fitting setup is usable or painful.

arXiv PDF HF
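
The "fit is not binary" point can be made concrete by asking where a model's bytes would land across memory tiers. The sketch below uses a simple greedy fill, fastest tier first. It is an illustration only and not the paper's method; the tier names and capacities are hypothetical.

```python
def place_model(model_gib, tiers):
    """Greedily fill tiers in the given order (fastest first).

    tiers is a list of (name, capacity_gib) pairs.
    Returns a dict mapping tier name to GiB placed there.
    Raises ValueError if the listed tiers cannot hold the model.
    """
    placement = {}
    remaining = model_gib
    for name, capacity_gib in tiers:
        used = min(remaining, capacity_gib)
        placement[name] = used
        remaining -= used
    if remaining > 1e-9:
        raise ValueError("model does not fit in the listed tiers")
    return placement


if __name__ == "__main__":
    # Hypothetical workstation: 24 GiB VRAM, 12 GiB free RAM, 512 GiB flash.
    tiers = [("vram", 24), ("ram", 12), ("flash", 512)]
    for model_gib in (20, 30, 40):
        print(model_gib, place_model(model_gib, tiers))
```

A 20 GiB model stays entirely in VRAM, a 30 GiB model spills 6 GiB into RAM, and a 40 GiB model pushes 4 GiB onto flash. All three "fit", but each spill adds a slower tier to every decode step, which is the difference between a usable setup and a painful one.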
FAQ

Which papers are most useful for a first local AI GPU?

Start with the llama.cpp quantization evaluation, PowerInfer, and the constrained-GPU execution papers. They map directly to VRAM fit, offload, and runtime expectations.

Are Apple Silicon papers comparable to CUDA GPU papers?

Only partly. Apple Silicon uses unified memory and Metal/MLX-style runtimes, so capacity comparisons need separate throughput and backend context.

Source pages are linked directly to arXiv, PDFs, and Hugging Face Papers where available. GPU Hunter summaries are editorial context, not copies of the original abstracts.