research cluster / local inference

Local AI Inference Papers for GPUs, llama.cpp and Apple Silicon

Curated local AI inference papers covering llama.cpp quantization, constrained GPUs, Apple Silicon, offload, and single-workstation LLM systems.

Updated May 27, 2026 | 8 curated papers | Primary keyword: local AI inference papers

01 // editorial context

Local inference is not just cloud serving on smaller hardware. Desktop GPUs, used cards, laptops, and Apple Silicon machines have different memory hierarchies, runtimes, quantization formats, and offload limits.

This cluster focuses on the papers that help GPU Hunter answer the practical question: what can you run on one machine, and what compromises are required when the model barely fits?

02 // how this changes GPU choice

1. Use the llama.cpp quantization work before choosing a budget card. The quant format determines whether 12 GB, 16 GB, or 24 GB is realistic.
2. Apple Silicon papers explain why unified memory capacity is not the same as discrete GPU VRAM, especially for throughput and backend support.
3. Constrained-GPU papers help evaluate CPU offload and disk tiers without pretending they are free performance.
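
The first point can be made concrete with a back-of-the-envelope fit check. The sketch below is a rough estimate, not a result from any of the papers: the bits-per-weight values are approximate figures for llama.cpp formats, the model shape is a Llama-3.1-8B-style configuration with about 8.03 billion parameters, and `ModelShape`, `fits`, and the 1 GiB overhead allowance are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass

GIB = 1024 ** 3

# Approximate bits per weight (assumed values; K-quant mixes vary by model).
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.85, "Q4_0": 4.5}


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical container for the few numbers the estimate needs."""
    n_params: float
    n_layers: int
    n_kv_heads: int
    head_dim: int


# Llama-3.1-8B-style shape (parameter count is approximate).
LLAMA_8B = ModelShape(n_params=8.03e9, n_layers=32, n_kv_heads=8, head_dim=128)


def weights_gib(shape: ModelShape, quant: str) -> float:
    """Size of the quantized weights in GiB."""
    return shape.n_params * BITS_PER_WEIGHT[quant] / 8 / GIB


def kv_cache_gib(shape: ModelShape, context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GiB: one K and one V vector per layer per token."""
    per_token = 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim * bytes_per_value
    return per_token * context_tokens / GIB


def fits(shape: ModelShape, quant: str, context_tokens: int,
         vram_gib: float, overhead_gib: float = 1.0) -> tuple[bool, float]:
    """Return (fits, required GiB); overhead_gib is an assumed allowance
    for runtime buffers and the desktop, not a measured value."""
    need = weights_gib(shape, quant) + kv_cache_gib(shape, context_tokens) + overhead_gib
    return need <= vram_gib, need


if __name__ == "__main__":
    for vram in (12, 16, 24):
        for quant in BITS_PER_WEIGHT:
            ok, need = fits(LLAMA_8B, quant, context_tokens=8192, vram_gib=vram)
            verdict = "fits" if ok else "does not fit"
            print(f"{vram} GiB, {quant}: needs about {need:.1f} GiB, {verdict}")
```

Under these assumptions an 8B model at F16 with an 8K context lands just above a 16 GiB card, while the 8-bit and 4-bit formats leave room on 12 GiB. Real headroom depends on the runtime, the context length, and any KV cache quantization, so treat the output as a first filter rather than a guarantee.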

starter papers

Read these first if you want the fastest path from research to a hardware decision.

Local Inference | 2026 | Starter

Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct

Uygar Kurt

A unified empirical evaluation of llama.cpp quantization formats for Llama-3.1-8B-Instruct on commodity hardware.

GPU Hunter takeaway: A GPU recommendation is incomplete without naming the quantization formats that are realistic for that VRAM tier.

Local Inference | 2025 | Starter

Profiling Large Language Model Inference on Apple Silicon: A Quantization Perspective

Afsara Benazir, Felix Xiaozhu Lin

A profiling study of Apple Silicon's unified memory architecture for on-device LLM inference under different quantization choices.

GPU Hunter takeaway: Mac recommendations should compare quantized throughput, memory pressure, and unified-memory behavior against CUDA GPUs.

Local Inference | 2023 | Starter

PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen

A consumer-GPU inference engine that exploits activation locality and CPU-GPU hybrid execution.

GPU Hunter takeaway: A 4090-class card can punch above its VRAM limit when the runtime is designed around model sparsity and locality.

advanced papers

These papers go deeper into kernels, cache policy, low-bit formats, or serving tradeoffs.

Local Inference | 2025 | Intermediate

Breaking the Boundaries of Long-Context LLM Inference: Adaptive KV Management on a Single Commodity GPU

He Sun, Li Li, Mingjun Xiao, Chengzhong Xu

LeoAM uses adaptive hierarchical GPU-CPU-disk KV management, lightweight KV abstracts, compression, and pipelining for long context on one commodity GPU.

GPU Hunter takeaway: A single desktop GPU can handle longer contexts when the runtime manages GPU, CPU, and disk tiers deliberately.

Local Inference | 2025 | Advanced

Parallel CPU-GPU Execution for LLM Inference on Constrained GPUs

Jiakun Fan, Yanglin Zhang, Xiangchen Li, Dimitrios S. Nikolopoulos

A hybrid CPU-GPU execution approach that overlaps CPU-offloaded KV and attention work with GPU execution during memory-bound decoding.

GPU Hunter takeaway: Offload is only useful when CPU work, GPU kernels, and transfers overlap; otherwise it just makes a barely fitting model slow.

Local Inference | 2023 | Intermediate

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

Ying Sheng, Lianmin Zheng, Binhang Yuan, Ion Stoica

A system for running very large models on limited hardware by coordinating GPU, CPU, and disk memory.

GPU Hunter takeaway: Offloading can make a model fit, but it should be treated as a throughput compromise, not free VRAM.
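
The two offload takeaways above come down to whether per-token stage times add up or hide behind each other. The sketch below is a toy latency model, not the scheduling method of either paper: `decode_ms_per_token` is a hypothetical helper, the stage timings are made-up inputs, and "overlapped" assumes perfect pipelining, which real runtimes only approximate.

```python
def decode_ms_per_token(gpu_ms: float, cpu_ms: float, transfer_ms: float,
                        overlapped: bool) -> float:
    """Per-token decode latency under a toy two-regime model.

    Serial: the GPU kernels, the CPU-offloaded work, and the transfers
    run one after another, so their times add.
    Overlapped: all three run concurrently, so the slowest stage sets
    the pace (idealized; assumes no synchronization stalls).
    """
    if overlapped:
        return max(gpu_ms, cpu_ms, transfer_ms)
    return gpu_ms + cpu_ms + transfer_ms


def tokens_per_second(ms_per_token: float) -> float:
    """Convert per-token latency in milliseconds to throughput."""
    return 1000.0 / ms_per_token


if __name__ == "__main__":
    # Illustrative timings only, not measurements from any paper.
    stages = dict(gpu_ms=20.0, cpu_ms=25.0, transfer_ms=10.0)
    for overlapped in (False, True):
        ms = decode_ms_per_token(**stages, overlapped=overlapped)
        label = "overlapped" if overlapped else "serial"
        print(f"{label}: {ms:.0f} ms/token, {tokens_per_second(ms):.1f} tok/s")
```

With these inputs the serial case pays for every stage, while the overlapped case is bounded by the CPU stage alone. Either way the offloaded setup is slower than a model that fits entirely in VRAM and only pays the GPU stage, which is why offload counts as a throughput compromise.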

full curated set

The papers not already listed above, with GPU Hunter takeaways. Together with the starter and advanced papers, they complete the eight-paper set.

Local Inference | 2026 | Intermediate

ModeSwitch-LLM: A Lightweight Phase-Aware Controller for Cross-Mode LLM Inference on a Single GPU

Aman Sunesh, Ali Alshehhi, Hivansh Dhakne

A single-GPU controller that routes each request across FP16, quantized, speculative, prefix-cached, and batched inference modes using cheap workload features.

GPU Hunter takeaway: A desktop GPU can serve different prompts better when the runtime switches strategy per request instead of locking into one mode.

Local Inference | 2023 | Intermediate

LLM in a flash: Efficient Large Language Model Inference with Limited Memory

Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Mehrdad Farajtabar

A hardware-aware method for running models larger than available DRAM by optimizing flash-memory transfers.

GPU Hunter takeaway: Model fit is not binary; memory hierarchy determines whether a barely fitting setup is usable or painful.

FAQ

Which papers are most useful for a first local AI GPU?

Start with the llama.cpp quantization evaluation, PowerInfer, and the constrained-GPU execution papers. They map directly to VRAM fit, offload, and runtime expectations.

Are Apple Silicon papers comparable to CUDA GPU papers?

Only partly. Apple Silicon uses unified memory and Metal/MLX-style runtimes, so capacity comparisons need separate throughput and backend context.
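
One way to see why capacity and throughput need separate treatment is a bandwidth ceiling for single-stream decoding. The sketch below assumes a dense model at batch size 1 where every generated token reads all weights once, and it ignores KV cache reads, compute limits, and backend maturity. `Machine` and `decode_ceiling_tok_s` are hypothetical names, and both machines are illustrative, not specifications of real products.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Machine:
    """Hypothetical machine description for the estimate."""
    name: str
    memory_gb: float       # memory usable for model weights
    bandwidth_gb_s: float  # peak memory bandwidth


def decode_ceiling_tok_s(machine: Machine, model_gb: float) -> Optional[float]:
    """Upper bound on decode speed, or None if the weights do not fit.

    Assumes each token streams the full weights from memory once, so
    the ceiling is bandwidth divided by model size.
    """
    if model_gb > machine.memory_gb:
        return None
    return machine.bandwidth_gb_s / model_gb


if __name__ == "__main__":
    # Illustrative numbers only.
    discrete = Machine("discrete GPU", memory_gb=24, bandwidth_gb_s=900)
    unified = Machine("unified memory", memory_gb=64, bandwidth_gb_s=400)
    for model_gb in (5, 40):
        for m in (discrete, unified):
            ceiling = decode_ceiling_tok_s(m, model_gb)
            shown = "does not fit" if ceiling is None else f"at most {ceiling:.0f} tok/s"
            print(f"{model_gb} GB model on {m.name}: {shown}")
```

In this toy comparison the larger unified pool runs a model the discrete card cannot hold, while the discrete card has the higher ceiling on the model both can fit. Measured throughput sits below these ceilings and depends on the runtime, which is the backend context the answer above refers to.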

Source pages are linked directly to arXiv, PDFs, and Hugging Face Papers where available. GPU Hunter summaries are editorial context, not copies of the original abstracts.