
Best LLM Quantization Papers for Local AI Inference

Curated LLM quantization papers covering GPTQ, AWQ, GGUF, NF4, FP4, NVFP4, MXFP4, and what low-bit inference means for GPU choice.

Updated May 27, 2026 · 9 curated papers · Primary keyword: LLM quantization papers
01  //  editorial context

Quantization is the most direct way research turns into GPU buying decisions. A model that does not fit in VRAM at FP16 may fit comfortably at Q4, but a poorly supported low-bit format can save VRAM without improving real tokens per second.
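
As a rough illustration of the fit question, the sketch below estimates weight memory from parameter count and bits per weight. The 32B parameter count and the 4.5 bits-per-weight figure (4-bit values plus per-group scales) are assumptions for the example, and the estimate ignores KV cache, activations, and runtime overhead.

```python
def weight_footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

# Hypothetical 32B-parameter model. 4.5 bits approximates a 4-bit format
# once per-group scales are counted (assumed; the real figure varies by format).
for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit + scales", 4.5)]:
    print(f"{label:>15}: {weight_footprint_gib(32e9, bits):6.1f} GiB")
```

Under these assumptions the FP16 weights exceed any single consumer card, while the 4-bit variant lands under 24GB before cache and overhead are added.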

This collection starts with practical local formats and methods such as llama.cpp's GGUF and AWQ, then moves into FP4, NVFP4, MXFP4, NF4, and layer-sensitive methods. The goal is to help you separate model file size from deployable inference speed.

02  //  how this changes GPU choice
  • Use these papers when choosing between 16GB, 24GB, and 32GB consumer GPUs. Quantization can make the model fit, but quality and backend support determine whether the card is useful.
  • Blackwell FP4 support is promising, but the FP4 papers show why model-aware quantization and kernel-backed speedups matter more than a spec-sheet format label.
  • For used RTX 3090 and RTX 4090 buyers, GPTQ, AWQ, GGUF, and NF4 context explains why 24GB cards remain viable for many local workloads.
starter papers

Read these first if you want the fastest path from research to a hardware decision.

Quantization · 2023 · Starter

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Song Han

A hardware-friendly weight-only quantization method that protects salient channels based on activation statistics.

GPU Hunter takeaway: A practical paper for understanding why some 4-bit models are much faster than others on the same GPU.

arXiv PDF HF
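
A toy sketch of the activation-aware idea: scale a weight that multiplies large activations before rounding, then fold the inverse scale into the activation. The weights, activations, and the fixed scale `s = 2` are made-up values for illustration; AWQ itself searches for per-channel scales from calibration data, which this sketch does not do.

```python
import numpy as np

def quantize_int4(w: np.ndarray) -> np.ndarray:
    """Symmetric round-to-nearest INT4 with one scale for the whole group."""
    step = np.abs(w).max() / 7.0
    return np.clip(np.round(w / step), -7, 7) * step

# Toy group: one output neuron, four input channels (values are made up).
w = np.array([0.2, 0.9, -0.5, 0.7])
x = np.array([10.0, 1.0, 1.0, 1.0])   # channel 0 carries large activations

y_ref = x @ w
y_rtn = x @ quantize_int4(w)          # plain round-to-nearest

# Activation-aware step: scale the salient weight up before quantizing and
# divide the matching activation by the same factor. s is an assumed constant.
s = np.array([2.0, 1.0, 1.0, 1.0])
y_scaled = (x / s) @ quantize_int4(w * s)

err_rtn = abs(y_rtn - y_ref)
err_scaled = abs(y_scaled - y_ref)
print(f"plain rounding error: {err_rtn:.3f}, scaled error: {err_scaled:.3f}")
```

The scaled weight keeps the same group maximum here, so the rounding step is unchanged while the error on the salient channel shrinks by the scale factor.
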
Quantization · 2022 · Starter

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh

A one-shot weight quantization method that uses approximate second-order information to compress large GPT-style models to 3-4 bits.

GPU Hunter takeaway: A 24GB card becomes far more useful when the model can be loaded at 4-bit without a major quality collapse.

arXiv PDF HF
Quantization · 2023 · Starter

QLoRA: Efficient Finetuning of Quantized LLMs

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer

A finetuning recipe that backpropagates through a frozen 4-bit model into LoRA adapters using NF4 and paged optimizers.

GPU Hunter takeaway: If a GPU can barely fit a base model for inference, QLoRA explains how adapter training can still be possible.

arXiv PDF HF
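
To make NF4 concrete, here is a minimal sketch of block-wise codebook quantization: normalize a block by its absolute maximum, then store the index of the nearest NF4 level. The level values are rounded to four decimals, the function names are hypothetical, and the sketch omits the double quantization of scales and the 64-value block size that QLoRA describes.

```python
import numpy as np

# NF4 levels, rounded to four decimals for readability.
NF4 = np.array([-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
                0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0])

def nf4_quantize(block: np.ndarray):
    """Return 4-bit indices plus the per-block absmax scale."""
    absmax = float(np.abs(block).max()) or 1.0      # guard against all-zero blocks
    idx = np.abs(block[:, None] / absmax - NF4[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), absmax

def nf4_dequantize(idx: np.ndarray, absmax: float) -> np.ndarray:
    """Look up each index in the codebook and restore the block scale."""
    return NF4[idx] * absmax

block = np.array([0.5, -1.5, 0.0, 3.0])             # made-up weights
idx, absmax = nf4_quantize(block)
restored = nf4_dequantize(idx, absmax)
print(idx, restored)
```
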
advanced papers

These papers go deeper into kernels, cache policy, low-bit formats, or serving tradeoffs.

Quantization · 2024 · Advanced

QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs

Saleh Ashkboos, Amirkeivan Mohtashami, Torsten Hoefler, James Hensman

A rotation-based quantization method that removes hidden-state outliers and quantizes weights, activations, and KV cache.

GPU Hunter takeaway: Good background for why next-gen GPUs may make activation and KV-cache precision just as important as weight precision.

arXiv PDF HF
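
The rotation idea can be sketched with a small orthonormal Hadamard matrix: rotating activations spreads a single outlier across all channels, and counter-rotating the weights leaves the layer output unchanged. This uses a plain Sylvester-constructed Hadamard matrix and made-up data as a simplification; it is not the paper's full pipeline and performs no actual quantization.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix via Sylvester construction (n a power of 2)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

n = 8
Q = hadamard(n)

x = np.zeros(n)
x[0] = 64.0                                              # one massive outlier channel
W = np.arange(n * 4, dtype=float).reshape(n, 4) / 10.0   # arbitrary weights

x_rot = x @ Q        # rotated activations, outlier energy spread evenly
W_rot = Q.T @ W      # counter-rotated weights, can be precomputed offline

print("max |x| before:", np.abs(x).max(), "after:", np.abs(x_rot).max())
print("output preserved:", np.allclose(x @ W, x_rot @ W_rot))
```

A smaller maximum magnitude means a 4-bit grid has to cover a narrower range, which is why rotated activations are easier to quantize.
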
full curated set

The complete paper set for this topic, with source links and GPU Hunter takeaways.

Quantization · 2023 · Starter

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Song Han

A hardware-friendly weight-only quantization method that protects salient channels based on activation statistics.

GPU Hunter takeaway: A practical paper for understanding why some 4-bit models are much faster than others on the same GPU.

arXiv PDF HF
Quantization · 2022 · Starter

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh

A one-shot weight quantization method that uses approximate second-order information to compress large GPT-style models to 3-4 bits.

GPU Hunter takeaway: A 24GB card becomes far more useful when the model can be loaded at 4-bit without a major quality collapse.

arXiv PDF HF
Quantization · 2022 · Intermediate

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Guangxuan Xiao, Ji Lin, Mickael Seznec, Song Han

A post-training quantization approach that moves activation outlier difficulty into weights so W8A8 inference can stay accurate.

GPU Hunter takeaway: Read this when comparing INT8, W8A8, and FP8 serving paths on NVIDIA workstation or datacenter GPUs.

arXiv PDF HF
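
A minimal sketch of the smoothing step, assuming the per-channel scale s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) with alpha = 0.5. Activations are divided by s and weights multiplied by s, so the product is mathematically unchanged while the activation outlier shrinks. The matrices and the function name `smooth` are made up for illustration.

```python
import numpy as np

def smooth(X: np.ndarray, W: np.ndarray, alpha: float = 0.5):
    """Return (X / s, s * W, s) so that the smoothed product equals X @ W."""
    act_max = np.abs(X).max(axis=0)      # per input channel, over tokens
    w_max = np.abs(W).max(axis=1)        # per input channel, over outputs
    s = act_max**alpha / w_max**(1 - alpha)
    return X / s, W * s[:, None], s

# Shapes: X is (tokens, c_in), W is (c_in, c_out). Channel 1 is the outlier.
X = np.array([[1.0, 80.0, 0.5],
              [-2.0, -64.0, 1.0]])
W = np.array([[0.4, -0.2],
              [0.1, 0.3],
              [-0.5, 0.25]])

X_s, W_s, s = smooth(X, W)
print("activation max before:", np.abs(X).max(), "after:", np.abs(X_s).max())
print("output preserved:", np.allclose(X @ W, X_s @ W_s))
```

With alpha = 0.5 the per-channel maxima of the smoothed activations and weights come out equal, which is the balance point between the two quantization difficulties.
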
Quantization · 2023 · Starter

QLoRA: Efficient Finetuning of Quantized LLMs

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer

A finetuning recipe that backpropagates through a frozen 4-bit model into LoRA adapters using NF4 and paged optimizers.

GPU Hunter takeaway: If a GPU can barely fit a base model for inference, QLoRA explains how adapter training can still be possible.

arXiv PDF HF
Quantization · 2024 · Advanced

QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs

Saleh Ashkboos, Amirkeivan Mohtashami, Torsten Hoefler, James Hensman

A rotation-based quantization method that removes hidden-state outliers and quantizes weights, activations, and KV cache.

GPU Hunter takeaway: Good background for why next-gen GPUs may make activation and KV-cache precision just as important as weight precision.

arXiv PDF HF
Quantization · 2024 · Advanced

Extreme Compression of Large Language Models via Additive Quantization

Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Dan Alistarh

AQLM applies learned additive quantization to push LLM compression into the 2-3 bit range with practical CPU and GPU implementations.

GPU Hunter takeaway: Useful for builders choosing between a smaller dense model and an aggressively compressed larger model.

arXiv PDF HF
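
The decoding side of additive quantization can be sketched as follows: each group of weights is stored as a few small codebook indices and reconstructed as the sum of the selected codewords. The codebooks here are random stand-ins (AQLM learns them), and the group size, codebook count, and code width are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
g, M, B = 8, 2, 8     # group size, number of codebooks, bits per code (illustrative)

codebooks = rng.standard_normal((M, 2**B, g))    # random stand-ins for learned codebooks
codes = rng.integers(0, 2**B, size=(6, M))       # indices for six weight groups

def decode(codes: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Rebuild each group as the sum of one codeword from every codebook."""
    return sum(codebooks[m][codes[:, m]] for m in range(codebooks.shape[0]))

w = decode(codes, codebooks)                     # shape (6, g)

# Index cost only; codebook storage and any scales add overhead on top.
bits_per_weight = M * B / g
print(w.shape, bits_per_weight)
```

The index cost depends only on how many codes are stored per group, which is how this family of methods reaches the 2-3 bit range.
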
FAQ

Which quantization papers should local AI users read first?

Start with the llama.cpp quantization evaluation, AWQ, GPTQ, and QLoRA. Then read the FP4 and NF4 papers when hardware support or Blackwell-class GPUs matter.

Does quantization always make inference faster?

No. Quantization reduces memory footprint, but speed depends on dequantization, packed-weight kernels, memory bandwidth, and whether the runtime supports that format well.
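
One way to see the bandwidth part of this answer is a simple ceiling: if single-stream decoding has to read every weight once per token, tokens per second cannot exceed memory bandwidth divided by weight size. The bandwidth figure and model sizes below are hypothetical, and the bound ignores KV-cache reads, dequantization cost, and compute limits, all of which push real throughput lower.

```python
def decode_ceiling_tokens_per_s(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode speed when weight reads dominate."""
    return bandwidth_gb_s / weight_gb

BANDWIDTH_GB_S = 1000.0   # hypothetical card, not a measured or spec value

# Assumed weight sizes for an 8B-class model at two precisions.
for label, weight_gb in [("FP16", 16.0), ("4-bit", 4.5)]:
    ceiling = decode_ceiling_tokens_per_s(weight_gb, BANDWIDTH_GB_S)
    print(f"{label}: at most {ceiling:.0f} tokens/s")
```

A runtime without efficient packed-weight kernels for a given format can land far below this ceiling, which is why a smaller file does not guarantee a faster model.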

Source pages are linked directly to arXiv, PDFs, and Hugging Face Papers where available. GPU Hunter summaries are editorial context, not copies of the original abstracts.