Best LLM Quantization Papers for Local AI Inference

Curated LLM quantization papers covering GPTQ, AWQ, GGUF, NF4, FP4, NVFP4, MXFP4, and what low-bit inference means for GPU choice.

Updated May 27, 2026 | 9 curated papers | Primary keyword: LLM quantization papers

01 // editorial context

Quantization is the most direct way research turns into GPU buying decisions. A model that is impossible at FP16 may fit comfortably at Q4, while a poorly supported low-bit format can save VRAM but fail to improve real tokens per second.
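
As a rough illustration of the fit claim above, weight storage scales linearly with bits per weight. The sketch below is a back-of-envelope estimate only. The function names, the 8B parameter count, the 2 GiB headroom, and the 4.5 bits-per-weight figure (4-bit values plus one 16-bit scale per 32 weights) are assumptions for illustration, and the estimate ignores KV cache, activations, and runtime overhead.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB; excludes KV cache and runtime overhead."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(params_billion: float, bits_per_weight: float,
         vram_gib: float, headroom_gib: float = 2.0) -> bool:
    """True if the weights plus an assumed headroom fit in the given VRAM."""
    return weight_gib(params_billion, bits_per_weight) + headroom_gib <= vram_gib


# Hypothetical 8B-parameter model on a 16 GiB card.
for label, bits in [("FP16", 16.0), ("8-bit", 8.0), ("4-bit + scales", 4.5)]:
    size = weight_gib(8, bits)
    print(f"8B at {label}: {size:.1f} GiB, fits in 16 GiB: {fits(8, bits, 16)}")
```

Under these assumptions the FP16 weights alone come to roughly 14.9 GiB, which leaves no room for headroom on a 16 GiB card, while the 4-bit variant needs about 4.2 GiB.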

This collection starts with practical local formats such as llama.cpp's GGUF quantization types and AWQ, then moves into FP4, NVFP4, MXFP4, NF4, and layer-sensitive methods. The goal is to help you separate model file size from deployable inference speed.

02 // how this changes GPU choice

- Use these papers when choosing between 16GB, 24GB, and 32GB consumer GPUs. Quantization can make the model fit, but quality and backend support determine whether the card is useful.
- Blackwell FP4 support is promising, but the FP4 papers show why model-aware quantization and kernel-backed speedups matter more than a spec-sheet format label.
- For used RTX 3090 and RTX 4090 buyers, GPTQ, AWQ, GGUF, and NF4 context explains why 24GB cards remain viable for many local workloads.

starter papers

Read these first if you want the fastest path from research to a hardware decision.

Local Inference | 2026 | Starter

Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct

Uygar Kurt

A unified empirical evaluation of llama.cpp quantization formats for Llama-3.1-8B-Instruct on commodity hardware.

GPU Hunter takeaway: A GPU recommendation is incomplete without naming the quantization formats that are realistic for that VRAM tier.

Quantization | 2023 | Starter

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Song Han

A hardware-friendly weight-only quantization method that protects salient channels based on activation statistics.

GPU Hunter takeaway: A practical paper for understanding why some 4-bit models are much faster than others on the same GPU.

Quantization | 2022 | Starter

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh

A one-shot weight quantization method that uses approximate second-order information to compress large GPT-style models to 3-4 bits.

GPU Hunter takeaway: A 24GB card becomes far more useful when the model can be loaded at 4-bit without a major quality collapse.
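
For context, the simplest way to reach 4 bits is plain round-to-nearest quantization, which is the baseline that one-shot methods such as GPTQ aim to improve on. The sketch below shows only that baseline, not GPTQ itself: symmetric absmax quantization of one weight group. The function names and the example weights are hypothetical, and GPTQ's second-order error compensation is not shown.

```python
def quantize_group(values, bits=4):
    """Round-to-nearest symmetric quantization of one group with a shared scale."""
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit signed codes
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Clamp guards against floating-point overshoot at the group maximum.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to approximate weights."""
    return [c * scale for c in codes]


# Illustrative weights only; real groups typically hold 32 to 128 values.
weights = [0.12, -0.5, 0.33, 0.07, -0.21, 0.44, -0.02, 0.7]
codes, scale = quantize_group(weights)
recon = dequantize_group(codes, scale)
worst = max(abs(w - r) for w, r in zip(weights, recon))
print(f"scale={scale:.4f} worst-case error={worst:.4f}")
```

Each weight lands within half a quantization step of its original value, so a single large weight in a group widens the step for every other weight in that group.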

Quantization | 2023 | Starter

QLoRA: Efficient Finetuning of Quantized LLMs

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer

A finetuning recipe that backpropagates through a frozen 4-bit model into LoRA adapters using NF4 and paged optimizers.

GPU Hunter takeaway: If a GPU can barely fit a base model for inference, QLoRA explains how adapter training can still be possible.
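
One reason adapter training can fit where full finetuning cannot is the size of the trainable set. A rank-r LoRA adapter on a d_in x d_out matrix adds two small factors with r * (d_in + d_out) parameters in total. The sketch below counts them; the 4096 x 4096 shape and rank 16 are assumed values for illustration, not figures from the paper.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors: (d_in x rank) and (rank x d_out)."""
    return rank * (d_in + d_out)


def full_matrix_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen base matrix that the adapter sits beside."""
    return d_in * d_out


# Hypothetical projection matrix and adapter rank.
d_in, d_out, rank = 4096, 4096, 16
adapter = lora_trainable_params(d_in, d_out, rank)
base = full_matrix_params(d_in, d_out)
print(f"adapter={adapter} base={base} ratio={adapter / base:.4%}")
```

Gradients and optimizer state are needed only for the adapter parameters, while the base weights stay frozen in 4-bit storage.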

advanced papers

These papers go deeper into kernels, cache policy, low-bit formats, or serving tradeoffs.

Quantization | 2026 | Intermediate

Diagnosing FP4 inference: a layer-wise and block-wise sensitivity analysis of NVFP4 and MXFP4

Musa Cim, Burak Topcu, Mahmut Taylan Kandemir

A component-wise sensitivity study of NVFP4 and MXFP4 quantization across Qwen2.5 model scales and transformer blocks.

GPU Hunter takeaway: Treat FP4 as a hardware capability that still needs model-aware quantization policy, not a universal speed switch.
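
To make the sensitivity point concrete, here is a simplified sketch of MXFP4-style block quantization. It assumes the commonly described layout: a block of values shares one power-of-two scale, and each element is stored as a 4-bit E2M1 float whose magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, and 6. The function name, the scale selection rule, and the tie-breaking are assumptions for illustration and may differ from what real kernels do.

```python
import math

# Representable magnitudes of a 4-bit E2M1 element (assumed format).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block_mxfp4(block):
    """Quantize one block with a shared power-of-two scale; returns (scale, dequantized)."""
    amax = max(abs(v) for v in block)
    if amax == 0:
        return 1.0, [0.0] * len(block)
    # Shared exponent: largest power of two <= amax, minus the E2M1 max exponent (2).
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    dequantized = []
    for v in block:
        magnitude = min(abs(v) / scale, E2M1_GRID[-1])   # saturate at 6
        nearest = min(E2M1_GRID, key=lambda g: abs(g - magnitude))
        dequantized.append(math.copysign(nearest, v) * scale)
    return scale, dequantized


# Illustrative block; a real MXFP4 block is assumed to hold 32 values.
scale, deq = quantize_block_mxfp4([0.1, -0.8, 3.0, 0.02])
print(scale, deq)
```

With this illustrative block, the largest value sets the shared scale and the two smallest values round to zero, which is the kind of loss that layer-wise and block-wise sensitivity analysis is meant to expose. Under the same assumed layout, an 8-bit shared scale over 32 elements adds 8/32 = 0.25 bits per weight on top of the 4-bit elements.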

Quantization | 2025 | Advanced

Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization

Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit

A study of MXFP4 and NVFP4 post-training quantization that introduces Micro-Rotated-GPTQ and reports GPU kernel-backed speedups.

GPU Hunter takeaway: RTX 5090 and Blackwell FP4 claims should be judged against real FP4 kernels and accuracy, not just advertised tensor formats.

Quantization | 2024 | Advanced

QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs

Saleh Ashkboos, Amirkeivan Mohtashami, Torsten Hoefler, James Hensman

A rotation-based quantization method that removes hidden-state outliers and quantizes weights, activations, and KV cache.

GPU Hunter takeaway: Good background for why next-gen GPUs may make activation and KV-cache precision just as important as weight precision.
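
A quick estimate shows why KV-cache precision matters alongside weight precision. The cache stores one key and one value vector per layer, per KV head, per token. The sketch below computes its size; the layer count, KV head count, head dimension, and context length are assumed values for a hypothetical 8B-class model with grouped-query attention, and per-group scale overhead for quantized caches is ignored.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: float) -> float:
    """Size of the KV cache for one sequence; the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value


# Assumed model shape and context length, for illustration only.
shape = dict(layers=32, kv_heads=8, head_dim=128, context_tokens=8192)
for label, nbytes in [("FP16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    gib = kv_cache_bytes(bytes_per_value=nbytes, **shape) / 2**30
    print(f"KV cache at {label}: {gib:.2f} GiB")
```

The cache grows linearly with context length and with the number of concurrent sequences, so at long contexts it can rival the size of 4-bit weights.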

full curated set

The rest of the curated set, with GPU Hunter takeaways.

Quantization | 2022 | Intermediate

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Guangxuan Xiao, Ji Lin, Mickael Seznec, Song Han

A post-training quantization approach that moves activation outlier difficulty into weights so W8A8 inference can stay accurate.

GPU Hunter takeaway: Read this when comparing INT8, W8A8, and FP8 serving paths on NVIDIA workstation or datacenter GPUs.
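
The idea of moving outlier difficulty from activations into weights can be sketched with a per-channel rescaling that leaves the layer output unchanged. The sketch below divides each input channel of the activations by a factor s_j and multiplies the matching weight row by the same factor, with s_j = max|X_j|^alpha / max|W_j|^(1 - alpha). The function names, the tiny matrices, and alpha = 0.5 are assumptions for illustration, and the code assumes no channel is all zeros.

```python
def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def smooth(x, w, alpha=0.5):
    """Rescale activations x (tokens x in) and weights w (in x out) per input channel."""
    n_in = len(w)
    act_max = [max(abs(row[j]) for row in x) for j in range(n_in)]
    w_max = [max(abs(v) for v in w[j]) for j in range(n_in)]
    s = [act_max[j] ** alpha / w_max[j] ** (1 - alpha) for j in range(n_in)]
    x_s = [[row[j] / s[j] for j in range(n_in)] for row in x]   # activations shrink
    w_s = [[v * s[j] for v in w[j]] for j in range(n_in)]       # weights absorb the scale
    return x_s, w_s


# Illustrative data: channel 1 carries an activation outlier.
x = [[0.5, 40.0, -0.2], [-0.3, -60.0, 0.1]]
w = [[0.2, -0.1], [0.05, 0.3], [-0.4, 0.25]]
x_s, w_s = smooth(x, w)
print("max |x| before:", max(abs(v) for row in x for v in row))
print("max |x| after: ", max(abs(v) for row in x_s for v in row))
```

The product of the rescaled activations and weights matches the original output up to floating-point error, while the activation range that an 8-bit quantizer has to cover becomes much smaller.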

Quantization | 2024 | Advanced

Extreme Compression of Large Language Models via Additive Quantization

Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Dan Alistarh

AQLM applies learned additive quantization to push LLM compression into the 2-3 bit range with practical CPU and GPU implementations.

GPU Hunter takeaway: Useful for builders choosing between a smaller dense model and an aggressively compressed larger model.
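
Additive quantization represents each small group of weights as a sum of codewords, one drawn from each of several codebooks, so only the code indices are stored per group. The sketch below shows the decode step and the resulting code bits per weight. The codebooks here are hand-written toy values rather than learned ones, the function names are hypothetical, and the configurations in the bit count are assumed examples; codebook storage itself is not counted.

```python
def decode_group(codebooks, codes):
    """Reconstruct one weight group by summing one codeword per codebook."""
    group_size = len(codebooks[0][0])
    out = [0.0] * group_size
    for book, code in zip(codebooks, codes):
        for i, value in enumerate(book[code]):
            out[i] += value
    return out


def code_bits_per_weight(num_codebooks: int, bits_per_code: int, group_size: int) -> float:
    """Index bits stored per weight, excluding the codebooks themselves."""
    return num_codebooks * bits_per_code / group_size


# Two toy codebooks with four codewords each, for groups of four weights.
codebooks = [
    [[0.5, -0.5, 0.25, 0.0], [1.0, 0.0, -0.5, 0.5], [0.0, 0.0, 0.0, 0.0], [-1.0, 0.5, 0.5, -0.25]],
    [[0.25, 0.25, 0.0, 0.0], [0.0, -0.25, 0.125, 0.5], [-0.5, 0.5, 0.25, -0.25], [0.0, 0.0, 0.0, 0.0]],
]
print(decode_group(codebooks, [1, 2]))
print(code_bits_per_weight(1, 16, 8), code_bits_per_weight(2, 8, 8))
```

One 16-bit code or two 8-bit codes per group of eight weights both work out to 2 index bits per weight, which is how this family of methods reaches the 2-3 bit range.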

FAQ

Which quantization papers should local AI users read first?

Start with the llama.cpp quantization evaluation, AWQ, GPTQ, and QLoRA (which introduces NF4). Then read the NVFP4 and MXFP4 papers when FP4 hardware support on Blackwell-class GPUs matters.

Does quantization always make inference faster?

No. Quantization reduces memory footprint, but speed depends on dequantization, packed-weight kernels, memory bandwidth, and whether the runtime supports that format well.
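
A simple bandwidth model illustrates the answer. In single-stream decoding, each generated token requires reading roughly all the weights once, so memory bandwidth divided by weight size gives a ceiling on tokens per second. The sketch below applies that rule of thumb. The 936 GB/s bandwidth, the weight sizes, and the `kernel_efficiency` factor (the fraction of peak bandwidth a runtime's kernels actually reach) are assumed values, not measurements.

```python
def decode_tokens_per_second_ceiling(weight_gb: float, bandwidth_gb_s: float,
                                     kernel_efficiency: float = 1.0) -> float:
    """Upper bound on batch-size-1 decode speed when weight reads dominate."""
    return bandwidth_gb_s * kernel_efficiency / weight_gb


bandwidth = 936.0  # assumed GB/s, for illustration
cases = [
    ("FP16, efficient kernels", 16.0, 1.0),
    ("4-bit, efficient kernels", 4.5, 1.0),
    ("4-bit, poorly supported format", 4.5, 0.25),
]
for label, weight_gb, efficiency in cases:
    tps = decode_tokens_per_second_ceiling(weight_gb, bandwidth, efficiency)
    print(f"{label}: up to {tps:.1f} tokens/s")
```

Under these assumptions the 4-bit model with weak kernel support ends up slower than the FP16 model despite using less than a third of the memory, which is why runtime support matters as much as file size.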

Source pages are linked directly to arXiv, PDFs, and Hugging Face Papers where available. GPU Hunter summaries are editorial context, not copies of the original abstracts.