Quantization | 2026 | Intermediate
Musa Cim, Burak Topcu, Mahmut Taylan Kandemir
A component-wise sensitivity study of NVFP4 and MXFP4 quantization across Qwen2.5 model scales and transformer blocks.
GPU Hunter takeaway: Treat FP4 as a hardware capability that still needs model-aware quantization policy, not a universal speed switch.
Quantization | 2025 | Advanced
Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit
A study of MXFP4 and NVFP4 post-training quantization that introduces Micro-Rotated-GPTQ and reports GPU kernel-backed speedups.
GPU Hunter takeaway: FP4 claims for the RTX 5090 and other Blackwell GPUs should be judged against real FP4 kernels and measured accuracy, not just advertised tensor formats.
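To make "FP4 format" concrete, here is a minimal sketch of block-scaled FP4 quantization in plain Python. It assumes the E2M1 element grid (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and a shared power-of-two scale per block, which is the general MXFP4 idea; the scale-selection rule below is a simplification chosen to avoid clipping, not the rule from any specification or from the papers above. NVFP4 differs mainly in using smaller blocks with a finer-grained scale. The function names are hypothetical.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (the sign is a separate bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block_fp4(block):
    """Quantize one block to E2M1 values with a shared power-of-two scale."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Assumption: pick the smallest power of two that maps amax into [0, 6].
    scale = 2.0 ** math.ceil(math.log2(amax / 6.0))
    codes = []
    for v in block:
        # Round the scaled magnitude to the nearest grid point, then restore the sign.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(math.copysign(mag, v))
    return scale, codes


def dequantize_block(scale, codes):
    """Map E2M1 codes back to real values using the shared block scale."""
    return [scale * c for c in codes]


# A well-behaved block keeps some resolution for every element...
print(quantize_block_fp4([0.1, -0.2, 0.3, 0.6]))
# ...while one outlier forces a large scale and crushes the small values.
print(quantize_block_fp4([0.1, -0.2, 0.3, 6.0]))
```

The second call shows why accuracy depends on more than the format itself: a single outlier in a block sets the scale for all of its neighbours.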
Local Inference | 2026 | Starter
Uygar Kurt
A unified empirical evaluation of llama.cpp quantization formats for Llama-3.1-8B-Instruct on commodity hardware.
GPU Hunter takeaway: A GPU recommendation is incomplete without naming the quantization formats that are realistic for that VRAM tier.
Quantization | 2023 | Starter
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Ji Lin, Jiaming Tang, Haotian Tang, Song Han
A hardware-friendly weight-only quantization method that protects salient channels based on activation statistics.
GPU Hunter takeaway: A practical paper for understanding why some 4-bit models are much faster than others on the same GPU.
Quantization | 2022 | Starter
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh
A one-shot weight quantization method that uses approximate second-order information to compress large GPT-style models to 3 or 4 bits per weight.
GPU Hunter takeaway: A 24GB card becomes far more useful when the model can be loaded at 4-bit without a major quality collapse.
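The arithmetic behind that takeaway can be sketched in a few lines. This is a rough estimate of weight storage only: it ignores quantization metadata (group scales and zero points), the KV cache, activations, and runtime overhead, so real usage is higher. The 30-billion-parameter model size is a hypothetical example, not a figure from the paper.

```python
def weight_gib(num_params, bits_per_weight):
    """Approximate weight storage in GiB: parameters x bits / 8 bytes."""
    return num_params * bits_per_weight / 8 / 2**30


# Hypothetical 30B-parameter model at three weight precisions.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_gib(30e9, bits):.1f} GiB")
```

Under these assumptions the 16-bit weights need roughly 56 GiB, while the 4-bit weights need roughly 14 GiB, which leaves headroom on a 24GB card for the cache and activations.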
Quantization | 2022 | Intermediate
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
Guangxuan Xiao, Ji Lin, Mickael Seznec, Song Han
A post-training quantization approach that migrates the quantization difficulty caused by activation outliers into the weights, so W8A8 inference can stay accurate.
GPU Hunter takeaway: Read this when comparing INT8, W8A8, and FP8 serving paths on NVIDIA workstation or datacenter GPUs.
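The idea of moving difficulty from activations into weights can be illustrated with a small sketch. It assumes the per-input-channel smoothing factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), with activations divided by s and the matching weight rows multiplied by s so the layer output is unchanged. The tiny matrices and helper names are made up for illustration, and plain Python lists stand in for tensors.

```python
def smooth(X, W, alpha=0.5):
    """Return (X_hat, W_hat, s) such that X_hat @ W_hat matches X @ W up to rounding."""
    n_in = len(W)
    s = []
    for j in range(n_in):
        act_max = max(abs(row[j]) for row in X)  # per input channel, over tokens
        w_max = max(abs(w) for w in W[j])        # per input channel, over outputs
        # Clamp to avoid division by zero on dead channels.
        s.append(max(act_max ** alpha / w_max ** (1 - alpha), 1e-5))
    X_hat = [[row[j] / s[j] for j in range(n_in)] for row in X]
    W_hat = [[s[j] * w for w in W[j]] for j in range(n_in)]
    return X_hat, W_hat, s


def matmul(A, B):
    """Naive matrix product for small nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Two tokens, two input channels; channel 1 carries an activation outlier.
X = [[0.5, 64.0], [-1.0, -32.0]]
W = [[0.5, -0.25], [0.25, 1.0]]
X_hat, W_hat, s = smooth(X, W)

print("scales:", s)
print("activation range before:", max(abs(v) for row in X for v in row))
print("activation range after: ", max(abs(v) for row in X_hat for v in row))
print("same output:", matmul(X, W), matmul(X_hat, W_hat))
```

With alpha = 0.5 the outlier channel's activation range and weight range meet in the middle, which is what makes both sides easier to quantize to 8 bits.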
Quantization | 2023 | Starter
QLoRA: Efficient Finetuning of Quantized LLMs
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer
A finetuning recipe that backpropagates through a frozen 4-bit model into LoRA adapters using NF4 and paged optimizers.
GPU Hunter takeaway: If a GPU can barely fit a base model for inference, QLoRA explains how adapter training can still be possible.
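A quick count shows why adapter training is so much cheaper than full finetuning. The sketch assumes the standard LoRA layout, where a frozen weight of shape d_out x d_in is paired with two trainable low-rank factors; the 4096 x 4096 layer and rank 16 are hypothetical values chosen only for illustration.

```python
def lora_trainable_params(d_in, d_out, rank):
    """LoRA adds A (rank x d_in) and B (d_out x rank) beside a frozen d_out x d_in weight."""
    return rank * (d_in + d_out)


d_in, d_out, rank = 4096, 4096, 16  # hypothetical layer shape and adapter rank
frozen = d_in * d_out
trainable = lora_trainable_params(d_in, d_out, rank)
print(f"frozen: {frozen:,}  trainable: {trainable:,}  ratio: {trainable / frozen:.2%}")
```

Only the adapter parameters need gradients and optimizer state, so the training overhead on top of the frozen 4-bit base stays small.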
Quantization | 2024 | Advanced
QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs
Saleh Ashkboos, Amirkeivan Mohtashami, Torsten Hoefler, James Hensman
A rotation-based quantization method that removes hidden-state outliers so that weights, activations, and KV cache can all be quantized to 4 bits.
GPU Hunter takeaway: Good background for why next-gen GPUs may make activation and KV-cache precision just as important as weight precision.
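The effect of a rotation on outliers can be seen with a plain normalized Walsh-Hadamard transform. This is only an illustration of the principle, assuming a fixed, non-randomized Hadamard matrix and a made-up 8-element hidden state; it is not the paper's full procedure.

```python
import math


def hadamard_rotate(x):
    """Apply a normalized Walsh-Hadamard transform; len(x) must be a power of two."""
    y = list(x)
    n = len(y)
    h = 1
    while h < n:
        # Butterfly step: combine elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    # Dividing by sqrt(n) makes the transform orthogonal (norm-preserving).
    return [v / math.sqrt(n) for v in y]


# Hypothetical hidden state with a single large outlier.
x = [0.1, -0.2, 0.1, 16.0, 0.0, 0.2, -0.1, 0.1]
r = hadamard_rotate(x)

print("max |x| before:", max(abs(v) for v in x))
print("max |x| after: ", max(abs(v) for v in r))
print("norm before:", math.sqrt(sum(v * v for v in x)))
print("norm after: ", math.sqrt(sum(v * v for v in r)))
```

The rotation keeps the vector's norm but spreads the outlier's energy across all coordinates, so the dynamic range a low-bit quantizer must cover shrinks. Because the transform is orthogonal, it can be undone exactly on the other side of the matrix multiply.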
Quantization | 2024 | Advanced
Extreme Compression of Large Language Models via Additive Quantization
Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Dan Alistarh
AQLM applies learned additive quantization to push LLM compression into the 2-3 bit range with practical CPU and GPU implementations.
GPU Hunter takeaway: Useful for builders choosing between a smaller dense model and an aggressively compressed larger model.