Diagnosing FP4 inference: a layer-wise and block-wise sensitivity analysis of NVFP4 and MXFP4

GPU Hunter summary of 2603.08747, focused on what this paper means for local AI inference, quantization, serving behavior, and hardware choice.

2026 / Intermediate / Published 2026-03-05 / Updated May 27, 2026

arXiv source PDF Hugging Face Papers

01 // short answer

This paper is a component-wise sensitivity study of NVFP4 and MXFP4 quantization across Qwen2.5 model scales and transformer blocks. FP4 support is a major selling point for Blackwell and next-generation accelerators, but not every layer tolerates the same low precision.

Blackwell buyers need this context because FP4 justifies a purchase only when model quality holds up and the runtime has kernels for the format.

03 // why GPU Hunter includes it

Not every layer tolerates FP4 equally, so the useful part for GPU Hunter readers is not the abstract result alone but the hardware implication: whether a model fits, whether a runtime can use the format, and whether throughput is limited by memory movement instead of arithmetic.
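The memory-movement question can be made concrete with a back-of-envelope bound. The following is a minimal sketch, not a result from the paper. It assumes single-stream decoding reads every weight once per token and ignores KV cache traffic, activations, and kernel overhead. The function name and the 7B model size are illustrative assumptions; the bandwidth figures are the ones listed in section 06.

```python
def decode_tokens_per_s_ceiling(params_billion, bits_per_weight, bandwidth_gb_s):
    """Upper bound on single-stream decode speed if every weight is read once per token."""
    weight_gb = params_billion * bits_per_weight / 8  # billions of params * bits / 8 = GB
    return bandwidth_gb_s / weight_gb


# Bandwidth values as listed in section 06; the 7B model size is illustrative.
GPUS = {"GeForce RTX 5090": 1792, "GeForce RTX 5080": 960}

for gpu, bandwidth in GPUS.items():
    for label, bits in [("16-bit", 16), ("4-bit", 4)]:
        ceiling = decode_tokens_per_s_ceiling(7, bits, bandwidth)
        print(f"{gpu}, 7B at {label}: at most ~{ceiling:.0f} tokens/s")
```

The 4x gap between the 16-bit and 4-bit rows is a best case. It shrinks when block scale factors, dequantization work, or layers kept in higher precision add traffic or compute.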

04 // local inference implications

Treat FP4 as a hardware capability that still needs a model-aware quantization policy, not a universal speed switch. For local inference, quantization is only valuable when it improves model fit without breaking the runtime path. A smaller checkpoint can still be slow if dequantization, packed-weight kernels, or activation handling are weak.
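Model fit can be estimated before downloading anything. The sketch below is an illustration, not the paper's method. It uses the block layouts of the two formats: NVFP4 stores one 8-bit scale per 16 four-bit elements, and MXFP4 stores one 8-bit scale per 32. Several things here are assumptions: the nominal Qwen2.5 sizes, the 80% VRAM headroom left for KV cache and activations, the helper names, and above all the idea that every weight is stored in FP4.

```python
# (element bits, block size, scale bits per block); None means no block scale.
FORMATS = {
    "FP16": (16, None, 0),
    "NVFP4": (4, 16, 8),   # 4-bit elements, one 8-bit scale per 16 elements
    "MXFP4": (4, 32, 8),   # 4-bit elements, one 8-bit scale per 32 elements
}

# VRAM figures as listed in section 06.
VRAM_GB = {"GeForce RTX 5080": 16, "GeForce RTX 5090": 32, "RTX PRO 6000 Blackwell": 96}


def bits_per_weight(element_bits, block_size, scale_bits):
    """Effective storage cost per weight, including the shared block scale."""
    if block_size is None:
        return float(element_bits)
    return element_bits + scale_bits / block_size


def weight_gb(params_billion, fmt):
    """Weight footprint in GB if every parameter used the given format."""
    return params_billion * bits_per_weight(*FORMATS[fmt]) / 8


def fits(params_billion, fmt, vram_gb, headroom=0.8):
    """True if the weights use at most `headroom` of VRAM (hypothetical margin)."""
    return weight_gb(params_billion, fmt) <= vram_gb * headroom


for size in [7, 14, 32, 72]:  # nominal Qwen2.5 sizes, in billions of parameters
    for fmt in FORMATS:
        ok = [gpu for gpu, vram in VRAM_GB.items() if fits(size, fmt, vram)]
        print(f"{size}B {fmt}: {weight_gb(size, fmt):.1f} GB, fits on {ok or 'none'}")
```

Because the paper's point is that some layers need more than FP4, a real mixed-precision checkpoint will land somewhere between the FP4 and FP16 rows.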

05 // key findings for hardware decisions

# FP4 sensitivity varies by layer and transformer block.

# NVFP4 and MXFP4 need model-aware policies rather than one global switch.

# Hardware FP4 support still depends on quantization strategy and backend support.
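What a per-layer policy looks like can be sketched with simulated quantization. This is a rough illustration under stated assumptions and does not reproduce the paper's methodology. Weight reconstruction error is used as a crude proxy for sensitivity. The block-16 variant uses an unrestricted float scale in place of NVFP4's FP8 scale. The block-32 variant rounds its scale up to a power of two, which is a simplification of MXFP4's shared exponent. The layer names, synthetic weights, and the 0.1 threshold are all hypothetical.

```python
import math
import random

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # non-negative FP4 (E2M1) magnitudes


def fake_quant_fp4(weights, block_size, pow2_scale):
    """Quantize then dequantize a flat list with one shared scale per block.

    len(weights) must be a multiple of block_size.
    """
    out = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        amax = max(abs(x) for x in block)
        if amax == 0.0:
            out.extend(0.0 for _ in block)
            continue
        scale = amax / E2M1_GRID[-1]  # map the block maximum onto the largest grid value
        if pow2_scale:
            # Assumption: round the scale up to a power of two so nothing clips.
            scale = 2.0 ** math.ceil(math.log2(scale))
        for x in block:
            mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
            out.append(math.copysign(mag * scale, x))
    return out


def rel_error(original, approx):
    """Relative L2 reconstruction error."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(original, approx)))
    return num / math.sqrt(sum(a * a for a in original))


def choose_precision(layers, threshold):
    """Hypothetical policy: keep a layer in FP4 only if its proxy error is small."""
    policy = {}
    for name, w in layers.items():
        err = rel_error(w, fake_quant_fp4(w, 16, False))
        policy[name] = "fp4" if err <= threshold else "higher-precision"
    return policy


rng = random.Random(0)
layers = {  # synthetic stand-ins, not real model weights
    "attn.q_proj": [rng.gauss(0, 1) for _ in range(4096)],
    "mlp.down_proj": [rng.gauss(0, 1) * (10 if rng.random() < 0.01 else 1) for _ in range(4096)],
}
for name, w in layers.items():
    e16 = rel_error(w, fake_quant_fp4(w, 16, False))
    e32 = rel_error(w, fake_quant_fp4(w, 32, True))
    print(f"{name}: block-16 float scale {e16:.3f}, block-32 pow2 scale {e32:.3f}")
print(choose_precision(layers, threshold=0.1))
```

In practice the decision would be driven by task accuracy or perplexity measured per component, not by weight error alone, and the runtime must also support running the chosen mix of formats.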

06 // what it means for GPU choice

Use this paper when comparing the GeForce RTX 5090, RTX PRO 6000 Blackwell, and GeForce RTX 5080. It helps decide whether a GPU's supported precision formats are practical advantages or just spec-sheet features.

GeForce RTX 5090

32GB VRAM / 1792 GB/s / $1999

RTX PRO 6000 Blackwell

96GB VRAM / 1792 GB/s / $8499

GeForce RTX 5080

16GB VRAM / 960 GB/s / $999