ModeSwitch-LLM: A Lightweight Phase-Aware Controller for Cross-Mode LLM Inference on a Single GPU
A desktop GPU can serve different prompts better when the runtime switches strategy per request instead of locking into one mode.
Curated 2026 LLM inference research across local AI, FP4 quantization, KV cache optimization, kernels, AMD serving, and single-GPU systems.
This page groups the 2026 papers in the GPU Hunter research library. The recurring theme is deployability: cache formats, kernel paths, adaptive runtime decisions, and hardware-aware quantization.
Use this page when you want a chronological view before moving into the topic clusters.
Long-context claims on consumer GPUs will increasingly depend on KV-cache-specific quantization, not just weight quantization.
For vLLM-style serving, deployability matters as much as compression ratio; the cache format has to fit the kernel path.
The right KV cache strategy depends on workload shape; no one approach wins across every GPU, context length, and batch size.
Small GPUs, laptops, and edge boxes need cache precision policies that spend memory only where the model is sensitive.
Quantized model size alone is not enough; GPU Hunter benchmarks should also check whether the backend has fast unpacking and dequant kernels.
Use this to decide whether a workload needs cache compression, offload, eviction, or a hybrid policy before buying more VRAM.
For AMD datacenter GPUs, model architecture, KV offload support, runtime kernels, and block size choices can matter more than headline hardware specs.
Treat FP4 as a hardware capability that still needs model-aware quantization policy, not a universal speed switch.
For agent and long-document workloads, cache compression quality may matter more than raw single-token decode speed.
For dual-GPU and workstation setups, interconnect bandwidth and cache placement can change whether extra GPUs actually help.
A GPU recommendation is incomplete without naming the quantization formats that are realistic for that VRAM tier.
Use speculative decoding carefully: the speedup depends on workload, draft method, batch size, and engine implementation.
Blackwell-class FP4 gains will depend on quantization methods designed around NVFP4's real block and precision rules.
Kernel maturity is a hardware feature in practice; the same GPU can behave very differently across inference backends.
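
Several takeaways above turn on how much memory the KV cache needs at a given context length and precision. A minimal sketch of that arithmetic follows, assuming a standard decoder-only transformer that stores one key and one value vector per layer, per KV head, per token. The function names and the model shape in the example are hypothetical and do not come from any paper listed here. Real quantized caches also store scales and other metadata, which this estimate ignores.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem):
    """Approximate KV cache size in bytes.

    The factor 2 covers keys and values. Quantization metadata
    (scales, zero points) is NOT included, so low-bit figures are
    optimistic lower bounds.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


def fits_in_vram(cache_bytes, weight_bytes, vram_bytes, headroom=0.9):
    """Rough check: weights plus cache must fit within a fraction of VRAM.

    headroom is an assumed safety margin for activations and runtime
    overhead, not a measured value.
    """
    return cache_bytes + weight_bytes <= headroom * vram_bytes


GIB = 1024 ** 3

# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32768, batch_size=1)

fp16_cache = kv_cache_bytes(**shape, bytes_per_elem=2)    # 16-bit cache
int8_cache = kv_cache_bytes(**shape, bytes_per_elem=1)    # 8-bit cache
int4_cache = kv_cache_bytes(**shape, bytes_per_elem=0.5)  # 4-bit cache, metadata ignored

for name, size in [("fp16", fp16_cache), ("8-bit", int8_cache), ("4-bit", int4_cache)]:
    print(f"{name}: {size / GIB:.1f} GiB")
```

For this assumed shape the 16-bit cache at 32,768 tokens comes to 4 GiB, and halving the element width halves the cache. Comparing that figure against free VRAM after weights are loaded is one way to judge whether compression, offload, or eviction is worth considering before buying more VRAM.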