Harvest: Opportunistic Peer-to-Peer GPU Caching for LLM Inference

GPU Hunter summary of 2602.00328, focused on what this paper means for local AI inference, quantization, serving behavior, and hardware choice.

2026 / Intermediate / Published 2026-01-30 / Updated May 27, 2026


01 // short answer

Harvest is a peer-to-peer GPU cache management framework that uses high-bandwidth GPU interconnects to reduce host-memory offload latency. Multi-GPU inference is often limited by where model state and KV tensors live, not just by aggregate VRAM on paper.

This paper helps explain why a dual-GPU workstation is not automatically twice as useful as one stronger card.

03 // why GPU Hunter includes it

The useful part for GPU Hunter readers is not the abstract result alone but the hardware implication: whether a model fits, whether a runtime can use the format, and whether throughput is limited by memory movement instead of arithmetic.

04 // local inference implications

For dual-GPU and workstation setups, interconnect bandwidth and cache placement can change whether extra GPUs actually help. For shared inference, the important question is not only how fast one prompt runs. Batching, scheduling, cache placement, and request mix decide whether a GPU behaves like a reliable service.
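To make the interconnect point concrete, here is a minimal back-of-envelope sketch of how long it takes to move a block of cached state over two kinds of link. The bandwidth figures and the 2 GiB block size are illustrative assumptions, not measurements from the paper, and the names `transfer_ms`, `compare_links`, and `LINKS_GB_S` are hypothetical.

```python
def transfer_ms(num_bytes: float, bandwidth_gb_s: float) -> float:
    """Milliseconds needed to move num_bytes at bandwidth_gb_s (1 GB = 1e9 bytes)."""
    if bandwidth_gb_s <= 0:
        raise ValueError("bandwidth must be positive")
    return num_bytes / (bandwidth_gb_s * 1e9) * 1e3


# Assumed effective link speeds in GB/s. Replace with numbers measured
# on your own system; these are placeholders for illustration only.
LINKS_GB_S = {
    "host_memory_link": 25.0,   # assumed host <-> GPU rate
    "peer_gpu_link": 100.0,     # assumed GPU <-> GPU rate
}


def compare_links(num_bytes: float, links: dict = LINKS_GB_S) -> dict:
    """Transfer time in ms for the same payload over each link."""
    return {name: transfer_ms(num_bytes, bw) for name, bw in links.items()}


if __name__ == "__main__":
    block = 2 * 1024**3  # hypothetical 2 GiB chunk of KV cache
    for name, ms in compare_links(block).items():
        print(f"{name}: {ms:.1f} ms")
```

The estimate ignores latency, contention, and protocol overhead, so treat it as a lower bound. Its only purpose is to show that the same payload can cost several times more to fetch from host memory than from a peer GPU when the peer link is faster.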

05 // key findings for hardware decisions

- Aggregate VRAM helps only if model state and caches can move between GPUs efficiently.
- GPU interconnect bandwidth can change the economics of multi-GPU inference.
- Cache placement can reduce host-memory offload latency.
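The cache placement idea can be sketched as a three-tier lookup: local VRAM first, then a peer GPU, then host memory. This is a toy model under stated assumptions, not the paper's algorithm. The class `TieredCache`, the per-tier costs in `TIER_COST_US`, and the eviction rule are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical fetch cost per tier in microseconds. Placeholder values
# chosen only to order the tiers from fastest to slowest.
TIER_COST_US = {"local": 1.0, "peer": 20.0, "host": 200.0}


@dataclass
class TieredCache:
    """Toy three-tier cache: local VRAM, peer GPU VRAM, host memory."""
    local: dict = field(default_factory=dict)
    peer: dict = field(default_factory=dict)
    host: dict = field(default_factory=dict)

    def lookup(self, key):
        """Return (tier, value, cost_us), searching the fastest tier first."""
        for tier in ("local", "peer", "host"):
            store = getattr(self, tier)
            if key in store:
                return tier, store[key], TIER_COST_US[tier]
        raise KeyError(key)

    def evict_local(self, key, peer_has_room: bool) -> str:
        """Move a block out of local VRAM.

        Assumed policy: use the peer GPU when it has spare room,
        otherwise fall back to host memory.
        """
        value = self.local.pop(key)
        tier = "peer" if peer_has_room else "host"
        getattr(self, tier)[key] = value
        return tier


if __name__ == "__main__":
    cache = TieredCache(local={"block-0": b"kv"})
    cache.evict_local("block-0", peer_has_room=True)
    print(cache.lookup("block-0"))
```

The design choice worth noticing is that the eviction target decides the cost of the next hit. A block parked on a peer GPU comes back at peer-link cost, while one sent to host memory pays the slower path.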

06 // what it means for GPU choice

Use this paper when comparing the GeForce RTX 3090, GeForce RTX 4090, and RTX PRO 6000 Blackwell. Serving workloads need enough VRAM, strong bandwidth, and runtime features that survive batching and concurrency.

GPU                       VRAM     Memory bandwidth   Listed price
GeForce RTX 3090          24 GB    936 GB/s           $749
GeForce RTX 4090          24 GB    1008 GB/s          $1799
RTX PRO 6000 Blackwell    96 GB    1792 GB/s          $8499
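Whether a serving workload fits on one of these cards depends heavily on KV cache size, which grows with context length and batch size. The sketch below uses the standard estimate of 2 (keys and values) × layers × KV heads × head dimension × sequence length × batch × bytes per element. The model shape, the 16 GB weight footprint, and the 2 GB runtime overhead are hypothetical values chosen for illustration. The VRAM figures are the ones listed above, treated as 1 GB = 1e9 bytes.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    """Estimated KV cache size in bytes; the leading 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


# VRAM in GB, as listed in the comparison above.
GPUS_GB = {
    "GeForce RTX 3090": 24,
    "GeForce RTX 4090": 24,
    "RTX PRO 6000 Blackwell": 96,
}


def fits(weights_gb: float, kv_bytes: int, vram_gb: float,
         overhead_gb: float = 2.0) -> bool:
    """True if weights + KV cache + an assumed runtime overhead fit in VRAM."""
    return weights_gb + kv_bytes / 1e9 + overhead_gb <= vram_gb


if __name__ == "__main__":
    # Hypothetical model: 32 layers, 8 KV heads, head_dim 128, FP16 cache,
    # serving 16 concurrent requests at 8192 tokens each.
    kv = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                        seq_len=8192, batch=16)
    print(f"KV cache: {kv / 1e9:.2f} GB")
    for name, vram in GPUS_GB.items():
        verdict = "fits" if fits(16.0, kv, vram) else "does not fit"
        print(f"{name}: {verdict}")
```

When the single-card check fails, the state has to live somewhere else, either on a second GPU or in host memory. That is the situation where the interconnect and cache placement questions raised by the paper start to matter.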