
Harvest: Opportunistic Peer-to-Peer GPU Caching for LLM Inference

GPU Hunter summary of 2602.00328, focused on what this paper means for local AI inference, quantization, serving behavior, and hardware choice.

2026 | Intermediate | Published 2026-01-30 | Updated May 27, 2026
01  //  short answer

Harvest is a peer-to-peer GPU cache management framework that uses high-bandwidth GPU interconnects to reduce host-memory offload latency. Multi-GPU inference is often limited by where model state and KV tensors live, not just by aggregate VRAM on paper.

This paper helps explain why two GPUs in a workstation are not automatically twice as useful as one, or better than a single stronger card.

03  //  why GPU Hunter includes it

The useful part for GPU Hunter readers is not the abstract result alone but the hardware implication: whether a model fits, whether a runtime can use the format, and whether throughput is limited by memory movement rather than arithmetic.
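To make the "whether a model fits" question concrete, here is a rough sizing sketch for KV tensors. It uses the standard per-token accounting for attention caches (keys plus values, per layer, per KV head). The model shape below is hypothetical and is not taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem=2):
    """Approximate bytes needed to cache keys and values for num_tokens tokens."""
    # The factor of 2 covers the separate key and value tensors in each layer.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


# Hypothetical model shape with 16-bit cache entries, chosen only for illustration.
per_token = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, num_tokens=1)
total = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, num_tokens=32_768)

print(per_token)      # 131072 bytes, i.e. 128 KiB per token
print(total / 2**30)  # 4.0 GiB for 32,768 cached tokens
```

Under these assumed numbers, a few long-context requests can consume as much VRAM as a small model's weights, which is why cache location matters as much as raw capacity.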

04  //  local inference implications

For dual-GPU and workstation setups, interconnect bandwidth and cache placement can change whether extra GPUs actually help. For shared inference, the important question is not only how fast one prompt runs. Batching, scheduling, cache placement, and request mix decide whether a GPU behaves like a reliable service.
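A back-of-the-envelope way to see why interconnect bandwidth matters: the time to fetch a cached block is roughly its size divided by the link bandwidth. The bandwidth figures below are placeholder assumptions for illustration, not measurements from the paper or specifications of any particular GPU.

```python
def transfer_ms(num_bytes, bandwidth_gb_s, latency_us=0.0):
    """Estimated transfer time in milliseconds: fixed latency plus size / bandwidth."""
    return latency_us / 1000 + num_bytes / (bandwidth_gb_s * 1e9) * 1000


BLOCK_BYTES = 256 * 2**20  # a 256 MiB chunk of cached state (assumed size)
HOST_LINK_GB_S = 25.0      # assumed effective GPU <-> host memory bandwidth
PEER_LINK_GB_S = 100.0     # assumed effective GPU <-> GPU bandwidth

host_ms = transfer_ms(BLOCK_BYTES, HOST_LINK_GB_S)
peer_ms = transfer_ms(BLOCK_BYTES, PEER_LINK_GB_S)

print(f"host fetch: {host_ms:.2f} ms, peer fetch: {peer_ms:.2f} ms")
print(f"speedup from peer placement: {host_ms / peer_ms:.1f}x")
```

If the two GPUs can only reach each other through the same link they use to reach host memory, the two bandwidth values converge and the advantage of peer placement shrinks accordingly.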

05  //  key findings for hardware decisions
# Aggregate VRAM only helps if model state and caches can move between GPUs efficiently.
# GPU interconnect bandwidth can change the economics of multi-GPU inference.
# Cache placement can reduce host-memory offload latency.
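
The cache-placement idea in these findings can be illustrated with a toy three-tier store: local VRAM first, then a peer GPU's spare VRAM, then host memory. This is a minimal sketch under simple assumptions (LRU eviction, fixed block counts, hypothetical class and tier names) and is not the paper's actual policy or API.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy cache that spills blocks from local VRAM to peer VRAM to host memory."""

    def __init__(self, local_blocks, peer_blocks):
        # Host memory is treated as unbounded in this sketch.
        self.capacity = {"local": local_blocks, "peer": peer_blocks}
        self.tiers = {"local": OrderedDict(), "peer": OrderedDict(), "host": OrderedDict()}

    def put(self, key, block):
        self._insert("local", key, block)

    def _insert(self, tier, key, block):
        store = self.tiers[tier]
        store[key] = block
        store.move_to_end(key)  # mark as most recently used
        cap = self.capacity.get(tier)
        if cap is not None and len(store) > cap:
            # Evict the least recently used block to the next, slower tier.
            old_key, old_block = store.popitem(last=False)
            next_tier = "peer" if tier == "local" else "host"
            self._insert(next_tier, old_key, old_block)

    def get(self, key):
        """Return (block, tier_it_was_found_in), promoting the block to local VRAM."""
        for tier in ("local", "peer", "host"):
            if key in self.tiers[tier]:
                block = self.tiers[tier].pop(key)
                self._insert("local", key, block)
                return block, tier
        return None, None


cache = TieredKVCache(local_blocks=1, peer_blocks=1)
for name in ("a", "b", "c"):
    cache.put(name, f"kv-{name}")

print(cache.get("b"))  # ('kv-b', 'peer'): served from the peer GPU, not host memory
```

The point of the middle tier is that a block evicted from local VRAM gets one more chance to be served over the GPU interconnect before it falls back to the slower host path.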
06  //  what it means for GPU choice

Use this paper when comparing the GeForce RTX 3090, GeForce RTX 4090, and RTX PRO 6000 Blackwell. Serving workloads need enough VRAM, strong bandwidth, and runtime features that survive batching and concurrency.

source links

This page is GPU Hunter editorial context and does not reproduce the paper abstract. Use the original arXiv, PDF, and Hugging Face links for the complete paper text and author-provided details.

arXiv source PDF Hugging Face Papers
Research page last updated 2026-05-27. Source paper published 2026-01-30.