How much VRAM does the Intel Arc B580 have?

The Intel Arc B580 has 12GB of GDDR6 memory with 456 GB/s bandwidth.

Can the Intel Arc B580 run Qwen3 72B?

No. At Q4 it requires ~42GB, and the Intel Arc B580 has 12GB VRAM. Qwen3 32B at Q4 (~19GB) does not fit either; consider Qwen3 14B, which fits at Q4 (~8GB).

What is the Intel Arc B580 inference speed?

On Llama 8B Q4_K_M with llama.cpp, the Intel Arc B580 achieves 35 tok/s decode speed. Q8 runs at 21 tok/s, and FP16 at 12 tok/s.

Intel Arc B580 Benchmarks — 12GB VRAM, 35 tok/s | GPU Hunter

Name: Intel Arc B580
Brand: Intel
Price: 249 USD
Availability: InStock

01 // Inference benchmarks

Single-stream decode · llama.cpp

Llama 8B · Q4_K_M: 35 t/s
Llama 8B · Q8_0: 21 t/s
Llama 8B · FP16: 12 t/s

# env llama.cpp b4732 · 4096 ctx · batch=1 · prompt=512 · temp=0.0 · median of 5 runs
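Single-stream decode is usually limited by how fast the weights can be read from VRAM, so a rough ceiling is bandwidth divided by bytes read per token. The sketch below compares the measured figures above against that ceiling. It assumes the benchmarked Llama 8B has roughly the footprint this page lists for Llama 3.1 8B (5 GB at Q4, 9 GB at Q8) and that each token reads the whole footprint once; both are simplifications, and FP16 is left out because the listed 16 GB footprint exceeds the card's 12 GB.

```python
# Rough decode ceiling for a memory-bandwidth-bound GPU (an estimate, not a measurement).
BANDWIDTH_GB_S = 456  # memory bandwidth from this page's spec sheet

# quant: (approx. footprint in GB from the model-fit figures, measured tok/s from the benchmarks)
# Assumption: the benchmarked "Llama 8B" matches the "Llama 3.1 8B" footprints.
RUNS = {
    "Q4_K_M": (5, 35),
    "Q8_0": (9, 21),
}


def decode_ceiling(bandwidth_gb_s, footprint_gb):
    """Upper bound on tok/s if every generated token reads the full footprint once."""
    return bandwidth_gb_s / footprint_gb


def summarize(runs=RUNS, bandwidth_gb_s=BANDWIDTH_GB_S):
    """Return {quant: (ceiling tok/s, measured / ceiling)} rounded for display."""
    rows = {}
    for quant, (footprint_gb, measured) in runs.items():
        ceiling = decode_ceiling(bandwidth_gb_s, footprint_gb)
        rows[quant] = (round(ceiling, 1), round(measured / ceiling, 2))
    return rows


if __name__ == "__main__":
    for quant, (ceiling, fraction) in summarize().items():
        print(f"{quant}: ceiling ~{ceiling} tok/s, measured is {fraction:.0%} of that")
```

Under these assumptions the measured speeds land at roughly 40% of the bandwidth ceiling, which is a sanity check on the numbers rather than a statement about the backend.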

01b // Performance across quantization

vs. nearest competitors

How tok/s scales from FP16 → Q8 → Q4 compared to GPUs in a similar price/VRAM range.
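As a small worked example of that scaling, the snippet below computes the speedup of each quantization relative to FP16, using only the three decode figures reported on this page. The function name is illustrative.

```python
# Decode speeds reported on this page for Llama 8B (tok/s).
MEASURED_TOK_S = {"FP16": 12, "Q8_0": 21, "Q4_K_M": 35}


def speedup_vs_fp16(measured=MEASURED_TOK_S):
    """Return each quantization's throughput as a multiple of the FP16 figure."""
    base = measured["FP16"]
    return {quant: round(tok_s / base, 2) for quant, tok_s in measured.items()}


if __name__ == "__main__":
    for quant, factor in speedup_vs_fp16().items():
        print(f"{quant}: {factor}x FP16")
```

If decode were limited purely by bytes read per token, halving the weight width from 16 to 8 bits would roughly double throughput; the reported Q8_0 figure comes in a little under that at 1.75x.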

02 // Hardware specs

Architecture: Xe2-HPG
Process node: TSMC 4nm
Memory: 12 GB
Memory bandwidth: 456 GB/s
FP16 compute: 14.6 TFLOPS
INT8 compute: 29 TOPS
TDP: 190 W
PCIe: Gen 4 x8
Form factor: Dual-slot
Cooling: Axial

03 // Model fit

Approximate VRAM required to load weights + 4096 ctx KV cache. FITS marks configurations that fit in this card's 12 GB.

| Model | Context | Q4 | Q8 | FP16 |
| --- | --- | --- | --- | --- |
| Qwen3 32B | 128k ctx | 19 GB | 36 GB | 64 GB |
| Qwen3 72B | 128k ctx | 42 GB | 78 GB | 144 GB |
| Qwen3 235B | 128k ctx | 132 GB | 240 GB | 470 GB |
| Llama 3.3 70B | 128k ctx | 40 GB | 75 GB | 140 GB |
| DeepSeek V3 | 128k ctx | 380 GB | 700 GB | 1300 GB |
| Llama 3.1 8B | 128k ctx | 5 GB (FITS) | 9 GB (FITS) | 16 GB |
| Qwen3 14B | 128k ctx | 8 GB (FITS) | 15 GB | 28 GB |
| Mistral 7B | 32k ctx | 4 GB (FITS) | 8 GB (FITS) | 14 GB |
| Gemma 2 27B | 8k ctx | 16 GB | 30 GB | 54 GB |
| Codestral 22B | 32k ctx | 13 GB | 24 GB | 44 GB |
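The FITS labels above follow from comparing each footprint with the card's 12 GB. A minimal sketch of that check is below; it assumes the three figures listed per model are the Q4, Q8, and FP16 footprints in that order, and it treats "fits" as footprint less than or equal to total VRAM, which ignores runtime overhead and any context beyond 4096 tokens.

```python
VRAM_GB = 12  # Intel Arc B580, from this page
QUANTS = ("Q4", "Q8", "FP16")

# model: (Q4, Q8, FP16) footprints in GB, copied from the figures on this page.
FOOTPRINT_GB = {
    "Qwen3 32B": (19, 36, 64),
    "Qwen3 72B": (42, 78, 144),
    "Qwen3 235B": (132, 240, 470),
    "Llama 3.3 70B": (40, 75, 140),
    "DeepSeek V3": (380, 700, 1300),
    "Llama 3.1 8B": (5, 9, 16),
    "Qwen3 14B": (8, 15, 28),
    "Mistral 7B": (4, 8, 14),
    "Gemma 2 27B": (16, 30, 54),
    "Codestral 22B": (13, 24, 44),
}


def fitting_quants(footprints=FOOTPRINT_GB, vram_gb=VRAM_GB):
    """Return {model: [quantizations whose footprint is within vram_gb]}."""
    return {
        model: [q for q, gb in zip(QUANTS, sizes) if gb <= vram_gb]
        for model, sizes in footprints.items()
    }


if __name__ == "__main__":
    for model, quants in fitting_quants().items():
        print(f"{model}: {', '.join(quants) if quants else 'does not fit'}")
```

Passing a different `vram_gb` gives the same comparison for another card, for example when weighing this GPU against a 16 GB or 24 GB alternative.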

+ STRENGTHS

✓ 12GB VRAM is enough for 14B-class models at Q4
✓ 456 GB/s memory bandwidth · top tier in its class
✓ Strong tooling: FP16, Q8, Q4 all officially supported

− TRADE-OFFS

− Draws 190W under load; plan PSU and thermals accordingly
− Limited to dual-slot chassis
− Driver lock-in to vendor stack

related research

Research behind Intel Arc B580 inference tradeoffs

These papers explain the quantization, cache, bandwidth, and runtime constraints that matter before buying this GPU for local AI.

GPU inference optimization papers

Memory bandwidth, FlashAttention, dequant kernels, and backend maturity.

Local AI inference papers

llama.cpp, Apple Silicon, constrained GPUs, offload, and one-box inference.

LLM serving systems papers

vLLM, PagedAttention, speculative decoding, batching, and GPU servers.

04 // You may also be considering

RTX PRO 6000 Blackwell
