Best GPU for Ollama in 2026: 5 Picks That Actually Work
NVIDIA RTX 3090 (Used)
~$800–1,050
24 GB VRAM with full CUDA support means every Ollama model just works. At ~$800–1,050 used, nothing else comes close on value.
| | ★ RTX 3090 (Used) Our Pick | RTX 4060 Ti 16GB Best New Card | Intel Arc B580 Budget Experiment | RTX 4090 Fastest | RX 7900 XTX Best AMD |
|---|---|---|---|---|---|
| VRAM | 24 GB GDDR6X | 16 GB GDDR6 | 12 GB GDDR6 | 24 GB GDDR6X | 24 GB GDDR6 |
| Bandwidth | 936 GB/s | 288 GB/s | 456 GB/s | 1,008 GB/s | 960 GB/s |
| Ollama Backend | CUDA (native) | CUDA (native) | SYCL (experimental) | CUDA (native) | ROCm |
| llama3 8B tok/s | ~112 | ~89 | ~25–35* | ~128 | ~37 |
| codellama 13B tok/s | ~85 | ~14 | Does not fit | ~110 | ~32 |
| Price | ~$800–1,050 | ~$450 | ~$250 | ~$2,755 | ~$1,300 |
| | Check Price → | Check Price → | Check Price → | Check Price → | Check Price → |
Ollama makes running local LLMs simple — ollama run llama3 and you’re generating text. But that simplicity hides a critical dependency: your GPU determines whether you get 130 tokens per second or 3. Pick the wrong card and Ollama silently offloads model layers to system RAM, dropping you from interactive chat speed to something that feels like waiting for a dial-up page to load.
The GPU choice for Ollama is more nuanced than for general LLM inference because Ollama has specific backend requirements. CUDA works out of the box on NVIDIA cards. ROCm works on AMD but only on Linux and with significantly lower performance. Intel’s SYCL backend is experimental at best. Your GPU purchase is also a backend decision.
This guide covers five GPUs that actually work with Ollama in March 2026, with real ollama run performance numbers on the models people actually use: llama3, mistral, codellama, and deepseek-coder. For a broader look at GPUs for all local inference frameworks, see our best GPU for local LLMs guide.
Our Pick: NVIDIA RTX 3090 (Used) — Best Overall for Ollama
The used RTX 3090 is the best GPU for most Ollama users. It has 24 GB of VRAM, full CUDA support, and costs a fraction of what you’d pay for comparable NVIDIA cards still in production.
Specs: 24 GB GDDR6X · 384-bit bus · 936 GB/s bandwidth · 10,496 CUDA cores · 350W TDP
Ollama Performance:
- ollama run llama3: ~112 tok/s generation
- ollama run codellama:13b: ~85 tok/s generation
- ollama run deepseek-coder:33b-instruct-q4_K_M: Fits in 24 GB, ~20–25 tok/s
- ollama run llama3:70b-q4_K_M: Requires partial offload — ~4–6 tok/s (not recommended)
Ollama’s CUDA backend detects the RTX 3090 automatically. No configuration, no environment variables, no driver debugging. Run ollama run llama3 and it loads the model into VRAM and starts generating. This “it just works” factor matters more than benchmarks for most home lab users who want a local AI tool, not a driver engineering project.
The 24 GB VRAM is the key specification. At Q4_K_M quantization — Ollama’s default for most models — 24 GB fits every model up to the mid-30B range with room for KV cache and context. That covers llama3 8B, mistral 7B, codellama 7B/13B/34B, deepseek-coder 33B, and command-r at 35B. The only popular Ollama models that won’t fit are 70B variants, which need roughly 42 GB of VRAM — in practice, a dual-GPU setup.
At ~$800–1,050 on the used market, the RTX 3090 costs roughly a third of a discontinued RTX 4090 at ~$2,755. The speed difference is only 15–20%, and on most Ollama models, both cards feel instant. Anything above 30 tok/s is faster than you can read.
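The value argument is easy to put in numbers. A back-of-the-envelope sketch using the prices and llama3 8B speeds above ($925 is an assumed midpoint of the 3090's used range, not a quoted price):

```python
# Rough dollars-per-throughput comparison from this guide's numbers.
cards = {
    "RTX 3090 (used)": (925, 112),   # (price USD, llama3 8B tok/s)
    "RTX 4090": (2755, 128),
}

def dollars_per_tok_s(price_usd, tok_s):
    # Price paid per token/second of llama3 8B generation throughput.
    return price_usd / tok_s

for name, (price, speed) in cards.items():
    print(f"{name}: ${dollars_per_tok_s(price, speed):.2f} per tok/s")
# The 3090 lands near $8.26 per tok/s, the 4090 near $21.52.
```

Roughly 2.6x more money per unit of generation speed for the 4090, which is the whole case for buying used.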
Ollama multi-GPU note: If you can find two RTX 3090s at ~$800 each, Ollama will automatically split model layers across both cards. That gives you 48 GB of effective VRAM — enough for 70B models at full GPU speed. Two 3090s at ~$1,600–2,100 total is a better deal than a single 4090 at ~$2,755, with double the VRAM and roughly matching aggregate throughput.
Buying used tips: Check VRAM junction temps with GPU-Z (they should stay under 100°C under load). Run a stress test and verify there are no artifacts. Buy from sellers with at least 30-day returns. The RTX 3090 launched during the mining era, but GPU silicon doesn’t degrade from compute — the main risk is worn-out fans, which are cheap to replace.
For the detailed performance comparison, see RTX 3090 vs 4090 for LLMs.
Best New Card: NVIDIA RTX 4060 Ti 16GB
The RTX 4060 Ti 16GB is the best new GPU you can buy specifically for Ollama if your budget is under $500. Full CUDA, 16 GB VRAM, full warranty, and a 165W TDP that makes it ideal for always-on servers.
Specs: 16 GB GDDR6 · 128-bit bus · 288 GB/s bandwidth · 4,352 CUDA cores · 165W TDP
Ollama Performance:
- ollama run llama3: ~89 tok/s generation
- ollama run mistral: ~92 tok/s generation
- ollama run codellama:13b: ~14 tok/s generation (bandwidth-limited)
- ollama run codellama:34b: Does not fit — heavy offloading, ~2–3 tok/s
On 7B–8B models — which represent the majority of what people actually run in Ollama — the 4060 Ti 16GB delivers excellent performance. 89 tok/s on llama3 8B is fast enough that the output streams faster than you can read. Mistral 7B generates at 92 tok/s. For a single-user Ollama setup running smaller models, this card is genuinely great.
The 288 GB/s bandwidth becomes the bottleneck on larger models. Codellama 13B fits in 16 GB VRAM but generates at only ~14 tok/s because each token requires reading the entire model weights from memory. On a 3090 with 936 GB/s, that same 13B model runs at 85 tok/s. If you plan to regularly run 13B+ models, the 4060 Ti is not the right card.
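This bottleneck is easy to sanity-check: because every generated token streams the full quantized weights out of VRAM once, bandwidth divided by model size gives a hard ceiling on tok/s. A rough sketch using the ~7.9 GB Q4_K_M size of codellama 13B from the VRAM table later in this guide (real throughput lands below the ceiling due to kernel and scheduling overhead):

```python
def roofline_tok_s(bandwidth_gb_s, model_size_gb):
    # Hard upper bound on generation speed: each token must read the
    # full set of quantized weights from VRAM once.
    return bandwidth_gb_s / model_size_gb

Q4_13B_GB = 7.9  # codellama 13B at Q4_K_M

print(roofline_tok_s(936, Q4_13B_GB))  # RTX 3090 ceiling ~118 tok/s (measured: ~85)
print(roofline_tok_s(288, Q4_13B_GB))  # RTX 4060 Ti ceiling ~36 tok/s (measured: ~14)
```

The 4060 Ti's 288 GB/s caps 13B generation in the mid-30s before any software inefficiency, which is why no driver tweak will make it fast on that model.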
Where the 4060 Ti genuinely excels is power efficiency for always-on Ollama servers. At ~100W actual draw during inference, running 24/7 at $0.15/kWh costs roughly $130/year in electricity. A 3090 at ~200W inference draw costs $260/year. Over a 3-year lifespan, that’s $390 in power savings — meaningful at the budget tier.
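The arithmetic behind those electricity figures, as a sketch assuming continuous 24/7 draw at $0.15/kWh (the article rounds to $130, $260, and $390):

```python
def annual_power_cost(watts, usd_per_kwh=0.15):
    # Electricity cost of a box drawing `watts` continuously for one year.
    hours_per_year = 24 * 365
    return watts / 1000 * hours_per_year * usd_per_kwh

cost_4060ti = annual_power_cost(100)         # ~$131/year at ~100W inference draw
cost_3090 = annual_power_cost(200)           # ~$263/year at ~200W
savings_3yr = 3 * (cost_3090 - cost_4060ti)  # ~$394 over a 3-year lifespan
print(round(cost_4060ti), round(cost_3090), round(savings_3yr))
```

Real savings will be lower if the server idles most of the day, since idle draw on both cards is far below inference draw.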
The practical recommendation: buy the 4060 Ti 16GB if you primarily run 7B–8B models in Ollama (llama3, mistral, codellama 7B, phi-3), want a new card with warranty, and value low power draw. If you need 13B+ models at usable speed, stretch to a used 3090.
Budget Experiment: Intel Arc B580 — Cheapest Entry Point (With Caveats)
The Intel Arc B580 at ~$250 is the cheapest GPU with enough VRAM to run 7B–8B Ollama models. But “cheapest” and “best” are different things, and the B580 comes with significant caveats that you need to understand before buying.
Specs: 12 GB GDDR6 · 192-bit bus · 456 GB/s bandwidth · 20 Xe-cores · 150W TDP
Ollama Performance (SYCL backend, Linux):
- ollama run llama3 (8B): ~25–35 tok/s generation (variable)
- ollama run mistral (7B): ~28–38 tok/s generation
- ollama run codellama:13b: Does not fit in 12 GB VRAM
- Any 13B+ model: Requires CPU offload — unusable speed
The honest assessment: Ollama on Intel Arc is not a first-class experience. Ollama’s Intel GPU support relies on the SYCL backend through llama.cpp, which is functional but experimental. On NVIDIA, you install Ollama and run models. On Intel Arc, you need to install oneAPI, compile llama.cpp with SYCL support, ensure the correct drivers are loaded, and troubleshoot backend detection issues. The Ollama project has made progress on streamlining this, but in March 2026, it still requires more Linux command-line comfort than the CUDA path.
Performance is roughly 3x slower than a CUDA GPU with comparable bandwidth. The Arc B580’s 456 GB/s memory bandwidth is actually higher than the RTX 4060 Ti’s 288 GB/s, but the immature SYCL inference kernels waste much of that theoretical advantage. Where the 4060 Ti gets 89 tok/s on llama3 8B with its narrower bus, the B580 manages only 25–35 tok/s. The software stack matters enormously.
The 12 GB VRAM is the other limitation. It fits 7B–8B models at Q4_K_M quantization, but 13B models exceed what 12 GB can hold. You’re limited to the smaller Ollama model library: llama3 8B, mistral 7B, codellama 7B, phi-3, and gemma 7B. For many users, that’s enough. But if you want to grow into larger models, 12 GB is a dead end.
Who should buy it: The Arc B580 makes sense if you’re budget-constrained under $300, comfortable with Linux, willing to troubleshoot experimental software, and primarily want to run 7B–8B models. It’s a tinkerer’s card — ideal for someone who enjoys the process of getting things working, not someone who wants ollama run to work instantly. If you can stretch to ~$450 for a 4060 Ti 16GB, the CUDA experience is dramatically smoother.
Overkill but Fastest: NVIDIA RTX 4090
The RTX 4090 is the fastest 24 GB card for Ollama and the one you buy if inference speed is the only thing that matters to you.
Specs: 24 GB GDDR6X · 384-bit bus · 1,008 GB/s bandwidth · 16,384 CUDA cores · 450W TDP
Ollama Performance:
- ollama run llama3: ~128 tok/s generation, ~7,000–9,100 tok/s prompt processing
- ollama run codellama:13b: ~110 tok/s generation
- ollama run deepseek-coder:33b-instruct-q4_K_M: ~25–30 tok/s
- ollama run llama3:70b-q4_K_M: Requires partial offload — ~5–8 tok/s
The RTX 4090 delivers 15–20% faster token generation than the RTX 3090 across all Ollama models. More importantly, its prompt processing speed is substantially faster, which means initial response time is shorter when you send a long prompt or paste code for codellama to analyze. If you’re running Ollama as a coding assistant with large context windows, you’ll feel the difference on prompt ingestion.
The 4090 also handles concurrent users better than the 3090. If you’re running Open WebUI as a front-end to Ollama and multiple people in your household use it, the extra CUDA cores and bandwidth keep response times consistent under multi-request load. The 3090 handles one user beautifully but slows noticeably under concurrent inference.
The problem is the price. The RTX 4090 was discontinued in late 2024, and remaining stock sells for ~$2,755 — nearly 3x a used 3090. For a single-user Ollama setup, paying 3x more for 15–20% faster generation doesn’t make financial sense. The 4090 only justifies its price if you have a multi-user Ollama server or you specifically need the fastest prompt processing for very long contexts.
At ~$2,755 and climbing, the RTX 4090 is increasingly a card you buy because you want the best, not because it makes economic sense. For most Ollama users, the used 3090 at ~$800–1,050 delivers 85% of the experience for a third of the cost. Check our VRAM guide to see if 24 GB is even enough for your target models.
ROCm Option: AMD Radeon RX 7900 XTX — Honest Assessment
The RX 7900 XTX has 24 GB VRAM and Ollama officially supports it through the ROCm backend. On paper, this should be competitive with the RTX 3090. In practice, the software gap is the story.
Specs: 24 GB GDDR6 · 384-bit bus · 960 GB/s bandwidth · 6,144 stream processors · 355W TDP
Ollama Performance (ROCm, Linux):
- ollama run llama3: ~37 tok/s generation
- ollama run codellama:13b: ~32 tok/s generation
- ollama run deepseek-coder:33b-instruct-q4_K_M: Fits, ~12–15 tok/s
- ollama run llama3:70b-q4_K_M: Requires partial offload — ~3–4 tok/s
Let me be direct about what those numbers mean. The RX 7900 XTX has 960 GB/s memory bandwidth — within 3% of the RTX 4090’s 1,008 GB/s. Yet it generates llama3 8B at 37 tok/s versus the 4090’s 128 tok/s. That 3.5x gap is entirely software. ROCm’s inference kernels for GGUF models (the format Ollama uses) are significantly less optimized than CUDA’s. The hardware is not the problem.
Ollama’s ROCm support is officially listed and does work. You install Ollama on a Linux machine with ROCm drivers, and ollama run llama3 runs on the GPU. The experience is more reliable than Intel SYCL — this isn’t experimental. But the performance gap versus CUDA makes the value proposition difficult.
At ~$1,335 new, the 7900 XTX costs more than a used RTX 3090 at ~$800–1,050 that generates tokens 3x faster on the same Ollama models. The math doesn’t work unless you have a specific reason to avoid NVIDIA: open-source principle, existing AMD ecosystem, or a use case where ROCm performance is adequate (37 tok/s on 8B models is still usable for interactive chat).
ROCm on Windows: Not supported in Ollama. If you run Windows, AMD GPUs will use CPU inference, which defeats the purpose entirely. ROCm Ollama acceleration requires Linux.
The honest recommendation: If you already own a 7900 XTX, use it with Ollama on Linux — 37 tok/s on 8B models is fine for personal use. If you’re buying specifically for Ollama, buy a used RTX 3090 instead. The CUDA ecosystem advantage is not theoretical — it’s a 3x real-world performance gap on the models you actually run. See our full NVIDIA vs AMD for LLMs comparison for the detailed breakdown.
Ollama-Specific GPU Considerations
Backend Support: CUDA vs ROCm vs SYCL
Ollama supports three GPU backends, and they are not equal:
| Backend | GPU Support | OS Support | Maturity | Relative Performance |
|---|---|---|---|---|
| CUDA | NVIDIA (all modern) | Windows, Linux | Production | Baseline (100%) |
| ROCm | AMD RDNA 3 (7900 XTX, 7900 XT, 7800 XT) | Linux only | Stable | ~30–35% of CUDA |
| SYCL | Intel Arc (A770, A750, B580) | Linux only | Experimental | ~25–30% of CUDA |
CUDA is the default and by far the most optimized. When Ollama detects an NVIDIA GPU, it uses CUDA automatically. No environment variables, no driver flags, no compilation steps. This matters for home lab use — you want to install Ollama, pull a model, and start chatting.
ROCm works and is officially supported, but requires Linux and delivers roughly a third of CUDA’s throughput on equivalent models. SYCL is the least mature: it works for basic inference on Intel Arc GPUs but requires manual setup and delivers the lowest performance of the three backends.
GGUF Model Sizes and VRAM Requirements
Ollama uses GGUF quantized models. Here’s how much VRAM popular models actually consume:
| Model | Q4_K_M Size | Min VRAM (with context) | Fits 12 GB? | Fits 16 GB? | Fits 24 GB? |
|---|---|---|---|---|---|
| llama3 8B | ~4.9 GB | ~6.5 GB | Yes | Yes | Yes |
| mistral 7B | ~4.4 GB | ~6.0 GB | Yes | Yes | Yes |
| codellama 7B | ~4.2 GB | ~5.8 GB | Yes | Yes | Yes |
| codellama 13B | ~7.9 GB | ~10.5 GB | No | Yes | Yes |
| llama3 13B | ~7.9 GB | ~10.5 GB | No | Yes | Yes |
| deepseek-coder 33B | ~19.8 GB | ~22.5 GB | No | No | Yes |
| codellama 34B | ~20.2 GB | ~23.0 GB | No | No | Yes |
| llama3 70B | ~38.5 GB | ~42 GB | No | No | No (needs 48 GB+) |
The “Min VRAM” column includes overhead for KV cache and Ollama’s runtime. This is the actual VRAM consumption you’ll see with ollama ps, not just the model file size.
Key takeaway: 12 GB (Arc B580) limits you to 7B–8B models. 16 GB (4060 Ti) opens up 13B. 24 GB (3090, 4090, 7900 XTX) unlocks 33B–34B models. For the full VRAM analysis, see our dedicated guide.
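A quick way to apply the table programmatically: a fit check keyed off the min-VRAM column. The numbers are the article's measured figures copied from the table above, not a general formula, and model tags are shortened for readability:

```python
# Min VRAM in GB at Q4_K_M, from the table above
# (weights + KV cache + Ollama runtime, as reported by ollama ps).
MIN_VRAM_GB = {
    "llama3:8b": 6.5,
    "mistral:7b": 6.0,
    "codellama:13b": 10.5,
    "deepseek-coder:33b": 22.5,
    "codellama:34b": 23.0,
    "llama3:70b": 42.0,
}

def fits_in_vram(model, vram_gb):
    # True if the whole model should load onto the GPU without offloading.
    return MIN_VRAM_GB[model] <= vram_gb

print(fits_in_vram("codellama:13b", 16))       # fits a 16 GB card
print(fits_in_vram("deepseek-coder:33b", 16))  # does not fit 16 GB
print(fits_in_vram("llama3:70b", 24))          # does not fit any single 24 GB card
```

A card right at the limit (e.g. 13B on 12 GB) can still tip into offload once context grows, which is why the table marks borderline cases as "No."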
GPU Offloading in Ollama
When a model exceeds your VRAM, Ollama doesn’t fail — it offloads layers to CPU/RAM. This sounds helpful but the performance cliff is brutal. GPU memory bandwidth (936 GB/s for a 3090) drops to PCIe bandwidth (~25 GB/s for PCIe 4.0 x16). That’s a 37x slowdown on offloaded layers.
In practice:
- 0% offload (model fits in VRAM): Full speed, 80–130 tok/s on 8B models
- 10% offload: Speed drops by roughly 40–50%
- 25% offload: Speed drops by 70–80%, down to 10–20 tok/s
- 50%+ offload: Under 5 tok/s — barely usable for interactive chat
The right strategy is to buy enough VRAM for the models you want to run, not to plan on offloading. If you need 13B models, don’t buy a 12 GB card and hope offloading works. Buy 16 GB or 24 GB.
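To make the cliff concrete, here's a toy calculator that applies those empirical brackets to a base generation speed. The brackets are rough observations from the list above, not a performance model, and the exact multipliers in between them are assumptions:

```python
def offloaded_speed(base_tok_s, offload_frac):
    # Expected generation speed after partial CPU offload, using the
    # rough empirical brackets described above.
    if offload_frac == 0:
        return base_tok_s          # model fully in VRAM: full speed
    if offload_frac <= 0.10:
        return base_tok_s * 0.55   # ~40-50% drop
    if offload_frac <= 0.25:
        return base_tok_s * 0.25   # ~70-80% drop
    return base_tok_s * 0.04       # 50%+ offload: under 5 tok/s territory

print(round(offloaded_speed(112, 0.10)))  # a 3090's 112 tok/s falls to ~62
print(round(offloaded_speed(112, 0.25)))  # ~28
print(round(offloaded_speed(112, 0.50)))  # ~4 -- no longer interactive
```

The non-linearity is the point: a model that is 90% on the GPU does not run at 90% speed, because the slow offloaded layers dominate per-token latency.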
Multi-GPU Support
Ollama supports multi-GPU on NVIDIA CUDA. If you install two RTX 3090s, Ollama automatically splits model layers across both cards, giving you 48 GB of effective VRAM. This enables 70B models at full GPU speed — something no single 24 GB card can do.
Multi-GPU is not supported on ROCm or SYCL in Ollama as of March 2026. If you want multi-GPU Ollama, NVIDIA is your only option.
For multi-GPU, the cards don’t need to be identical, but matching cards simplify load balancing. Two RTX 3090s at ~$1,600–2,100 total give you 48 GB of VRAM — more than a single RTX 5090’s 32 GB at ~$4,000+ street price, for roughly half the cost. If running 70B models in Ollama is your goal, dual 3090s is the value play.
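The dual-3090 math in one place, as a sketch using this guide's numbers ($1,850 assumes the midpoint of the ~$1,600–2,100 range):

```python
# Dual RTX 3090 sanity check against a single RTX 4090.
effective_vram_gb = 2 * 24   # Ollama splits model layers across both cards
llama3_70b_min_gb = 42       # Q4_K_M min VRAM from the table above
dual_3090_cost = 1850        # assumed midpoint of the used-pair range
rtx_4090_cost = 2755

print(effective_vram_gb >= llama3_70b_min_gb)  # 70B runs fully on GPU
print(dual_3090_cost < rtx_4090_cost)          # and the pair costs less
```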
Bottom Line
For most home lab users running Ollama, the used RTX 3090 at ~$800–1,050 is the right GPU. Twenty-four gigabytes of VRAM with full CUDA support means every Ollama model up to 33B parameters runs at interactive speed with zero configuration. ollama run llama3 generates at 112 tok/s. ollama run codellama:13b runs at 85 tok/s. It just works.
If you want the cheapest new CUDA card that handles 7B–8B models well, the RTX 4060 Ti 16GB at ~$450 is excellent for smaller models with the lowest power draw in the group.
The Intel Arc B580 at ~$250 is the absolute budget floor for Ollama, but the experimental SYCL backend means significant setup friction and 3x slower performance than CUDA. Only for tinkerers on Linux.
The RTX 4090 at ~$2,755 is the fastest option but nearly impossible to justify at 3x the price of a used 3090 for 15–20% more speed.
The RX 7900 XTX at ~$1,335 works with Ollama via ROCm on Linux, but 3x slower inference than a cheaper used 3090 makes it hard to recommend for this specific use case.
Whatever GPU you choose, pair it with enough system RAM (32 GB minimum for comfortable Ollama usage) and a fast NVMe drive for model storage. For complete build ideas, see our best mini PC for local AI guide, and check how much VRAM you need for LLMs to match your GPU to the models you want to run.
NVIDIA RTX 3090 (Used)
~$800–1,050
- VRAM
- 24 GB GDDR6X
- Bandwidth
- 936 GB/s
- TDP
- 350W
- Ollama Backend
- CUDA (native)
24 GB VRAM handles every Ollama model up to roughly 34B at Q4 quantization with full CUDA acceleration. The best dollar-per-token deal for Ollama users in 2026.
NVIDIA RTX 4060 Ti 16GB
~$450
- VRAM
- 16 GB GDDR6
- Bandwidth
- 288 GB/s
- TDP
- 165W
- Ollama Backend
- CUDA (native)
The cheapest new CUDA GPU that runs 13B Ollama models in VRAM. Low power draw makes it ideal for always-on inference servers running 7B–8B models.
Intel Arc B580
~$250
- VRAM
- 12 GB GDDR6
- Bandwidth
- 456 GB/s
- TDP
- 150W
- Ollama Backend
- SYCL (experimental)
The cheapest 12 GB GPU on the market. Ollama support via SYCL backend is experimental but functional for 7B models on Linux. Not for beginners.
NVIDIA RTX 4090
~$2,755
- VRAM
- 24 GB GDDR6X
- Bandwidth
- 1,008 GB/s
- TDP
- 450W
- Ollama Backend
- CUDA (native)
The fastest 24 GB GPU for Ollama. Maximum tok/s on every model, but at ~$2,755 for remaining post-discontinuation stock, it’s hard to justify over a used RTX 3090.
AMD Radeon RX 7900 XTX
~$1,300
- VRAM
- 24 GB GDDR6
- Bandwidth
- 960 GB/s
- TDP
- 355W
- Ollama Backend
- ROCm
24 GB VRAM with ROCm support in Ollama. The hardware is competitive but ROCm inference speed lags CUDA significantly on identical models.
Frequently Asked Questions
Does Ollama support AMD GPUs?
Yes, via the ROCm backend on Linux only. RDNA 3 cards like the RX 7900 XTX are officially supported, but expect roughly a third of CUDA throughput on the same models. ROCm acceleration is not available in Ollama on Windows.
Does Ollama work with Intel Arc GPUs?
Only through the experimental SYCL backend on Linux. It works for 7B–8B models on cards like the Arc B580, but setup requires oneAPI and command-line troubleshooting, and performance is roughly 3x slower than CUDA.
How much VRAM do I need for Ollama?
At Q4_K_M quantization, 12 GB covers 7B–8B models, 16 GB covers 13B, and 24 GB covers 33B–34B. 70B models need roughly 42 GB, which means a multi-GPU setup.
Can I use multiple GPUs with Ollama?
Yes, on NVIDIA CUDA only. Ollama automatically splits model layers across cards, so two 24 GB RTX 3090s give you 48 GB of effective VRAM, enough for 70B models. Multi-GPU is not supported on the ROCm or SYCL backends.
Why is my Ollama model running slowly?
The usual cause is partial CPU offload: the model doesn't fit in VRAM, so Ollama moves layers to system RAM and speed collapses. Run ollama ps to see how much of the model is on the GPU, and switch to a smaller model or quantization that fits entirely in VRAM.