Best GPU for Ollama in 2026: 5 Picks That Actually Work
NVIDIA RTX 3090 (Used)
~$800–1,050
24 GB VRAM with full CUDA support means every Ollama model just works. At ~$800–1,050 used, nothing else comes close on value.
| | ★ RTX 3090 (Used) Our Pick | RTX 4060 Ti 16GB Best New Card | Intel Arc B580 Budget Experiment | RTX 4090 Fastest | RX 7900 XTX Best AMD |
|---|---|---|---|---|---|
| VRAM | 24 GB GDDR6X | 16 GB GDDR6 | 12 GB GDDR6 | 24 GB GDDR6X | 24 GB GDDR6 |
| Bandwidth | 936 GB/s | 288 GB/s | 456 GB/s | 1,008 GB/s | 960 GB/s |
| Ollama Backend | CUDA (native) | CUDA (native) | SYCL (experimental) | CUDA (native) | ROCm |
| llama3 8B tok/s | ~112 | ~89 | ~25–35* | ~128 | ~37 |
| codellama 13B tok/s | ~85 | ~14 | Does not fit | ~110 | ~32 |
| Price | ~$800–1,050 | ~$450 | ~$250 | ~$2,755 | ~$1,300 |
| | Check Price → | Check Price → | Check Price → | Check Price → | Check Price → |
Ollama makes running local LLMs simple — ollama run llama3 and you’re generating text. But that simplicity hides a critical dependency: your GPU determines whether you get 130 tokens per second or 3. Pick the wrong card and Ollama silently offloads model layers to system RAM, dropping you from interactive chat speed to something that feels like waiting for a dial-up page to load.
The GPU choice for Ollama is more nuanced than for general LLM inference because Ollama has specific backend requirements. CUDA works out of the box on NVIDIA cards. ROCm works on AMD but only on Linux and with significantly lower performance. Intel’s SYCL backend is experimental at best. Your GPU purchase is also a backend decision.
This guide covers five GPUs that actually work with Ollama in March 2026, with real ollama run performance numbers on the models people actually use: llama3, mistral, codellama, and deepseek-coder. For a broader look at GPUs for all local inference frameworks, see our best GPU for local LLMs guide.
Our Pick: NVIDIA RTX 3090 (Used) — Best Overall for Ollama
The used RTX 3090 is the best GPU for most Ollama users. It has 24 GB of VRAM, full CUDA support, and costs a fraction of what you’d pay for comparable NVIDIA cards still in production.
Specs: 24 GB GDDR6X · 384-bit bus · 936 GB/s bandwidth · 10,496 CUDA cores · 350W TDP
Ollama Performance:
- ollama run llama3: ~112 tok/s generation
- ollama run codellama:13b: ~85 tok/s generation
- ollama run deepseek-coder:33b-instruct-q4_K_M: Fits in 24 GB, ~20–25 tok/s
- ollama run llama3:70b-q4_K_M: Requires partial offload — ~4–6 tok/s (not recommended)
Ollama’s CUDA backend detects the RTX 3090 automatically. No configuration, no environment variables, no driver debugging. Run ollama run llama3 and it loads the model into VRAM and starts generating. This “it just works” factor matters more than benchmarks for most home lab users who want a local AI tool, not a driver engineering project.
The 24 GB VRAM is the key specification. At Q4_K_M quantization — Ollama’s default for most models — 24 GB fits every model up to the mid-30B range with room for KV cache and context. That covers llama3 8B, mistral 7B, codellama 7B/13B/34B, deepseek-coder 33B, and command-r at 35B. The only popular Ollama models that won’t fit are 70B variants, which need roughly 42 GB of VRAM — in practice, a dual-GPU setup.
At ~$800–1,050 on the used market, the RTX 3090 costs roughly a third of a discontinued RTX 4090 at ~$2,755. The speed difference is only 15–20%, and on most Ollama models, both cards feel instant. Anything above 30 tok/s is faster than you can read.
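The value argument is easy to put in numbers. A back-of-the-envelope sketch using the prices and llama3 8B speeds above ($925 is an assumed midpoint of the 3090's used range, not a quoted price):

```python
# Rough dollars-per-throughput comparison from this guide's numbers.
cards = {
    "RTX 3090 (used)": (925, 112),   # (price USD, llama3 8B tok/s)
    "RTX 4090": (2755, 128),
}

def dollars_per_tok_s(price_usd, tok_s):
    # Price paid per token/second of llama3 8B generation throughput.
    return price_usd / tok_s

for name, (price, speed) in cards.items():
    print(f"{name}: ${dollars_per_tok_s(price, speed):.2f} per tok/s")
# The 3090 lands near $8.26 per tok/s, the 4090 near $21.52.
```

Roughly 2.6x more money per unit of generation speed for the 4090, which is the whole case for buying used.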
Ollama multi-GPU note: If you can find two RTX 3090s at ~$800 each, Ollama will automatically split model layers across both cards. That gives you 48 GB of effective VRAM — enough for 70B models at full GPU speed. Two 3090s at ~$1,600–2,100 total is a better deal than a single 4090 at ~$2,755, with double the VRAM and roughly matching aggregate throughput.
Buying used tips: Check VRAM junction temps with GPU-Z (they should stay under 100°C under load). Run a stress test and verify there are no artifacts. Buy from sellers with at least 30-day returns. The RTX 3090 launched during the mining era, but GPU silicon doesn’t degrade from compute — the main risk is worn-out fans, which are cheap to replace.
For the detailed performance comparison, see RTX 3090 vs 4090 for LLMs.
Best New Card: NVIDIA RTX 4060 Ti 16GB
The RTX 4060 Ti 16GB is the best new GPU you can buy specifically for Ollama if your budget is under $500. Full CUDA, 16 GB VRAM, full warranty, and a 165W TDP that makes it ideal for always-on servers.
Specs: 16 GB GDDR6 · 128-bit bus · 288 GB/s bandwidth · 4,352 CUDA cores · 165W TDP
Ollama Performance:
- ollama run llama3: ~89 tok/s generation
- ollama run mistral: ~92 tok/s generation
- ollama run codellama:13b: ~14 tok/s generation (bandwidth-limited)
- ollama run codellama:34b: Does not fit — heavy offloading, ~2–3 tok/s
On 7B–8B models — which represent the majority of what people actually run in Ollama — the 4060 Ti 16GB delivers excellent performance. 89 tok/s on llama3 8B is fast enough that the output streams faster than you can read. Mistral 7B generates at 92 tok/s. For a single-user Ollama setup running smaller models, this card is genuinely great.
The 288 GB/s bandwidth becomes the bottleneck on larger models. Codellama 13B fits in 16 GB VRAM but generates at only ~14 tok/s because each token requires reading the entire model weights from memory. On a 3090 with 936 GB/s, that same 13B model runs at 85 tok/s. If you plan to regularly run 13B+ models, the 4060 Ti is not the right card.
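This bottleneck is easy to sanity-check: because every generated token streams the full quantized weights out of VRAM once, bandwidth divided by model size gives a hard ceiling on tok/s. A rough sketch using the ~7.9 GB Q4_K_M size of codellama 13B from the VRAM table later in this guide (real throughput lands below the ceiling due to kernel and scheduling overhead):

```python
def roofline_tok_s(bandwidth_gb_s, model_size_gb):
    # Hard upper bound on generation speed: each token must read the
    # full set of quantized weights from VRAM once.
    return bandwidth_gb_s / model_size_gb

Q4_13B_GB = 7.9  # codellama 13B at Q4_K_M

print(roofline_tok_s(936, Q4_13B_GB))  # RTX 3090 ceiling ~118 tok/s (measured: ~85)
print(roofline_tok_s(288, Q4_13B_GB))  # RTX 4060 Ti ceiling ~36 tok/s (measured: ~14)
```

The 4060 Ti's 288 GB/s caps 13B generation in the mid-30s before any software inefficiency, which is why no driver tweak will make it fast on that model.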
Where the 4060 Ti genuinely excels is power efficiency for always-on Ollama servers. At ~100W actual draw during inference, running 24/7 at $0.15/kWh costs roughly $130/year in electricity. A 3090 at ~200W inference draw costs $260/year. Over a 3-year lifespan, that’s $390 in power savings — meaningful at the budget tier.
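The arithmetic behind those electricity figures, as a sketch assuming continuous 24/7 draw at $0.15/kWh (the article rounds to $130, $260, and $390):

```python
def annual_power_cost(watts, usd_per_kwh=0.15):
    # Electricity cost of a box drawing `watts` continuously for one year.
    hours_per_year = 24 * 365
    return watts / 1000 * hours_per_year * usd_per_kwh

cost_4060ti = annual_power_cost(100)         # ~$131/year at ~100W inference draw
cost_3090 = annual_power_cost(200)           # ~$263/year at ~200W
savings_3yr = 3 * (cost_3090 - cost_4060ti)  # ~$394 over a 3-year lifespan
print(round(cost_4060ti), round(cost_3090), round(savings_3yr))
```

Real savings will be lower if the server idles most of the day, since idle draw on both cards is far below inference draw.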
The practical recommendation: buy the 4060 Ti 16GB if you primarily run 7B–8B models in Ollama (llama3, mistral, codellama 7B, phi-3), want a new card with warranty, and value low power draw. If you need 13B+ models at usable speed, stretch to a used 3090.
Budget Experiment: Intel Arc B580 — Cheapest Entry Point (With Caveats)
The Intel Arc B580 at ~$250 is the cheapest GPU with enough VRAM to run 7B–8B Ollama models. But “cheapest” and “best” are different things, and the B580 comes with significant caveats that you need to understand before buying.
Specs: 12 GB GDDR6 · 192-bit bus · 456 GB/s bandwidth · 20 Xe-cores · 150W TDP
Ollama Performance (SYCL backend, Linux):
- ollama run llama3 (8B): ~25–35 tok/s generation (variable)
- ollama run mistral (7B): ~28–38 tok/s generation
- ollama run codellama:13b: Does not fit in 12 GB VRAM
- Any 13B+ model: Requires CPU offload — unusable speed
The honest assessment: Ollama on Intel Arc is not a first-class experience. Ollama’s Intel GPU support relies on the SYCL backend through llama.cpp, which is functional but experimental. On NVIDIA, you install Ollama and run models. On Intel Arc, you need to install oneAPI, compile llama.cpp with SYCL support, ensure the correct drivers are loaded, and troubleshoot backend detection issues. The Ollama project has made progress on streamlining this, but in March 2026, it still requires more Linux command-line comfort than the CUDA path.
Performance is roughly 3x slower than a CUDA GPU with comparable bandwidth. The Arc B580’s 456 GB/s memory bandwidth is actually higher than the RTX 4060 Ti’s 288 GB/s, but the immature SYCL inference kernels waste much of that theoretical advantage. Where the 4060 Ti gets 89 tok/s on llama3 8B with its narrower bus, the B580 manages only 25–35 tok/s. The software stack matters enormously.
The 12 GB VRAM is the other limitation. It fits 7B–8B models at Q4_K_M quantization, but 13B models exceed what 12 GB can hold. You’re limited to the smaller Ollama model library: llama3 8B, mistral 7B, codellama 7B, phi-3, and gemma 7B. For many users, that’s enough. But if you want to grow into larger models, 12 GB is a dead end.
Who should buy it: The Arc B580 makes sense if you’re budget-constrained under $300, comfortable with Linux, willing to troubleshoot experimental software, and primarily want to run 7B–8B models. It’s a tinkerer’s card — ideal for someone who enjoys the process of getting things working, not someone who wants ollama run to work instantly. If you can stretch to ~$450 for a 4060 Ti 16GB, the CUDA experience is dramatically smoother.
Overkill but Fastest: NVIDIA RTX 4090
The RTX 4090 is the fastest 24 GB card for Ollama and the one you buy if inference speed is the only thing that matters to you.
Specs: 24 GB GDDR6X · 384-bit bus · 1,008 GB/s bandwidth · 16,384 CUDA cores · 450W TDP
Ollama Performance:
- ollama run llama3: ~128 tok/s generation, ~7,000–9,100 tok/s prompt processing
- ollama run codellama:13b: ~110 tok/s generation
- ollama run deepseek-coder:33b-instruct-q4_K_M: ~25–30 tok/s
- ollama run llama3:70b-q4_K_M: Requires partial offload — ~5–8 tok/s
The RTX 4090 delivers 15–20% faster token generation than the RTX 3090 across all Ollama models. More importantly, its prompt processing speed is substantially faster, which means initial response time is shorter when you send a long prompt or paste code for codellama to analyze. If you’re running Ollama as a coding assistant with large context windows, you’ll feel the difference on prompt ingestion.
The 4090 also handles concurrent users better than the 3090. If you’re running Open WebUI as a front-end to Ollama and multiple people in your household use it, the extra CUDA cores and bandwidth keep response times consistent under multi-request load. The 3090 handles one user beautifully but slows noticeably under concurrent inference.
The problem is the price. The RTX 4090 was discontinued in late 2024, and remaining stock sells for ~$2,755 — nearly 3x a used 3090. For a single-user Ollama setup, paying 3x more for 15–20% faster generation doesn’t make financial sense. The 4090 only justifies its price if you have a multi-user Ollama server or you specifically need the fastest prompt processing for very long contexts.
At ~$2,755 and climbing, the RTX 4090 is increasingly a card you buy because you want the best, not because it makes economic sense. For most Ollama users, the used 3090 at ~$800–1,050 delivers 85% of the experience for a third of the cost. Check our VRAM guide to see if 24 GB is even enough for your target models.
ROCm Option: AMD Radeon RX 7900 XTX — Honest Assessment
The RX 7900 XTX has 24 GB VRAM and Ollama officially supports it through the ROCm backend. On paper, this should be competitive with the RTX 3090. In practice, the software gap is the story.
Specs: 24 GB GDDR6 · 384-bit bus · 960 GB/s bandwidth · 6,144 stream processors · 355W TDP
Ollama Performance (ROCm, Linux):
- ollama run llama3: ~37 tok/s generation
- ollama run codellama:13b: ~32 tok/s generation
- ollama run deepseek-coder:33b-instruct-q4_K_M: Fits, ~12–15 tok/s
- ollama run llama3:70b-q4_K_M: Requires partial offload — ~3–4 tok/s
Let me be direct about what those numbers mean. The RX 7900 XTX has 960 GB/s memory bandwidth — within 3% of the RTX 4090’s 1,008 GB/s. Yet it generates llama3 8B at 37 tok/s versus the 4090’s 128 tok/s. That 3.5x gap is entirely software. ROCm’s inference kernels for GGUF models (the format Ollama uses) are significantly less optimized than CUDA’s. The hardware is not the problem.
Ollama’s ROCm support is officially listed and does work. You install Ollama on a Linux machine with ROCm drivers, and ollama run llama3 runs on the GPU. The experience is more reliable than Intel SYCL — this isn’t experimental. But the performance gap versus CUDA makes the value proposition difficult.
At ~$1,335 new, the 7900 XTX costs more than a used RTX 3090 at ~$800–1,050 that generates tokens 3x faster on the same Ollama models. The math doesn’t work unless you have a specific reason to avoid NVIDIA: open-source principle, existing AMD ecosystem, or a use case where ROCm performance is adequate (37 tok/s on 8B models is still usable for interactive chat).
ROCm on Windows: Not supported in Ollama. If you run Windows, AMD GPUs will use CPU inference, which defeats the purpose entirely. ROCm Ollama acceleration requires Linux.
The honest recommendation: If you already own a 7900 XTX, use it with Ollama on Linux — 37 tok/s on 8B models is fine for personal use. If you’re buying specifically for Ollama, buy a used RTX 3090 instead. The CUDA ecosystem advantage is not theoretical — it’s a 3x real-world performance gap on the models you actually run. See our full NVIDIA vs AMD for LLMs comparison for the detailed breakdown.
Ollama-Specific GPU Considerations
Backend Support: CUDA vs ROCm vs SYCL
Ollama supports three GPU backends, and they are not equal:
| Backend | GPU Support | OS Support | Maturity | Relative Performance |
|---|---|---|---|---|
| CUDA | NVIDIA (all modern) | Windows, Linux | Production | Baseline (100%) |
| ROCm | AMD RDNA 3 (7900 XTX, 7900 XT, 7800 XT) | Linux only | Stable | ~30–35% of CUDA |
| SYCL | Intel Arc (A770, A750, B580) | Linux only | Experimental | ~25–30% of CUDA |
CUDA is the default and by far the most optimized. When Ollama detects an NVIDIA GPU, it uses CUDA automatically. No environment variables, no driver flags, no compilation steps. This matters for home lab use — you want to install Ollama, pull a model, and start chatting.
ROCm works and is officially supported, but requires Linux and delivers roughly a third of CUDA’s throughput on equivalent models. SYCL is the least mature: it works for basic inference on Intel Arc GPUs but requires manual setup and delivers the lowest performance of the three backends.
GGUF Model Sizes and VRAM Requirements
Ollama uses GGUF quantized models. Here’s how much VRAM popular models actually consume:
| Model | Q4_K_M Size | Min VRAM (with context) | Fits 12 GB? | Fits 16 GB? | Fits 24 GB? |
|---|---|---|---|---|---|
| llama3 8B | ~4.9 GB | ~6.5 GB | Yes | Yes | Yes |
| mistral 7B | ~4.4 GB | ~6.0 GB | Yes | Yes | Yes |
| codellama 7B | ~4.2 GB | ~5.8 GB | Yes | Yes | Yes |
| codellama 13B | ~7.9 GB | ~10.5 GB | No | Yes | Yes |
| llama3 13B | ~7.9 GB | ~10.5 GB | No | Yes | Yes |
| deepseek-coder 33B | ~19.8 GB | ~22.5 GB | No | No | Yes |
| codellama 34B | ~20.2 GB | ~23.0 GB | No | No | Yes |
| llama3 70B | ~38.5 GB | ~42 GB | No | No | No (needs 48 GB+) |
The “Min VRAM” column includes overhead for KV cache and Ollama’s runtime. This is the actual VRAM consumption you’ll see with ollama ps, not just the model file size.
Key takeaway: 12 GB (Arc B580) limits you to 7B–8B models. 16 GB (4060 Ti) opens up 13B. 24 GB (3090, 4090, 7900 XTX) unlocks 33B–34B models. For the full VRAM analysis, see our dedicated guide.
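A quick way to apply the table programmatically: a fit check keyed off the min-VRAM column. The numbers are the article's measured figures copied from the table above, not a general formula, and model tags are shortened for readability:

```python
# Min VRAM in GB at Q4_K_M, from the table above
# (weights + KV cache + Ollama runtime, as reported by ollama ps).
MIN_VRAM_GB = {
    "llama3:8b": 6.5,
    "mistral:7b": 6.0,
    "codellama:13b": 10.5,
    "deepseek-coder:33b": 22.5,
    "codellama:34b": 23.0,
    "llama3:70b": 42.0,
}

def fits_in_vram(model, vram_gb):
    # True if the whole model should load onto the GPU without offloading.
    return MIN_VRAM_GB[model] <= vram_gb

print(fits_in_vram("codellama:13b", 16))       # fits a 16 GB card
print(fits_in_vram("deepseek-coder:33b", 16))  # does not fit 16 GB
print(fits_in_vram("llama3:70b", 24))          # does not fit any single 24 GB card
```

A card right at the limit (e.g. 13B on 12 GB) can still tip into offload once context grows, which is why the table marks borderline cases as "No."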
GPU Offloading in Ollama
When a model exceeds your VRAM, Ollama doesn’t fail — it offloads layers to CPU/RAM. This sounds helpful but the performance cliff is brutal. GPU memory bandwidth (936 GB/s for a 3090) drops to PCIe bandwidth (~25 GB/s for PCIe 4.0 x16). That’s a 37x slowdown on offloaded layers.
In practice:
- 0% offload (model fits in VRAM): Full speed, 80–130 tok/s on 8B models
- 10% offload: Speed drops by roughly 40–50%
- 25% offload: Speed drops by 70–80%, down to 10–20 tok/s
- 50%+ offload: Under 5 tok/s — barely usable for interactive chat
The right strategy is to buy enough VRAM for the models you want to run, not to plan on offloading. If you need 13B models, don’t buy a 12 GB card and hope offloading works. Buy 16 GB or 24 GB.
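To make the cliff concrete, here's a toy calculator that applies those empirical brackets to a base generation speed. The brackets are rough observations from the list above, not a performance model, and the exact multipliers in between them are assumptions:

```python
def offloaded_speed(base_tok_s, offload_frac):
    # Expected generation speed after partial CPU offload, using the
    # rough empirical brackets described above.
    if offload_frac == 0:
        return base_tok_s          # model fully in VRAM: full speed
    if offload_frac <= 0.10:
        return base_tok_s * 0.55   # ~40-50% drop
    if offload_frac <= 0.25:
        return base_tok_s * 0.25   # ~70-80% drop
    return base_tok_s * 0.04       # 50%+ offload: under 5 tok/s territory

print(round(offloaded_speed(112, 0.10)))  # a 3090's 112 tok/s falls to ~62
print(round(offloaded_speed(112, 0.25)))  # ~28
print(round(offloaded_speed(112, 0.50)))  # ~4 -- no longer interactive
```

The non-linearity is the point: a model that is 90% on the GPU does not run at 90% speed, because the slow offloaded layers dominate per-token latency.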
Multi-GPU Support
Ollama supports multi-GPU on NVIDIA CUDA. If you install two RTX 3090s, Ollama automatically splits model layers across both cards, giving you 48 GB of effective VRAM. This enables 70B models at full GPU speed — something no single 24 GB card can do.
Multi-GPU is not supported on ROCm or SYCL in Ollama as of March 2026. If you want multi-GPU Ollama, NVIDIA is your only option.
For multi-GPU, the cards don’t need to be identical, but matching cards simplify load balancing. Two RTX 3090s at ~$1,600–2,100 total give you 48 GB of VRAM — more than a single RTX 5090’s 32 GB at ~$4,000+ street price, for roughly half the cost. If running 70B models in Ollama is your goal, dual 3090s is the value play.
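The dual-3090 math in one place, as a sketch using this guide's numbers ($1,850 assumes the midpoint of the ~$1,600–2,100 range):

```python
# Dual RTX 3090 sanity check against a single RTX 4090.
effective_vram_gb = 2 * 24   # Ollama splits model layers across both cards
llama3_70b_min_gb = 42       # Q4_K_M min VRAM from the table above
dual_3090_cost = 1850        # assumed midpoint of the used-pair range
rtx_4090_cost = 2755

print(effective_vram_gb >= llama3_70b_min_gb)  # 70B runs fully on GPU
print(dual_3090_cost < rtx_4090_cost)          # and the pair costs less
```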
Bottom Line
For most home lab users running Ollama, the used RTX 3090 at ~$800–1,050 is the right GPU. Twenty-four gigabytes of VRAM with full CUDA support means every Ollama model up to 33B parameters runs at interactive speed with zero configuration. ollama run llama3 generates at 112 tok/s. ollama run codellama:13b runs at 85 tok/s. It just works.
If you want the cheapest new CUDA card that handles 7B–8B models well, the RTX 4060 Ti 16GB at ~$450 is excellent for smaller models with the lowest power draw in the group.
The Intel Arc B580 at ~$250 is the absolute budget floor for Ollama, but the experimental SYCL backend means significant setup friction and 3x slower performance than CUDA. Only for tinkerers on Linux.
The RTX 4090 at ~$2,755 is the fastest option but nearly impossible to justify at 3x the price of a used 3090 for 15–20% more speed.
The RX 7900 XTX at ~$1,335 works with Ollama via ROCm on Linux, but 3x slower inference than a cheaper used 3090 makes it hard to recommend for this specific use case.
Whatever GPU you choose, pair it with enough system RAM (32 GB minimum for comfortable Ollama usage) and a fast NVMe drive for model storage. For complete build ideas, see our best mini PC for local AI guide, and check how much VRAM you need for LLMs to match your GPU to the models you want to run.
NVIDIA RTX 3090 (Used)
~$800–1,050
- VRAM
- 24 GB GDDR6X
- Bandwidth
- 936 GB/s
- TDP
- 350W
- Ollama Backend
- CUDA (native)
24 GB VRAM handles every Ollama model up to roughly 34B at Q4 quantization with full CUDA acceleration. The best dollar-per-token deal for Ollama users in 2026.
NVIDIA RTX 4060 Ti 16GB
~$450
- VRAM
- 16 GB GDDR6
- Bandwidth
- 288 GB/s
- TDP
- 165W
- Ollama Backend
- CUDA (native)
The cheapest new CUDA GPU that runs 13B Ollama models in VRAM. Low power draw makes it ideal for always-on inference servers running 7B–8B models.
Intel Arc B580
~$250
- VRAM
- 12 GB GDDR6
- Bandwidth
- 456 GB/s
- TDP
- 150W
- Ollama Backend
- SYCL (experimental)
The cheapest 12 GB GPU on the market. Ollama support via SYCL backend is experimental but functional for 7B models on Linux. Not for beginners.
NVIDIA RTX 4090
~$2,755
- VRAM
- 24 GB GDDR6X
- Bandwidth
- 1,008 GB/s
- TDP
- 450W
- Ollama Backend
- CUDA (native)
The fastest 24 GB GPU for Ollama. Maximum tok/s on every model, but at ~$2,755 for remaining post-discontinuation stock, it’s hard to justify over a used RTX 3090.
AMD Radeon RX 7900 XTX
~$1,300
- VRAM
- 24 GB GDDR6
- Bandwidth
- 960 GB/s
- TDP
- 355W
- Ollama Backend
- ROCm
24 GB VRAM with ROCm support in Ollama. The hardware is competitive but ROCm inference speed lags CUDA significantly on identical models.
Frequently Asked Questions
Does Ollama support AMD GPUs?
Yes, via the ROCm backend on Linux only. RDNA 3 cards like the RX 7900 XTX are officially supported, but expect roughly a third of CUDA throughput on the same models. ROCm acceleration is not available in Ollama on Windows.
Does Ollama work with Intel Arc GPUs?
Only through the experimental SYCL backend on Linux. It works for 7B–8B models on cards like the Arc B580, but setup requires oneAPI and command-line troubleshooting, and performance is roughly 3x slower than CUDA.
How much VRAM do I need for Ollama?
At Q4_K_M quantization, 12 GB covers 7B–8B models, 16 GB covers 13B, and 24 GB covers 33B–34B. 70B models need roughly 42 GB, which means a multi-GPU setup.
Can I use multiple GPUs with Ollama?
Yes, on NVIDIA CUDA only. Ollama automatically splits model layers across cards, so two 24 GB RTX 3090s give you 48 GB of effective VRAM, enough for 70B models. Multi-GPU is not supported on the ROCm or SYCL backends.
Why is my Ollama model running slowly?
The usual cause is partial CPU offload: the model doesn't fit in VRAM, so Ollama moves layers to system RAM and speed collapses. Run ollama ps to see how much of the model is on the GPU, and switch to a smaller model or quantization that fits entirely in VRAM.