How Much VRAM Do You Need for Local LLMs in 2026?

The single most important number when shopping for a local LLM GPU is VRAM. Not CUDA cores, not clock speed, not TDP — VRAM. If the model doesn’t fit in your GPU’s memory, it either won’t run at all or it’ll partially offload to your CPU and run painfully slowly.

This guide gives you the exact numbers: how much VRAM each model size actually needs, how quantization changes the equation, and which GPUs make sense at each budget.

The Quick Math: How VRAM Requirements Work

The formula is straightforward:

VRAM ≈ (Parameters × Bytes per Weight) + Context Overhead

A “7B” model has 7 billion parameters. Each parameter is stored as a number. The size of that number depends on the precision (quantization level):

  • FP16 (full precision): 2 bytes per parameter
  • Q8_0 (8-bit): ~1.1 bytes per parameter
  • Q5_K_M (5-bit): ~0.75 bytes per parameter
  • Q4_K_M (4-bit): ~0.6 bytes per parameter

So a 7B model at FP16: 7 billion × 2 bytes = 14 GB. The same model at Q4_K_M: 7 billion × 0.6 bytes ≈ 4.2 GB. That’s a 70% reduction in VRAM for a small quality tradeoff.

On top of the model weights, you need memory for the KV cache (which grows with context length) and framework overhead. At default 4K context, budget an extra 0.5–1.5 GB. At 32K context, budget 2–6 GB extra.
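As a sketch, the formula translates directly into code. The bytes-per-weight values below are the approximations from the list above, and the flat 1 GB default overhead is a stand-in for the 4K-context budget:

```python
# Approximate bytes per weight at each quantization level (from the list above).
BYTES_PER_WEIGHT = {
    "FP16": 2.0,
    "Q8_0": 1.1,
    "Q5_K_M": 0.75,
    "Q4_K_M": 0.6,
}

def estimate_vram_gb(params_billions: float, quant: str,
                     overhead_gb: float = 1.0) -> float:
    """Model weights plus a flat KV-cache/framework buffer, in GB."""
    weights_gb = params_billions * BYTES_PER_WEIGHT[quant]
    return round(weights_gb + overhead_gb, 1)

print(estimate_vram_gb(7, "FP16"))    # 15.0 -- the 7B full-precision case
print(estimate_vram_gb(7, "Q4_K_M"))  # 5.2 -- fits an 8 GB card
```

Swap the flat overhead for the context-table values further down when estimating long-context setups.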

VRAM Requirements by Model Size and Quantization

Here’s the table you’re looking for. These numbers include a reasonable buffer for KV cache at 4K–8K context length.

Model Size   Q4_K_M    Q5_K_M    Q8_0      FP16
7B           ~5 GB     ~6 GB     ~9 GB     ~15 GB
13B          ~8 GB     ~10 GB    ~15 GB    ~27 GB
30B          ~18 GB    ~22 GB    ~34 GB    ~62 GB
70B          ~40 GB    ~48 GB    ~78 GB    ~142 GB
120B         ~68 GB    ~82 GB    ~134 GB   ~242 GB

A few things jump out from this table:

7B models are the sweet spot for budget GPUs. A 7B Q4_K_M fits comfortably on an 8 GB card. Models like Mistral 7B, Llama 3.1 8B, and Qwen 2.5 7B deliver genuinely useful results at this size — coding assistance, summarization, and conversational AI all work well.

13B models need 8–10 GB minimum. This is where 12 GB and 16 GB cards earn their keep. The jump from 7B to 13B is noticeable for complex reasoning and instruction following.

70B models exceed any single consumer GPU at FP16. Even at Q4_K_M, a 70B model needs ~40 GB — more than the RTX 5090’s 32 GB. You’re looking at workstation GPUs, multi-GPU setups, or aggressive quantization with partial CPU offloading.

Quantization: The VRAM–Quality Tradeoff

Not all quantization levels are equal. Here’s what each level actually means for quality:

Q4_K_M — The community default for a reason. Reduces model size to roughly 30% of FP16 with minimal quality loss on most tasks. If your VRAM is limited, start here. You’ll lose some nuance on complex multi-step reasoning, but for code generation, chat, and summarization, the difference from FP16 is marginal.

Q5_K_M — A noticeable step up from Q4 in reasoning benchmarks, at about 25% more VRAM. Worth it if you have the headroom. This is the sweet spot for quality-conscious users who can afford the extra memory.

Q8_0 — Near-lossless for most tasks. Roughly 55% of FP16 size. The quality difference from FP16 is almost unmeasurable in blind testing. Choose this when VRAM allows and you want maximum quality without the cost of full precision.

FP16 — Full precision. The baseline everything else is measured against. Only practical for 7B models on consumer hardware (15 GB fits on a 16 GB card with minimal context). For larger models, you need professional or multi-GPU setups.

The practical advice: start with Q4_K_M. If the model fits with room to spare, try Q5_K_M. Only go higher if you have specific quality requirements and the VRAM budget to match.

Context Length: The Hidden VRAM Tax

Every token in your context window requires VRAM for the KV (key-value) cache. This scales with both model size and context length.

Context Length   7B Model Overhead   13B Model Overhead   70B Model Overhead
4K tokens        ~0.3 GB             ~0.5 GB              ~2 GB
8K tokens        ~0.5 GB             ~1 GB                ~4 GB
32K tokens       ~2 GB               ~4 GB                ~14 GB
128K tokens      ~8 GB               ~14 GB               ~50 GB+

This is why the VRAM table above uses 4K–8K context as the baseline. If you’re doing RAG (retrieval-augmented generation) with large document contexts, or using models with 128K context windows, you need significantly more VRAM than the model weights alone suggest.

For most home lab use cases — chatting, coding, summarization — 4K–8K context is plenty. If you need long-context processing, budget accordingly or look at techniques like sliding-window attention.
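The overhead figures above come from the standard KV-cache formula: two tensors (K and V) per layer, per token. A sketch using Llama-3.1-8B-style dimensions as an assumption (32 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16 cache):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 tensors (K and V) per layer, per token, in GiB."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 1024**3

# Assumed Llama-3.1-8B-style dimensions:
print(kv_cache_gb(32, 8, 128, 8192))    # 1.0 GiB at 8K context
print(kv_cache_gb(32, 8, 128, 131072))  # 16.0 GiB at 128K context
```

Models without grouped-query attention, or with a quantized KV cache, land several times higher or lower — which is why the table values are approximate.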

What Happens When You Run Out of VRAM

When a model doesn’t fully fit in GPU memory, inference frameworks like llama.cpp and Ollama don’t just crash. They offload layers to system RAM and run those layers on the CPU.

The speed penalty is severe. GPU inference at full VRAM residence typically delivers 30–130+ tokens per second depending on the GPU and model. CPU-offloaded layers run at 2–8 tokens per second on a typical desktop CPU.

If 80% of the model fits in VRAM and 20% offloads to CPU, your effective speed drops to roughly 40–60% of full-GPU speed. At 50/50 split, expect around 20–30% of full-GPU speed. The relationship isn’t linear because the CPU-bound layers become the bottleneck.

Partial offloading is useful for testing models, not for daily use. If you want to evaluate whether a 70B model is worth upgrading your GPU for, offloading lets you try it. But if you’re generating hundreds of responses a day, you need the model to fit entirely in VRAM.
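A back-of-the-envelope model shows why the slowdown is non-linear: each token pays GPU time for the resident layers plus CPU time for the offloaded ones, so the slow fraction dominates. The 60 and 5 tokens/second figures below are illustrative assumptions, and this is a pessimistic sketch — real frameworks that overlap work can do somewhat better:

```python
def effective_tps(gpu_fraction: float, gpu_tps: float = 60.0,
                  cpu_tps: float = 5.0) -> float:
    """Combine per-layer times: GPU-resident fraction at gpu_tps,
    offloaded fraction at cpu_tps; the slow layers set the pace."""
    cpu_fraction = 1.0 - gpu_fraction
    time_per_token = gpu_fraction / gpu_tps + cpu_fraction / cpu_tps
    return 1.0 / time_per_token

print(round(effective_tps(1.0), 1))  # 60.0 tokens/s -- fully resident
print(round(effective_tps(0.8), 1))  # ~18.8 tokens/s at an 80/20 split
print(round(effective_tps(0.5), 1))  # ~9.2 tokens/s at a 50/50 split
```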

For a deeper look at performance benchmarks, see best GPU for local LLMs.

How Much VRAM Should You Buy? The Tiers

8 GB — Entry Level (~$200–350)

Cards like the RTX 3060 8GB and RTX 4060 give you enough room for 7B models at Q4_K_M with short context. This is the minimum viable VRAM for local LLMs. You can run Llama 3.1 8B, Mistral 7B, and Phi-3 Mini comfortably. Anything larger requires quantization below Q4 or aggressive context limiting.

Best for: Experimenting with local AI, running small coding assistants, lightweight chatbots.

12 GB — Capable Budget Tier (~$250–400)

The RTX 3060 12GB is the cheapest way to get 12 GB of VRAM. It handles 7B models at Q8_0, 13B models at Q4_K_M with tight context, and provides a meaningful upgrade over 8 GB cards. The extra 4 GB opens up the entire 13B model class.

Best for: Running 13B models for better quality, longer context windows on 7B models, serious hobbyist use.

16 GB — The Sweet Spot (~$400–500)

The RTX 4060 Ti 16GB is the standout here. 16 GB fits 13B models at Q5_K_M with comfortable context headroom, and even allows 30B models at aggressive Q4 quantization with limited context. This is where local LLMs start feeling genuinely useful for daily work. For budget alternatives in this tier, see best budget GPU for local AI.

Best for: Daily coding assistant, 13B models with long context, first-time serious LLM users.

24 GB — The Home Lab Standard (~$1,335–2,800)

This is the tier where model selection stops being a constraint for most use cases. The RTX 3090 (used) at ~$1,730 is the most accessible 24 GB NVIDIA option — 24 GB of GDDR6X at a significant discount versus the RTX 4090’s ~$2,800 price. Both cards handle 30B models at Q4_K_M, 13B at Q8_0 or FP16, and 70B models with partial offloading.

The RX 7900 XTX also offers 24 GB at ~$1,335, though ROCm support for LLM frameworks still trails CUDA. See NVIDIA vs AMD for LLMs for the full comparison.

For a head-to-head of the two most popular 24 GB options, see RTX 3090 vs 4090 for LLMs.

Best for: Running the full range of models up to 30B at high quality, 70B with quantization, production-quality home inference.

32 GB — Flagship Consumer (~$2,000+)

The RTX 5090 brings 32 GB of GDDR7 with 1,792 GB/s of bandwidth. It gets close to 70B models at Q4_K_M, but not all the way there — the ~40 GB requirement exceeds 32 GB, so you’ll offload a few layers. The real advantage is running 30B models at Q8_0 with generous context windows. If you need Ollama to handle multiple concurrent requests, 32 GB gives you the buffer.

Best for: Power users running large models, multi-model serving, future-proofing.

48 GB+ — Workstation and Multi-GPU (~$2,500+)

The NVIDIA RTX A6000 (48 GB) and dual-GPU setups enter territory where 70B models at Q4_K_M fit entirely in VRAM. At this level, you can also run 120B+ models with quantization. These are workstation-class cards at workstation-class prices — the A6000 runs ~$3,500+ used.

Most home lab users don’t need this tier. But if you’re running LLMs as a service for your household or working on fine-tuning, 48 GB removes nearly all model-size constraints.

Best for: 70B models without offloading, 120B+ models with quantization, fine-tuning, multi-user serving.

The Decision Framework

If you’re still unsure, answer these two questions:

What’s the largest model you want to run? Check the VRAM table above at Q4_K_M — that’s your minimum VRAM requirement.

What context length do you need? If it’s under 8K tokens (most people), the table numbers work as-is. If you need 32K+, add the context overhead from the second table.

Buy the GPU that covers both numbers with at least 1–2 GB to spare. Don’t buy more than one tier above what you need — VRAM requirements grow with new models, but so does quantization efficiency. The GPU that’s “enough” today will likely stay “enough” for the next generation of models at the same parameter count.
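Those two questions reduce to a few lines of arithmetic. A hypothetical helper using the Q4_K_M baseline (~0.6 bytes per weight) plus the 1–2 GB spare recommended above:

```python
BPW_Q4_K_M = 0.6  # approximate Q4_K_M bytes per weight, from the quick math above

def min_vram_gb(params_billions: float, context_overhead_gb: float = 1.0,
                headroom_gb: float = 2.0) -> float:
    """Minimum VRAM target: Q4_K_M weights + KV-cache overhead + spare room."""
    return round(params_billions * BPW_Q4_K_M + context_overhead_gb + headroom_gb, 1)

print(min_vram_gb(13))                           # 10.8 -> a 12 GB card covers it
print(min_vram_gb(30, context_overhead_gb=4.0))  # 24.0 -> a 24 GB card at 32K
```

Raise `context_overhead_gb` using the context table if you plan on 32K+ windows.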

For specific GPU recommendations with benchmark data, see our complete guide to the best GPUs for local LLMs. For the best value on a budget, see best budget GPU for local AI. And for Ollama-specific setup advice, see best GPU for Ollama.

Frequently Asked Questions

Can I run a 70B parameter model on a single GPU?
Only with heavy quantization. A 70B model at Q4_K_M needs ~40 GB of VRAM, which exceeds every single consumer GPU. You'd need a dual-GPU setup, a workstation card like the RTX A6000 (48 GB), or accept CPU offloading with a significant speed penalty.
Is Q4 quantization good enough for local LLMs?
For most home lab use cases, yes. Q4_K_M is the sweet spot — it reduces VRAM usage by roughly 70% compared to FP16 while retaining the vast majority of model quality. Benchmarks show negligible perplexity loss for instruction-following and conversation tasks. You lose some nuance on complex reasoning, but the VRAM savings are enormous.
Does context length affect VRAM usage?
Yes. Longer context windows require more VRAM for the KV cache. At 4K context, overhead is minimal (under 1 GB for most models). At 32K context, expect an additional 2–6 GB depending on model architecture. At 128K context, the KV cache alone can consume 8–16 GB on larger models.
What happens if my model doesn't fit in VRAM?
Inference frameworks like llama.cpp and Ollama can offload layers to system RAM. The model still runs, but every layer on the CPU runs 5–15x slower than on the GPU. If half the model is offloaded, expect roughly 20–30% of full-GPU speed, because the CPU-bound layers become the bottleneck. It works for testing but isn't practical for regular use.
Should I buy two 12 GB GPUs or one 24 GB GPU?
One 24 GB GPU, every time. LLM inference doesn't split efficiently across GPUs the way training does. Two 12 GB cards give you 24 GB total, but the overhead of cross-GPU communication and the fact that some layers can't be cleanly split means real-world performance is worse than a single 24 GB card. Buy the biggest single GPU you can afford.
