How to Build a Home Lab for AI on a Budget
Running AI models locally stopped being a niche hobby sometime in 2025. Models got smaller and faster. Quantization got better. Ollama made the software side trivial. The only question left is hardware — and hardware is where people waste money.
This guide is about building the minimum viable AI inference setup: the cheapest path to running useful local LLMs at acceptable speed, without buying hardware you don’t need.
If you’re already running a home lab, you may have half the parts. If you’re starting from zero, you can have a working local AI setup for under $800.
GPU: The Only Component That Matters
For local LLM inference, the GPU is the build. Everything else — CPU, RAM, case — is commodity hardware that exists to feed the GPU. The two numbers that matter are VRAM (how big a model you can load) and memory bandwidth (how fast it generates tokens).
If you haven’t read How Much VRAM Do You Need for Local LLMs?, start there. It covers the math in detail. The short version:
- 7B–8B models (Llama 3.1 8B, Mistral 7B, Qwen 3 8B): Need ~5–6 GB VRAM at Q4 quantization
- 13B–14B models (Qwen 2.5 14B): Need ~8–10 GB VRAM at Q4
- 30B–34B models (Qwen 2.5 32B, DeepSeek-R1 32B): Need ~20–22 GB VRAM at Q4
- 70B models (Llama 3.1 70B): Need ~40 GB VRAM at Q4 — no single consumer GPU can do this
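Those figures follow from a rough rule of thumb: the quantized weights take roughly parameter count times bits per weight, and KV cache plus runtime overhead add a gigabyte or more on top. A quick sketch (the ~4.5 bits/weight figure is an approximation for Q4_K_M mixed quantization, not an exact spec):

```python
def quantized_weights_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of the quantized weights alone, in GB.

    ~4.5 bits/weight is a rough average for Q4_K_M; KV cache and
    runtime overhead add roughly 1-2+ GB on top of this figure.
    """
    return params_billion * bits_per_weight / 8

print(quantized_weights_gb(8))    # 4.5  -> the ~5-6 GB figure above, with overhead
print(quantized_weights_gb(32))   # 18.0 -> the ~20-22 GB figure above
print(quantized_weights_gb(70))   # 39.375 -> why 70B needs ~40 GB
```

This is an estimate, not a guarantee: longer context windows grow the KV cache, so a model that "fits" on paper can still spill over at 16K+ context.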
That makes the budget decision straightforward. Here are the tiers:
Tier 1: Entry — Used RTX 3060 12GB (~$428)
The NVIDIA RTX 3060 12GB (Used) is still the cheapest way into GPU-accelerated inference with 12 GB of VRAM. It runs 7B–8B models comfortably at Q4 or Q8 quantization, and 13B models at Q4 with shorter context windows. Token generation speed is reasonable: roughly 50 tok/s on 7B Q4 and 18 tok/s on 13B Q4.
The ceiling is real. You cannot run anything larger than 13B without heavy CPU offloading, which defeats the purpose. At ~$428 used, the RTX 3060 is no longer the bargain it once was, but it remains the lowest-cost entry point for 12 GB of VRAM on NVIDIA’s CUDA stack.
TDP: 170W. A 550W PSU handles it easily.
Tier 2: Sweet Spot — Used RTX 3090 (~$1,730)
The RTX 3090 (Used) gives you 24 GB of VRAM — the magic number that unlocks 30B+ models at Q4 and even 70B models at Q2 (not great quality, but functional). The trade-off is power: 350W TDP means you need a 750W+ PSU, and your system will pull 450–500W under load. Used prices have climbed to ~$1,730, driven by AI demand, but this is still the most capable consumer-class GPU for local inference.
The RTX 4060 Ti 16GB was previously an attractive alternative at ~$450 new with 16 GB of VRAM and a 165W TDP. It is currently unavailable at most retailers. If you can find one in stock, it remains a solid choice for 13B models — but check pricing carefully, as street prices may have shifted.
For a full comparison, see RTX 3090 vs 4090 for LLMs.
Tier 3: The Oddball — Used Tesla P40 (~$400)
The Tesla P40 (Used) is 24 GB of VRAM for ~$400. That sounds incredible until you learn the caveats: it is a passively cooled datacenter card (you need to add a fan or 3D-printed shroud), it has no video output (headless only), it runs on PCIe 3.0 with ~346 GB/s memory bandwidth (slow token generation), and its Pascal-era FP16 throughput is so crippled as to be unusable, so inference effectively runs in FP32 or INT8, with framework workarounds required.
If you have a workstation chassis with good airflow and you are comfortable solving cooling and driver issues, the P40 gets you into the 24 GB VRAM class for a fraction of the cost. But this is an enthusiast path, not a beginner one.
What About AMD?
AMD GPUs work with local LLMs via ROCm, but the software ecosystem is still behind NVIDIA’s CUDA stack. Ollama supports ROCm, and llama.cpp has ROCm backends, but you will encounter more friction — driver issues, slower community support, and lower tok/s at equivalent VRAM. Read NVIDIA vs AMD for LLMs for the full picture. For a budget AI build where you want things to just work, NVIDIA is the safer bet.
The Build: Desktop vs Mini PC
Desktop: The Right Choice for GPU Inference
A desktop (or tower workstation) is the only practical option if you want a discrete GPU. You need a PCIe x16 slot, adequate PSU wattage, and enough case volume for the GPU’s cooler.
The good news: the rest of the system is cheap. LLM inference is not CPU-intensive. The GPU does the heavy lifting. A basic build looks like:
| Component | Spec | Approximate Cost |
|---|---|---|
| CPU | Any modern quad-core (Intel i3-12100, Ryzen 5 5600) | ~$80–120 used |
| Motherboard | Any ATX board with PCIe x16 | ~$60–80 used |
| RAM | 32 GB DDR4 (for CPU offloading headroom) | ~$50 |
| PSU | 550W–750W depending on GPU | ~$60–80 |
| Storage | 500GB SATA SSD (models load from disk) | ~$30 |
| Case | Any ATX mid-tower | ~$40–50 |
| GPU | See tiers above | ~$428–1,730 |
Total system cost (excluding GPU): ~$320–410. Pair that with a used RTX 3060 and you are running local AI for under $750 total.
You do not need 64 GB of RAM unless you plan to do heavy CPU offloading of layers that don’t fit in VRAM. 32 GB is sufficient for most setups. You do not need NVMe — models load once into VRAM and stay there. A SATA SSD is fine.
Mini PC: CPU-Only Inference
Mini PCs like the Beelink SER9 are excellent home lab machines, but they cannot run discrete GPUs. You are limited to CPU inference via llama.cpp or Ollama’s CPU mode.
CPU inference on a modern 8-core chip yields roughly 5–10 tok/s on a 7B Q4 model — usable for light experimentation, but slow enough that you will not want to use it as your daily AI assistant. For anything beyond testing, you need a GPU.
That said, if you already own a mini PC, try Ollama on it before buying GPU hardware. It might be enough for your use case, and it costs nothing to test. Check best mini PC for home server for options that double as general-purpose home servers.
Power Budget
GPU inference is power-hungry compared to the rest of your home lab. A mini PC running Proxmox draws 10–15W. An AI inference rig draws 200–500W under load.
GPU Power Draw by Tier
| GPU | TDP | Typical Inference Draw | System Total (Load) |
|---|---|---|---|
| RTX 3060 12GB | 170W | ~120–150W | ~200–250W |
| RTX 4060 Ti 16GB | 165W | ~110–140W | ~200–240W |
| RTX 3090 | 350W | ~250–300W | ~400–500W |
| Tesla P40 | 250W | ~180–220W | ~300–380W |
PSU Sizing
Buy a PSU with at least 150W of headroom above your expected peak system draw. For an RTX 3060 or 4060 Ti build, a 550W unit is comfortable. For an RTX 3090, go 750W minimum. An 80 Plus Gold certified PSU saves electricity over time compared to cheaper bronze or unrated units.
UPS Protection
A GPU running inference is doing real work — generating text, processing requests, possibly serving a multi-user Open WebUI instance. A power outage during sustained GPU load can corrupt your OS drive or model files.
The CyberPower CP1500PFCLCD is a pure sine wave UPS that handles 1000W continuous load, enough for any single-GPU AI rig. At ~$230, it is cheap insurance. For help choosing the right capacity, see how to size a UPS and best UPS for home lab.
Electricity Cost
At $0.15/kWh (US average), running an RTX 3060 system 24/7 at average 120W costs roughly $13/month. An RTX 3090 system at average 200W costs roughly $22/month. During idle (no inference running), both drop to $7–12/month depending on your system’s idle draw.
These numbers matter when comparing to cloud API costs. If you use local AI for more than 3–4 hours per day, owning the hardware pays for itself within 12–18 months versus API pricing.
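The arithmetic behind those figures is straightforward. The $60/month API spend below is an illustrative assumption, not a number from any provider's price list:

```python
def monthly_kwh(avg_watts: float, hours_per_day: float = 24) -> float:
    """Energy used per 30-day month at a given average draw."""
    return avg_watts / 1000 * hours_per_day * 30

def monthly_cost(avg_watts: float, rate_per_kwh: float = 0.15) -> float:
    """Electricity cost per month at the US-average rate."""
    return monthly_kwh(avg_watts) * rate_per_kwh

print(round(monthly_cost(120), 2))  # 12.96 -> the ~$13/month RTX 3060 figure
print(round(monthly_cost(200), 2))  # 21.6  -> the ~$22/month RTX 3090 figure

# Hypothetical break-even vs. cloud APIs: $750 of hardware replacing an
# assumed $60/month of API usage, net of electricity
hardware = 750
api_monthly = 60  # assumption for illustration
print(round(hardware / (api_monthly - monthly_cost(120)), 1))  # 15.9 months
```

Plug in your own electricity rate and actual API spend; the break-even shifts quickly if either number differs from these assumptions.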
Software Stack
The software side has gotten remarkably simple. Two tools cover 90% of home lab AI use cases.
Ollama — The Engine
Ollama is the Docker of local LLMs. Install it, run ollama pull llama3.1:8b, and you have a working model in minutes. It handles model downloads, quantization selection, GPU detection, and exposes a local API compatible with OpenAI’s format.
Key models to start with:
- Llama 3.1 8B — Meta’s general-purpose model, excellent at instruction following
- Qwen 3 8B — Strong reasoning and multilingual support from Alibaba
- Mistral 7B — Fast and capable, good for quick tasks
- DeepSeek-R1 8B (distill) — Reasoning-focused, surprisingly good at math and code
- Qwen 2.5 Coder 7B — Purpose-built for code generation and completion
All of these run comfortably on 12 GB of VRAM. With 24 GB, you can step up to their 32B and 34B variants for noticeably better quality.
Install Ollama on Linux:
```shell
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b
ollama run llama3.1:8b
```
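Because Ollama also listens on a local HTTP API (port 11434 by default), the same model can serve scripts as well as the CLI. A minimal standard-library sketch; actually sending the request assumes Ollama is running locally with llama3.1:8b pulled:

```python
import json
import urllib.request

def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_generate_request("llama3.1:8b", "Why is the sky blue? One sentence.")
print(req.full_url)  # http://localhost:11434/api/generate

# To send it (requires a running Ollama instance):
#   with urllib.request.urlopen(req) as resp:
#       print(json.loads(resp.read())["response"])
```

Ollama also exposes an OpenAI-compatible endpoint, so most tooling that speaks OpenAI's chat API can be pointed at the local instance instead.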
Open WebUI — The Interface
Open WebUI gives you a ChatGPT-style web interface that talks to your local Ollama instance. It supports conversation history, multiple models, document upload (RAG), and user accounts if you want to share your setup with family or coworkers.
Run it with Docker:
```shell
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```
This connects to Ollama running on the host machine. Open http://localhost:3000 and you have a private, self-hosted AI assistant.
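If you would rather run Ollama in a container too and manage both services together, the same setup can be expressed as a Compose file. A sketch, assuming the stock images; `OLLAMA_BASE_URL` is how Open WebUI is pointed at the Ollama container:

```yaml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama:/root/.ollama
    # GPU access requires the NVIDIA Container Toolkit plus a
    # deploy.resources.reservations.devices section (omitted here)
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open-webui:/app/backend/data
    depends_on:
      - ollama
volumes:
  ollama:
  open-webui:
```

Running Ollama bare-metal with Open WebUI in Docker (as shown above) is also fine; the Compose route just keeps upgrades and restarts in one place.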
Other Tools Worth Knowing
- text-generation-webui (oobabooga) — More complex than Ollama but offers fine-grained control over model parameters, sampling settings, and extensions. Good for experimentation.
- vLLM — High-throughput serving engine. Overkill for personal use, but relevant if you are serving models to multiple users or building an application layer.
- LocalAI — OpenAI API-compatible server that supports multiple backends. Useful if you want a drop-in replacement for OpenAI’s API in your existing tools.
For most home labs, Ollama plus Open WebUI is the entire stack.
What Can You Actually Run?
This is the practical question. Here is what each VRAM tier gets you, using Q4_K_M quantization (the sweet spot for quality vs. size):
| VRAM | Models That Fit | Approximate tok/s | Quality Level |
|---|---|---|---|
| 8 GB | 7B Q4, short context only | 40–60 | Decent for simple tasks |
| 12 GB | 7B Q8, 13B Q4, 8B with 16K context | 18–52 | Good for daily use |
| 16 GB | 13B Q8, 14B Q4 with long context | 14–89 | Strong general-purpose |
| 24 GB | 32B Q4, 34B Q4, 70B Q2 (low quality) | 10–112 | Excellent for most tasks |
The quality jump from 7B to 13B–14B is significant — noticeably better reasoning, fewer hallucinations, and more coherent long-form output. The jump from 13B to 32B–34B is another clear step up, particularly for code generation and complex instructions.
If your budget forces a choice, prioritize VRAM over tok/s. A slower model that fits entirely in VRAM will always beat a faster model that partially offloads to CPU.
Common Mistakes
Buying 8 GB of VRAM. The RTX 4060 (8GB, ~$500) and similar cards feel like a deal, but 8 GB locks you into 7B models with short context. You will outgrow it within weeks. Spend the extra money for 12 GB minimum.
Ignoring power costs. An RTX 3090 system running 24/7 costs $20–25/month in electricity. Over two years, that is $500+ in electricity on top of the hardware cost. Factor this in when comparing to cloud APIs or a lower-power GPU.
Buying a second GPU instead of a bigger one. Multi-GPU inference exists, but it is slower and more complex than single-GPU inference for LLMs. Two 12 GB cards do not equal one 24 GB card. Read best GPU for local LLMs — the consistent advice is to buy the single largest GPU you can afford.
Over-speccing the CPU and RAM. A $400 CPU does nothing for inference speed. The GPU handles the compute. Put that money toward more VRAM instead. A $100 used quad-core with 32 GB of DDR4 is more than enough.
Skipping the UPS. GPU inference rigs draw significant power and are sensitive to sudden shutoffs. A $230 UPS protects $750–2,100 worth of hardware. This is not optional.
Starting with the Tesla P40. The P40 is a great second GPU purchase for someone who understands Linux driver management, passive cooling solutions, and headless server operation. It is a terrible first AI build for someone who just wants to run Ollama.
Budget Builds at a Glance
The $750 Starter
- Used desktop (i3/i5 + 16GB RAM + 500GB SSD): ~$200
- Used RTX 3060 12GB: ~$428
- 550W PSU (if needed): ~$60
Runs: 7B–8B models at full speed, 13B at Q4. Good enough to replace basic ChatGPT usage for private, unlimited conversations. GPU prices have risen significantly — this is no longer a sub-$500 build, but it remains the cheapest discrete-GPU path into local AI.
The $2,100 Sweet Spot
- Used desktop or new budget build: ~$350
- Used RTX 3090 24GB: ~$1,730
- 750W PSU: ~$80
Runs: Everything up to 32B–34B models at Q4. This is where local AI becomes genuinely competitive with cloud APIs for quality. The 30B class models handle code generation, analysis, and creative writing at a level that feels close to GPT-4 for many tasks. The RTX 3090 has gotten expensive on the used market, but no other consumer card matches its 24 GB of VRAM at this price point.
The $2,300 Complete Setup
- The $2,100 build above, plus:
- CyberPower CP1500 UPS: ~$230
Add power protection and you have a complete, production-ready home AI inference server.
Wrap-Up
The minimum viable AI home lab is a used desktop, a used GPU with 12+ GB of VRAM, and Ollama. Total cost: around $750. That gets you private, unlimited, uncensored AI inference that runs entirely on your hardware.
If you can stretch to ~$2,100, a used RTX 3090 unlocks 32B+ models that are genuinely useful for serious work. That is the build most people should aim for, even if it means saving up longer than it would have a year ago — GPU prices have risen sharply due to AI demand.
Start with how much VRAM you actually need, pick a GPU tier, and build around it. The rest of the hardware is commodity. The GPU is the build.
For a broader home lab starting point that covers networking, storage, and compute beyond AI, see home lab starter guide.
Frequently Asked Questions
What is the cheapest way to run AI models locally?
A used desktop paired with a used RTX 3060 12GB, roughly $750 total, is the cheapest discrete-GPU path. It runs 7B–8B models at full speed and 13B models at Q4.
Can I run local LLMs on a mini PC?
Yes, but only via CPU inference, which yields roughly 5–10 tok/s on a 7B Q4 model. That is fine for testing, too slow for daily use.
How much does it cost to run a home AI lab 24/7?
At $0.15/kWh, roughly $13/month for an RTX 3060 system and $22/month for an RTX 3090 system, dropping to $7–12/month at idle.
Is 8GB of VRAM enough for local AI?
It works, but it limits you to 7B models with short context windows. 12 GB is the practical minimum, and 24 GB unlocks the 30B class.