How to Build a Home Lab for AI on a Budget
Running AI models locally stopped being a niche hobby sometime in 2025. Models got smaller and faster. Quantization got better. Ollama made the software side trivial. The only question left is hardware — and hardware is where people waste money.
This guide is about building the minimum viable AI inference setup: the cheapest path to running useful local LLMs at acceptable speed, without buying hardware you don’t need.
If you’re already running a home lab, you may have half the parts. If you’re starting from zero, you can have a working local AI setup for under $800.
GPU: The Only Component That Matters
For local LLM inference, the GPU is the build. Everything else — CPU, RAM, case — is commodity hardware that exists to feed the GPU. The two numbers that matter are VRAM (how big a model you can load) and memory bandwidth (how fast it generates tokens).
If you haven’t read How Much VRAM Do You Need for Local LLMs?, start there. It covers the math in detail. The short version:
- 7B–8B models (Llama 3.1 8B, Mistral 7B, Qwen 3 8B): Need ~5–6 GB VRAM at Q4 quantization
- 13B–14B models (Qwen 2.5 14B): Need ~8–10 GB VRAM at Q4
- 30B–34B models (Qwen 2.5 32B, DeepSeek-R1 32B): Need ~20–22 GB VRAM at Q4
- 70B models (Llama 3.1 70B): Need ~40 GB VRAM at Q4 — no single consumer GPU can do this
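Those figures follow from a rough rule of thumb: the quantized weights take roughly parameter count times bits per weight, and KV cache plus runtime overhead add a gigabyte or more on top. A quick sketch (the ~4.5 bits/weight figure is an approximation for Q4_K_M mixed quantization, not an exact spec):

```python
def quantized_weights_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of the quantized weights alone, in GB.

    ~4.5 bits/weight is a rough average for Q4_K_M; KV cache and
    runtime overhead add roughly 1-2+ GB on top of this figure.
    """
    return params_billion * bits_per_weight / 8

print(quantized_weights_gb(8))    # 4.5  -> the ~5-6 GB figure above, with overhead
print(quantized_weights_gb(32))   # 18.0 -> the ~20-22 GB figure above
print(quantized_weights_gb(70))   # 39.375 -> why 70B needs ~40 GB
```

This is an estimate, not a guarantee: longer context windows grow the KV cache, so a model that "fits" on paper can still spill over at 16K+ context.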
That makes the budget decision straightforward. Here are the tiers:
Tier 1: Entry — Used RTX 3060 12GB (~$428)
The NVIDIA RTX 3060 12GB (Used) is still the cheapest way into GPU-accelerated inference with 12 GB of VRAM. It runs 7B–8B models comfortably at Q4 or Q8 quantization, and 13B models at Q4 with shorter context windows. Token generation speed is reasonable: roughly 50 tok/s on 7B Q4 and 18 tok/s on 13B Q4.
The ceiling is real. You cannot run anything larger than 13B without heavy CPU offloading, which defeats the purpose. At ~$428 used, the RTX 3060 is no longer the bargain it once was, but it remains the lowest-cost entry point for 12 GB of VRAM on NVIDIA’s CUDA stack.
TDP: 170W. A 550W PSU handles it easily.
Tier 2: Sweet Spot — Used RTX 3090 (~$1,730)
The RTX 3090 (Used) gives you 24 GB of VRAM — the magic number that unlocks 30B+ models at Q4 and even 70B models at Q2 (not great quality, but functional). The trade-off is power: 350W TDP means you need a 750W+ PSU, and your system will pull 450–500W under load. Used prices have climbed to ~$1,730, driven by AI demand, but this is still the most capable consumer-class GPU for local inference.
The RTX 4060 Ti 16GB was previously an attractive alternative at ~$450 new with 16 GB of VRAM and a 165W TDP. It is currently unavailable at most retailers. If you can find one in stock, it remains a solid choice for 13B models — but check pricing carefully, as street prices may have shifted.
For a full comparison, see RTX 3090 vs 4090 for LLMs.
Tier 3: The Oddball — Used Tesla P40 (~$400)
The Tesla P40 (Used) is 24 GB of VRAM for ~$400. That sounds incredible until you learn the caveats: it is a passively cooled datacenter card (you need to add a fan or 3D-printed shroud), it has no video output (headless only), it runs on PCIe 3.0 with ~346 GB/s memory bandwidth (slow token generation), and its Pascal-era FP16 throughput is so crippled as to be unusable, so inference effectively runs in FP32 or INT8, with framework workarounds required.
If you have a workstation chassis with good airflow and you are comfortable solving cooling and driver issues, the P40 gets you into the 24 GB VRAM class for a fraction of the cost. But this is an enthusiast path, not a beginner one.
What About AMD?
AMD GPUs work with local LLMs via ROCm, but the software ecosystem is still behind NVIDIA’s CUDA stack. Ollama supports ROCm, and llama.cpp has ROCm backends, but you will encounter more friction — driver issues, slower community support, and lower tok/s at equivalent VRAM. Read NVIDIA vs AMD for LLMs for the full picture. For a budget AI build where you want things to just work, NVIDIA is the safer bet.
The Build: Desktop vs Mini PC
Desktop: The Right Choice for GPU Inference
A desktop (or tower workstation) is the only practical option if you want a discrete GPU. You need a PCIe x16 slot, adequate PSU wattage, and enough case volume for the GPU’s cooler.
The good news: the rest of the system is cheap. LLM inference is not CPU-intensive. The GPU does the heavy lifting. A basic build looks like:
| Component | Spec | Approximate Cost |
|---|---|---|
| CPU | Any modern quad-core (Intel i3-12100, Ryzen 5 5600) | ~$80–120 used |
| Motherboard | Any ATX board with PCIe x16 | ~$60–80 used |
| RAM | 32 GB DDR4 (for CPU offloading headroom) | ~$50 |
| PSU | 550W–750W depending on GPU | ~$60–80 |
| Storage | 500GB SATA SSD (models load from disk) | ~$30 |
| Case | Any ATX mid-tower | ~$40–50 |
| GPU | See tiers above | ~$428–1,730 |
Total system cost (excluding GPU): ~$320–410. Pair that with a used RTX 3060 and you are running local AI for under $750 total.
You do not need 64 GB of RAM unless you plan to do heavy CPU offloading of layers that don’t fit in VRAM. 32 GB is sufficient for most setups. You do not need NVMe — models load once into VRAM and stay there. A SATA SSD is fine.
Mini PC: CPU-Only Inference
Mini PCs like the Beelink SER9 are excellent home lab machines, but they cannot run discrete GPUs. You are limited to CPU inference via llama.cpp or Ollama’s CPU mode.
CPU inference on a modern 8-core chip yields roughly 5–10 tok/s on a 7B Q4 model — usable for light experimentation, but slow enough that you will not want to use it as your daily AI assistant. For anything beyond testing, you need a GPU.
That said, if you already own a mini PC, try Ollama on it before buying GPU hardware. It might be enough for your use case, and it costs nothing to test. Check best mini PC for home server for options that double as general-purpose home servers.
Power Budget
GPU inference is power-hungry compared to the rest of your home lab. A mini PC running Proxmox draws 10–15W. An AI inference rig draws 200–500W under load.
GPU Power Draw by Tier
| GPU | TDP | Typical Inference Draw | System Total (Load) |
|---|---|---|---|
| RTX 3060 12GB | 170W | ~120–150W | ~200–250W |
| RTX 4060 Ti 16GB | 165W | ~110–140W | ~200–240W |
| RTX 3090 | 350W | ~250–300W | ~400–500W |
| Tesla P40 | 250W | ~180–220W | ~300–380W |
PSU Sizing
Buy a PSU with at least 150W of headroom above your expected peak system draw. For an RTX 3060 or 4060 Ti build, a 550W unit is comfortable. For an RTX 3090, go 750W minimum. An 80 Plus Gold certified PSU saves electricity over time compared to cheaper bronze or unrated units.
UPS Protection
A GPU running inference is doing real work — generating text, processing requests, possibly serving a multi-user Open WebUI instance. A power outage during sustained GPU load can corrupt your OS drive or model files.
The CyberPower CP1500PFCLCD is a pure sine wave UPS that handles 1000W continuous load, enough for any single-GPU AI rig. At ~$230, it is cheap insurance. For help choosing the right capacity, see how to size a UPS and best UPS for home lab.
Electricity Cost
At $0.15/kWh (US average), running an RTX 3060 system 24/7 at average 120W costs roughly $13/month. An RTX 3090 system at average 200W costs roughly $22/month. During idle (no inference running), both drop to $7–12/month depending on your system’s idle draw.
These numbers matter when comparing to cloud API costs. If you use local AI for more than 3–4 hours per day, owning the hardware pays for itself within 12–18 months versus API pricing.
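The arithmetic behind those figures is straightforward. The $60/month API spend below is an illustrative assumption, not a number from any provider's price list:

```python
def monthly_kwh(avg_watts: float, hours_per_day: float = 24) -> float:
    """Energy used per 30-day month at a given average draw."""
    return avg_watts / 1000 * hours_per_day * 30

def monthly_cost(avg_watts: float, rate_per_kwh: float = 0.15) -> float:
    """Electricity cost per month at the US-average rate."""
    return monthly_kwh(avg_watts) * rate_per_kwh

print(round(monthly_cost(120), 2))  # 12.96 -> the ~$13/month RTX 3060 figure
print(round(monthly_cost(200), 2))  # 21.6  -> the ~$22/month RTX 3090 figure

# Hypothetical break-even vs. cloud APIs: $750 of hardware replacing an
# assumed $60/month of API usage, net of electricity
hardware = 750
api_monthly = 60  # assumption for illustration
print(round(hardware / (api_monthly - monthly_cost(120)), 1))  # 15.9 months
```

Plug in your own electricity rate and actual API spend; the break-even shifts quickly if either number differs from these assumptions.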
Software Stack
The software side has gotten remarkably simple. Two tools cover 90% of home lab AI use cases.
Ollama — The Engine
Ollama is the Docker of local LLMs. Install it, run ollama pull llama3.1:8b, and you have a working model in minutes. It handles model downloads, quantization selection, GPU detection, and exposes a local API compatible with OpenAI’s format.
Key models to start with:
- Llama 3.1 8B — Meta’s general-purpose model, excellent at instruction following
- Qwen 3 8B — Strong reasoning and multilingual support from Alibaba
- Mistral 7B — Fast and capable, good for quick tasks
- DeepSeek-R1 8B (distill) — Reasoning-focused, surprisingly good at math and code
- Qwen 2.5 Coder 7B — Purpose-built for code generation and completion
All of these run comfortably on 12 GB of VRAM. With 24 GB, you can step up to their 32B and 34B variants for noticeably better quality.
Install Ollama on Linux:
```shell
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b
ollama run llama3.1:8b
```
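Because Ollama also listens on a local HTTP API (port 11434 by default), the same model can serve scripts as well as the CLI. A minimal standard-library sketch; actually sending the request assumes Ollama is running locally with llama3.1:8b pulled:

```python
import json
import urllib.request

def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_generate_request("llama3.1:8b", "Why is the sky blue? One sentence.")
print(req.full_url)  # http://localhost:11434/api/generate

# To send it (requires a running Ollama instance):
#   with urllib.request.urlopen(req) as resp:
#       print(json.loads(resp.read())["response"])
```

Ollama also exposes an OpenAI-compatible endpoint, so most tooling that speaks OpenAI's chat API can be pointed at the local instance instead.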
Open WebUI — The Interface
Open WebUI gives you a ChatGPT-style web interface that talks to your local Ollama instance. It supports conversation history, multiple models, document upload (RAG), and user accounts if you want to share your setup with family or coworkers.
Run it with Docker:
```shell
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```
This connects to Ollama running on the host machine. Open http://localhost:3000 and you have a private, self-hosted AI assistant.
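If you would rather run Ollama in a container too and manage both services together, the same setup can be expressed as a Compose file. A sketch, assuming the stock images; `OLLAMA_BASE_URL` is how Open WebUI is pointed at the Ollama container:

```yaml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama:/root/.ollama
    # GPU access requires the NVIDIA Container Toolkit plus a
    # deploy.resources.reservations.devices section (omitted here)
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open-webui:/app/backend/data
    depends_on:
      - ollama
volumes:
  ollama:
  open-webui:
```

Running Ollama bare-metal with Open WebUI in Docker (as shown above) is also fine; the Compose route just keeps upgrades and restarts in one place.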
Other Tools Worth Knowing
- text-generation-webui (oobabooga) — More complex than Ollama but offers fine-grained control over model parameters, sampling settings, and extensions. Good for experimentation.
- vLLM — High-throughput serving engine. Overkill for personal use, but relevant if you are serving models to multiple users or building an application layer.
- LocalAI — OpenAI API-compatible server that supports multiple backends. Useful if you want a drop-in replacement for OpenAI’s API in your existing tools.
For most home labs, Ollama plus Open WebUI is the entire stack.
What Can You Actually Run?
This is the practical question. Here is what each VRAM tier gets you, using Q4_K_M quantization (the sweet spot for quality vs. size):
| VRAM | Models That Fit | Approximate tok/s | Quality Level |
|---|---|---|---|
| 8 GB | 7B Q4, short context only | 40–60 | Decent for simple tasks |
| 12 GB | 7B Q8, 13B Q4, 8B with 16K context | 18–52 | Good for daily use |
| 16 GB | 13B Q8, 14B Q4 with long context | 14–89 | Strong general-purpose |
| 24 GB | 32B Q4, 34B Q4, 70B Q2 (low quality) | 10–112 | Excellent for most tasks |
The quality jump from 7B to 13B–14B is significant — noticeably better reasoning, fewer hallucinations, and more coherent long-form output. The jump from 13B to 32B–34B is another clear step up, particularly for code generation and complex instructions.
If your budget forces a choice, prioritize VRAM over tok/s. A slower model that fits entirely in VRAM will always beat a faster model that partially offloads to CPU.
Common Mistakes
Buying 8 GB of VRAM. The RTX 4060 (8GB, ~$500) and similar cards feel like a deal, but 8 GB locks you into 7B models with short context. You will outgrow it within weeks. Spend the extra money for 12 GB minimum.
Ignoring power costs. An RTX 3090 system running 24/7 costs $20–25/month in electricity. Over two years, that is $500+ in electricity on top of the hardware cost. Factor this in when comparing to cloud APIs or a lower-power GPU.
Buying a second GPU instead of a bigger one. Multi-GPU inference exists, but it is slower and more complex than single-GPU inference for LLMs. Two 12 GB cards do not equal one 24 GB card. Read best GPU for local LLMs — the consistent advice is to buy the single largest GPU you can afford.
Over-speccing the CPU and RAM. A $400 CPU does nothing for inference speed. The GPU handles the compute. Put that money toward more VRAM instead. A $100 used quad-core with 32 GB of DDR4 is more than enough.
Skipping the UPS. GPU inference rigs draw significant power and are sensitive to sudden shutoffs. A $230 UPS protects $750–2,100 worth of hardware. This is not optional.
Starting with the Tesla P40. The P40 is a great second GPU purchase for someone who understands Linux driver management, passive cooling solutions, and headless server operation. It is a terrible first AI build for someone who just wants to run Ollama.
Budget Builds at a Glance
The $750 Starter
- Used desktop (i3/i5 + 16GB RAM + 500GB SSD): ~$200
- Used RTX 3060 12GB: ~$428
- 550W PSU (if needed): ~$60
Runs: 7B–8B models at full speed, 13B at Q4. Good enough to replace basic ChatGPT usage for private, unlimited conversations. GPU prices have risen significantly — this is no longer a sub-$500 build, but it remains the cheapest discrete-GPU path into local AI.
The $2,100 Sweet Spot
- Used desktop or new budget build: ~$350
- Used RTX 3090 24GB: ~$1,730
- 750W PSU: ~$80
Runs: Everything up to 32B–34B models at Q4. This is where local AI becomes genuinely competitive with cloud APIs for quality. The 30B class models handle code generation, analysis, and creative writing at a level that feels close to GPT-4 for many tasks. The RTX 3090 has gotten expensive on the used market, but no other consumer card matches its 24 GB of VRAM at this price point.
The $2,300 Complete Setup
- The $2,100 build above, plus:
- CyberPower CP1500 UPS: ~$230
Add power protection and you have a complete, production-ready home AI inference server.
Wrap-Up
The minimum viable AI home lab is a used desktop, a used GPU with 12+ GB of VRAM, and Ollama. Total cost: around $750. That gets you private, unlimited, uncensored AI inference that runs entirely on your hardware.
If you can stretch to ~$2,100, a used RTX 3090 unlocks 32B+ models that are genuinely useful for serious work. That is the build most people should aim for, even if it means saving up longer than it would have a year ago — GPU prices have risen sharply due to AI demand.
Start with how much VRAM you actually need, pick a GPU tier, and build around it. The rest of the hardware is commodity. The GPU is the build.
For a broader home lab starting point that covers networking, storage, and compute beyond AI, see home lab starter guide.
Frequently Asked Questions
What is the cheapest way to run AI models locally?
A used desktop paired with a used RTX 3060 12GB, roughly $750 total, is the cheapest discrete-GPU path. It runs 7B–8B models at full speed and 13B models at Q4.
Can I run local LLMs on a mini PC?
Yes, but only via CPU inference, which yields roughly 5–10 tok/s on a 7B Q4 model. That is fine for testing, too slow for daily use.
How much does it cost to run a home AI lab 24/7?
At $0.15/kWh, roughly $13/month for an RTX 3060 system and $22/month for an RTX 3090 system, dropping to $7–12/month at idle.
Is 8GB of VRAM enough for local AI?
It works, but it limits you to 7B models with short context windows. 12 GB is the practical minimum, and 24 GB unlocks the 30B class.