Best Mini PC for Local AI and LLM Inference in 2026
Minisforum MS-S1 Max
~$3,040
Strix Halo APU with 128 GB unified memory and Radeon 8060S iGPU. The only mini PC that handles 70B quantized models.
| | ★ Minisforum MS-S1 Max (Our Pick) | Beelink SER9 Pro (Best Value) | Minisforum UM890 Pro | Beelink ME Mini | Beelink Mini S12 Pro (Budget Pick) |
|---|---|---|---|---|---|
| CPU | Ryzen AI Max+ 395 (16C/32T) | Ryzen AI 9 HX 370 (12C/24T) | Ryzen 9 8945HS (8C/16T) | Intel N150 (4C/4T) | Intel N100 (4C/4T) |
| RAM | 128 GB LPDDR5X-8000 | 32 GB LPDDR5X-8000 | 32 GB DDR5-5600 | 12 GB LPDDR5 | 16 GB DDR5-4800 |
| iGPU | Radeon 8060S (40 CU) | Radeon 890M (16 CU) | Radeon 780M (12 CU) | Intel UHD | Intel UHD |
| NPU (TOPS) | 50 TOPS | 50 TOPS | 16 TOPS | None | None |
| Memory BW | ~256 GB/s | ~128 GB/s | ~90 GB/s | ~51 GB/s | ~38 GB/s |
| Price | ~$3,040 | ~$729 | ~$855 | ~$399 | ~$170 |
Let me be direct: a mini PC will not replace an RTX 4090 or a Mac Studio for serious LLM inference. But if you want to run 7B-13B models locally — for private chat, code completion, document summarization — the current generation of AMD APUs makes this genuinely practical in a box that draws 25-65W and fits on your desk.
The landscape shifted in late 2025 when AMD’s Strix Halo APUs started shipping in mini PCs. A Ryzen AI Max+ 395 with 128 GB of unified memory can load a 70B quantized model entirely in addressable memory and generate tokens at usable speeds. That was science fiction two years ago.
This guide covers five mini PCs at five price points, from budget learning tools to a ~$3,040 AI workstation. I’ll be honest about what each can and cannot do.
What Actually Matters for Local AI on a Mini PC
Before the picks, here is the hierarchy of what matters for LLM inference on mini PC hardware. Most buyers get this wrong.
Memory capacity is king. The model has to fit in RAM. A 7B Q4 model needs ~5 GB. A 13B Q4 needs ~10 GB. A 70B Q4 needs ~40 GB. If the model doesn’t fit, it doesn’t run — or it swaps to disk and becomes unusably slow. Buy the most RAM you can afford.
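The arithmetic is simple enough to script before you download anything. A rough sketch, assuming Q4_K_M averages a bit under 5 bits per weight and ~25% runtime overhead (real GGUF files and context sizes vary):

```python
# Back-of-the-envelope check: will a quantized model fit in RAM?
# Rule of thumb: bytes ~= parameters x bits-per-weight / 8, plus overhead
# for the KV cache, activations, and the runtime. All figures are rough.

def model_size_gb(params_billions: float, bits_per_weight: float = 4.8) -> float:
    """Approximate file size of a Q4_K_M-class quantized model, in GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

def fits_in_ram(params_billions: float, ram_gb: float, overhead: float = 1.25) -> bool:
    """True if the model plus ~25% runtime overhead fits in the given RAM."""
    return model_size_gb(params_billions) * overhead < ram_gb

for size, ram in [(7, 16), (13, 16), (13, 32), (70, 32), (70, 96), (70, 128)]:
    print(f"{size}B Q4 in {ram} GB RAM: "
          f"~{model_size_gb(size):.0f} GB file, fits={fits_in_ram(size, ram)}")
```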
Memory bandwidth determines speed. LLM token generation is memory-bandwidth-bound, not compute-bound. DDR5-5600 delivers roughly 90 GB/s in dual-channel, which translates to ~12 tok/s on a 7B Q4 model via CPU inference. LPDDR5X-8000 pushes ~256 GB/s in quad-channel, nearly tripling that. Faster memory means faster tokens.
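You can estimate the ceiling yourself: every generated token has to stream roughly the whole model out of memory, so bandwidth divided by model size gives an upper bound. A quick sketch; real-world throughput lands well below this ceiling:

```python
# Rough upper bound on token generation: each token reads (roughly) every
# model weight once, so peak tok/s ~= memory bandwidth / model size.
# Measured numbers typically come in at 30-70% of this figure.

def peak_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

configs = {
    "DDR5-5600 dual-channel (~90 GB/s)": 90,
    "LPDDR5X-8000 dual-channel (~128 GB/s)": 128,
    "LPDDR5X-8000 quad-channel (~256 GB/s)": 256,
}
for name, bw in configs.items():
    print(f"{name}: 7B Q4 (~5 GB) ceiling ~= {peak_tokens_per_second(bw, 5):.0f} tok/s")
```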
iGPU compute matters more than NPU. The Radeon 890M and 780M integrated GPUs can accelerate inference via Vulkan or ROCm offloading in llama.cpp. Real-world improvement: 50-100% faster token generation compared to CPU-only on the same chip. The NPU, despite AMD and Intel marketing, does not accelerate LLM inference in any mainstream framework as of March 2026.
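If you want to try iGPU offloading, here is a minimal sketch using the llama-cpp-python bindings. It assumes a build with Vulkan (or ROCm) enabled, since the default pip install is CPU-only, and the model path is a placeholder:

```python
# Minimal iGPU-offload sketch with llama-cpp-python (Python bindings for
# llama.cpp). Assumes the package was built with Vulkan or ROCm support;
# a stock pip install runs CPU-only. The model path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,   # offload all layers to the iGPU; set 0 for a CPU-only baseline
    n_ctx=4096,        # context window; larger windows consume more unified memory
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain what memory bandwidth does for LLM inference."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```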
NPU is a future bet, not a current feature. Ollama, llama.cpp, vLLM, and LM Studio do not offload to the NPU. The 50 TOPS NPU in a Ryzen AI 9 HX 370 accelerates Windows Copilot features and video calls. For LLM workloads, it sits idle. This will likely change, but buying based on NPU TOPS today is buying based on promises.
Our Pick: Minisforum MS-S1 Max
The Minisforum MS-S1 Max is the first mini PC I have tested that can run models a GPU server would usually handle. The Ryzen AI Max+ 395 “Strix Halo” APU is in a different class from everything else on this list.
- CPU: AMD Ryzen AI Max+ 395, 16 cores / 32 threads, Zen 5
- RAM: 128 GB LPDDR5X-8000 (quad-channel, soldered)
- iGPU: Radeon 8060S, 40 CUs, RDNA 3.5
- NPU: XDNA 2 — 50 TOPS
- Storage: 2x M.2 PCIe 5.0 NVMe + internal half-height PCIe slot
- Networking: 2x 10GbE RJ-45 + USB4 v2
- TDP: 110–160W (four configurable power modes)
- Price: ~$3,040
The headline number is 128 GB of unified memory. Because Strix Halo uses an APU architecture with shared memory, the Radeon 8060S iGPU can address all 128 GB — not just an 8 or 16 GB VRAM partition. This means you can load a Llama 3 70B model at Q4 quantization (~40 GB) and still have 80+ GB free for the OS, context window, and other workloads. On a 7B Q4 model, the iGPU delivers 15-20 tok/s via Vulkan offloading in llama.cpp. On a 13B model, expect 10-15 tok/s. On 70B Q4, you are looking at 5-8 tok/s — slow by GPU server standards but functional for single-user chat.
The dual 10GbE ports are not just for show. If you run this as a shared inference endpoint on your home network — Ollama with an API exposed to other machines — 10GbE means multiple clients can query simultaneously without network bottleneck. The USB4 v2 ports provide an eGPU path if you eventually want to add a discrete card.
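From a client on the LAN, a shared endpoint is just an HTTP call. A sketch assuming Ollama on the MS-S1 Max is set to listen on the network (OLLAMA_HOST=0.0.0.0) and the model has already been pulled; the IP address is a placeholder:

```python
# Querying a shared Ollama endpoint over the LAN. Assumes Ollama on the
# mini PC listens on the network (e.g. OLLAMA_HOST=0.0.0.0) and the model
# tag below has been pulled. The host address is a placeholder.
import requests

OLLAMA_URL = "http://192.168.1.50:11434/api/generate"  # placeholder LAN address

resp = requests.post(
    OLLAMA_URL,
    json={
        "model": "llama3:8b",   # any pulled model tag
        "prompt": "Draft a short status update about last night's backup job.",
        "stream": False,         # return one JSON object instead of a token stream
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```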
The trade-offs are real. At ~$3,040, this costs more than a capable desktop with an RTX 3060 12GB, which would outperform it on 7B-13B models. The LPDDR5X is soldered, so 128 GB is both the floor and the ceiling. ROCm support for RDNA 3.5 integrated graphics is still being stabilized — most users will get the best results with Vulkan offloading in llama.cpp rather than full ROCm acceleration. And the built-in 320W PSU driving a 160W TDP means fan noise is noticeable under sustained load.
Who should buy it: anyone who wants to run large models (30B-70B) in a compact form factor and values the unified memory architecture over raw GPU compute speed. This is a legitimate AI development workstation in a mini PC chassis. For GPU-first inference, see our best GPU for local LLMs guide instead.
Best Value: Beelink SER9 Pro
The Beelink SER9 Pro is where most home lab builders should start for local AI. At ~$729, it delivers the latest Zen 5 silicon with a genuinely capable iGPU for inference — at a quarter of the MS-S1 Max price.
- CPU: AMD Ryzen AI 9 HX 370, 12 cores / 24 threads, Zen 5
- RAM: 32 GB LPDDR5X-8000 (soldered)
- iGPU: Radeon 890M, 16 CUs, RDNA 3.5
- NPU: XDNA 2 — 50 TOPS
- Storage: 2x M.2 2280 PCIe 4.0
- TDP: 28–65W
- Price: ~$729
The Ryzen AI 9 HX 370 is the sweet spot of AMD’s current AI-focused mobile lineup. Twelve Zen 5 cores provide fast CPU inference, and the Radeon 890M with 16 compute units offers meaningful iGPU acceleration. On Llama 3 8B at Q4_K_M quantization, expect 18-25 tok/s with iGPU offloading — fast enough that responses feel conversational rather than painful.
The 32 GB RAM ceiling is the hard constraint. You can run 7B models with plenty of headroom and 13B models with a tight but workable margin. Anything larger — 30B, 70B — simply will not fit. If you know you need larger models, the UM890 Pro (upgradeable to 96 GB) or the MS-S1 Max (128 GB) are the only paths forward in the mini PC form factor.
LPDDR5X-8000 memory provides strong bandwidth for inference. At ~128 GB/s in dual-channel, it is meaningfully faster than the DDR5-5600 in the UM890 Pro. That bandwidth advantage translates directly to faster token generation — roughly 20-30% faster on identical models and quantization levels.
The SER9 Pro also happens to be an excellent home server — the 12 Zen 5 cores handle Proxmox VMs and Docker containers comfortably alongside occasional AI inference workloads. At 65W max TDP, power consumption stays reasonable for a machine you might run 24/7.
Upgradeable RAM: Minisforum UM890 Pro
The Minisforum UM890 Pro occupies a unique position in this guide: it is the only mini PC here with user-upgradeable RAM. That makes it the cheapest path to running 70B quantized models on a mini PC — if you are willing to accept slow CPU-only inference.
- CPU: AMD Ryzen 9 8945HS, 8 cores / 16 threads, Zen 4
- RAM: 32 GB DDR5-5600 SO-DIMM (two slots, upgradeable to 96 GB)
- iGPU: Radeon 780M, 12 CUs, RDNA 3
- NPU: XDNA 1 — 16 TOPS
- Storage: 2x M.2 2280 PCIe 4.0
- Networking: 1x 2.5GbE + 1x 1GbE + WiFi 6E
- TDP: 45–70W
- Price: ~$855 (32 GB configuration)
Buy the 32 GB configuration for ~$855 and run 7B-13B models at 12-18 tok/s with Radeon 780M iGPU offloading. When you are ready for larger models, swap in two 48 GB SO-DIMMs for 96 GB total and load a 70B Q4 model for pure CPU inference. That CPU inference will run at 2-4 tok/s — genuinely slow, but functional if you are running batch processing, automated summarization, or API queries where latency tolerance is higher.
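This is the kind of latency-tolerant job I mean. A sketch that summarizes a folder of text files through the local Ollama API overnight; the model tag and directory names are placeholders:

```python
# Batch summarization against a local Ollama endpoint. At 2-4 tok/s a 70B
# model is too slow for chat, but an unattended overnight loop does not care.
# The model tag and directory names are placeholders.
from pathlib import Path
import requests

MODEL = "llama3:70b"          # any pulled 70B-class model tag
OUT_DIR = Path("summaries")
OUT_DIR.mkdir(exist_ok=True)

for doc in sorted(Path("documents").glob("*.txt")):
    text = doc.read_text()
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": MODEL,
            "prompt": f"Summarize the following document in five bullet points:\n\n{text}",
            "stream": False,
        },
        timeout=3600,  # generous timeout: slow CPU inference, long documents
    )
    resp.raise_for_status()
    (OUT_DIR / f"{doc.stem}.summary.txt").write_text(resp.json()["response"])
    print(f"done: {doc.name}")
```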
The OCuLink port is worth mentioning for AI use cases. You can connect an external GPU enclosure with a desktop-class GPU — an RTX 3060 12GB or RTX 4060 Ti 16GB — and get real GPU-accelerated inference without building a full desktop. OCuLink delivers PCIe 4.0 x4 bandwidth, which is sufficient for inference workloads (unlike training, which needs more bandwidth). This gives the UM890 Pro the best upgrade path of any machine in this guide.
The Radeon 780M is a generation behind the 890M in the SER9 Pro. With 12 CUs versus 16, and RDNA 3 versus 3.5, it is roughly 30% slower for iGPU-accelerated inference at the same model size. The 16 TOPS NPU (XDNA 1) is the oldest NPU architecture here and is even less likely to gain framework support than the newer XDNA 2.
The UM890 Pro is the right choice if you want flexibility — upgradeable RAM, OCuLink eGPU support, and an upgrade path to 96 GB for future workloads.
Budget Option: Beelink Mini S12 Pro (N100)
The Beelink Mini S12 Pro is not a serious AI inference machine — and it is currently unavailable. I included it because at ~$170 it was the cheapest way to learn the local AI toolchain — and learning the toolchain has value even before you invest in faster hardware.
- CPU: Intel N100, 4 cores / 4 threads, up to 3.4 GHz
- RAM: 16 GB DDR5-4800
- iGPU: Intel UHD (24 execution units)
- NPU: None
- TDP: 6W
- Price: ~$170
Install Ollama, download a 7B Q4 model, and start experimenting with local inference at 6-9 tok/s. That is roughly one word per second — noticeably slow but fast enough to evaluate model quality, test API integrations, and build applications that you will later deploy on faster hardware. The Intel UHD iGPU has no meaningful compute capability for inference. This is CPU-only.
The 16 GB RAM ceiling means 7B models only. A 13B Q4 model needs ~10 GB, leaving only 6 GB for the OS and runtime — it will work but with heavy memory pressure. Anything larger is a non-starter.
At 6W idle, this machine costs under $8/year to run 24/7 at US average electricity rates. If you want a dedicated Ollama endpoint that is always available for quick questions, code completion, or home automation integrations, the N100 earns its place despite the speed limitations.
What About the Beelink ME Mini?
I researched the Beelink ME Mini as a potential mid-range option, but it does not belong in an AI inference guide. The Intel N150 processor has four cores, 12 GB of LPDDR5 RAM, no NPU, and no meaningful iGPU compute. Its primary design is as a compact NAS with six M.2 slots — a storage-focused machine, not a compute-focused one. Performance would be marginally better than the N100 for inference but at a higher price point, making it a poor value for this specific use case.
If you need a compact storage box that can run lightweight 7B models on the side, the ME Mini works. But for dedicated AI inference, the SER9 Pro at ~$729 is a dramatically better investment.
The NPU Question: Honest Assessment
Every mini PC with a recent AMD or Intel processor now ships with an NPU, and marketing departments want you to believe this matters for local AI. Here is the reality as of March 2026.
What the NPU does today: Accelerates Windows Copilot+ features (live captions, image generation in Paint, Recall), video call background blur, and some image processing pipelines. These are real features that work.
What the NPU does not do today: Accelerate LLM token generation in Ollama, llama.cpp, LM Studio, vLLM, or any other mainstream inference framework. The XDNA 2 architecture in the Ryzen AI 9 HX 370 delivers 50 TOPS — impressive on paper — but no framework currently offloads the transformer attention mechanism or matrix multiplications to it.
Will this change? Probably. AMD is pushing the XDNA SDK and has demonstrated LLM inference on NPU hardware at trade shows. Intel is doing the same with AI Boost. But “demonstrated at a trade show” and “works reliably in production” are different things. I would not pay a premium for NPU TOPS today with the expectation of future LLM support.
What actually accelerates LLM inference on a mini PC: The iGPU (Radeon 890M, 780M) via Vulkan or ROCm offloading in llama.cpp, and fast memory bandwidth (LPDDR5X > DDR5 > DDR4). These work today and deliver measurable speedups.
Realistic Performance Expectations
Here is what you can actually expect from each tier, running Ollama with the llama.cpp backend at Q4_K_M quantization:
| Mini PC | Inference Method | 7B Q4 tok/s | 13B Q4 tok/s | 70B Q4 tok/s |
|---|---|---|---|---|
| MS-S1 Max (128 GB) | iGPU offload | 15-20 | 10-15 | 5-8 |
| SER9 Pro (32 GB) | iGPU offload | 18-25 | 10-14 | N/A (RAM) |
| UM890 Pro (96 GB) | CPU + iGPU | 12-18 | 8-12 | 2-4 |
| N100 (16 GB) | CPU only | 6-9 | 4-6* | N/A (RAM) |
*13B on the N100 runs under severe memory pressure and may swap.
For context: an RTX 4090 runs Llama 3 8B Q4 at 100-130 tok/s. A Mac Mini M4 Pro with 48 GB unified memory runs it at 40-60 tok/s. Mini PCs are playing in a different league — the question is whether the league they play in is fast enough for your use case.
For single-user interactive chat, 15+ tok/s feels responsive. For code completion integrations, 10+ tok/s is workable. For batch processing where latency does not matter, even 5 tok/s is acceptable. Below 5 tok/s, you are testing patience.
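To see where your own machine lands in this table, Ollama's non-streaming API response includes the counters needed to compute tok/s. A minimal sketch, assuming a local Ollama install with a pulled model:

```python
# Measure your own generation speed: Ollama's non-streaming /api/generate
# response includes eval_count (tokens generated) and eval_duration
# (nanoseconds spent generating), which is enough to compute tok/s.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3:8b", "prompt": "Explain RAID 5 in two sentences.", "stream": False},
    timeout=300,
).json()

tokens = resp["eval_count"]
seconds = resp["eval_duration"] / 1e9   # eval_duration is reported in nanoseconds
print(f"{tokens} tokens in {seconds:.1f}s -> {tokens / seconds:.1f} tok/s")
```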
My Recommendation
If your budget allows it and you want to run large models in a compact form factor, the Minisforum MS-S1 Max with 128 GB unified memory is the clear pick. Nothing else in the mini PC category can load a 70B model and generate tokens at usable speeds.
For most home lab builders, the Beelink SER9 Pro at ~$729 hits the right balance. It runs 7B-13B models at interactive speeds and doubles as a capable home server. The 32 GB RAM ceiling is a real limitation — but for the 7B-13B models that are practical for personal use, it is sufficient.
If upgradeability matters, the Minisforum UM890 Pro with its SO-DIMM slots and OCuLink port gives you a path to 96 GB RAM and eGPU acceleration. It is the most future-proof option at ~$855.
And if you just want to learn — install Ollama, experiment with prompts, build API integrations — an N100 or N150 mini PC is the cheapest way to start — the Beelink Mini S12 Pro is currently unavailable, but check our N100/N150 roundup for alternatives. Slow inference is better than no inference.
For workloads that genuinely need GPU-class performance — fine-tuning, 70B at production speeds, multiple concurrent users — a mini PC is not the right tool. See our best GPU for local LLMs guide for those use cases.
Minisforum MS-S1 Max
~$3,040
- CPU: AMD Ryzen AI Max+ 395 (16C/32T, Zen 5)
- RAM: 128 GB LPDDR5X-8000 (unified, soldered)
- iGPU: Radeon 8060S (40 CU, RDNA 3.5)
- NPU: XDNA 2 — 50 TOPS
- Storage: 2x M.2 PCIe 5.0 + half-height PCIe slot
- Networking: 2x 10GbE + USB4 v2
- TDP: 110–160W (configurable)
- Price: ~$3,040
The first mini PC that can genuinely run 70B quantized LLMs at usable speeds. 128 GB of unified memory means no VRAM bottleneck — the Radeon 8060S iGPU sees all 128 GB as addressable memory for inference. Dual 10GbE makes it viable as a shared inference server.
Beelink SER9 Pro
~$729
- CPU: AMD Ryzen AI 9 HX 370 (12C/24T, Zen 5)
- RAM: 32 GB LPDDR5X-8000 (soldered)
- iGPU: Radeon 890M (16 CU, RDNA 3.5)
- NPU: XDNA 2 — 50 TOPS
- Storage: 2x M.2 2280 PCIe 4.0
- Networking: WiFi 6 + 2.5GbE
- TDP: 28–65W
- Price: ~$729
The sweet spot for local AI on a budget. The Ryzen AI 9 HX 370 with 12 Zen 5 cores and Radeon 890M iGPU handles 7B models at 18-25 tok/s. 32 GB RAM is the ceiling — enough for 7B-13B quantized models but not larger.
Minisforum UM890 Pro
~$855
- CPU: AMD Ryzen 9 8945HS (8C/16T, Zen 4)
- RAM: 32 GB DDR5-5600 (upgradeable to 96 GB)
- iGPU: Radeon 780M (12 CU, RDNA 3)
- NPU: XDNA 1 — 16 TOPS
- Storage: 2x M.2 2280 PCIe 4.0
- Networking: 2x 2.5GbE + WiFi 6E
- TDP: 45–70W
- Price: ~$855
The only mini PC here with upgradeable RAM — max 96 GB via standard SO-DIMM slots. That means you can load 70B Q4 models into memory for CPU inference, trading speed for model size. The Radeon 780M is older but functional for iGPU offload on smaller models.
Beelink Mini S12 Pro
~$170
- CPU: Intel N100 (4C/4T, up to 3.4 GHz)
- RAM: 16 GB DDR5-4800
- iGPU: Intel UHD (24 EU)
- NPU: None
- Storage: 1x M.2 2280 PCIe 3.0
- Networking: 2x 1GbE + WiFi 6
- TDP: 6W
- Price: ~$170
A $170 test bench for local AI experimentation. The N100 runs 7B Q4 models at 6-9 tok/s via CPU — slow but functional for single-user chat. No NPU, no meaningful iGPU compute. This is about learning the tooling, not production inference.
Frequently Asked Questions
Can a mini PC actually run local LLMs?
Yes. Current AMD APU mini PCs run 7B-13B quantized models at 10-25 tok/s with iGPU offloading, which is enough for single-user chat and code completion. 70B models need 64 GB or more of addressable memory; in this guide only the MS-S1 Max (or a UM890 Pro upgraded to 96 GB, on CPU alone) can load them, at 2-8 tok/s.
Does the NPU help with LLM inference?
Not as of March 2026. Ollama, llama.cpp, LM Studio, and vLLM do not offload to the NPU; it accelerates Windows Copilot+ features instead. The iGPU and memory bandwidth are what actually speed up token generation.
How much RAM do I need for local AI inference?
Roughly 5 GB for a 7B Q4 model, 10 GB for 13B Q4, and 40 GB for 70B Q4, plus headroom for the OS and context window. 16 GB covers 7B only, 32 GB handles 7B-13B comfortably, and 64 GB or more is needed for 70B.
Is DDR5 speed important for LLM inference?
Yes. Token generation is memory-bandwidth-bound, so faster memory means faster tokens: quad-channel LPDDR5X-8000 (~256 GB/s) generates tokens roughly two to three times faster than dual-channel DDR5-5600 (~90 GB/s) on the same model and quantization.
Should I buy a mini PC or a GPU server for local AI?
A mini PC is the right call for private, single-user inference on 7B-13B models at 25-65W. For fine-tuning, 70B at production speeds, or multiple concurrent users, a machine with a discrete GPU is the better tool; see our best GPU for local LLMs guide.