Best Mini PC for Local AI and LLM Inference in 2026
Minisforum MS-S1 Max
~$3,040
Strix Halo APU with 128 GB unified memory and Radeon 8060S iGPU. The only mini PC that handles 70B quantized models.
| | ★ Minisforum MS-S1 Max (Our Pick) | Beelink SER9 Pro (Best Value) | Minisforum UM890 Pro | Beelink ME Mini | Beelink Mini S12 Pro (Budget Pick) |
|---|---|---|---|---|---|
| CPU | Ryzen AI Max+ 395 (16C/32T) | Ryzen AI 9 HX 370 (12C/24T) | Ryzen 9 8945HS (8C/16T) | Intel N150 (4C/4T) | Intel N100 (4C/4T) |
| RAM | 128 GB LPDDR5X-8000 | 32 GB LPDDR5X-8000 | 32 GB DDR5-5600 | 12 GB LPDDR5 | 16 GB DDR5-4800 |
| iGPU | Radeon 8060S (40 CU) | Radeon 890M (16 CU) | Radeon 780M (12 CU) | Intel UHD | Intel UHD |
| NPU (TOPS) | 50 TOPS | 50 TOPS | 16 TOPS | None | None |
| Memory BW | ~256 GB/s | ~128 GB/s | ~90 GB/s | ~51 GB/s | ~38 GB/s |
| Price | ~$3,040 | ~$729 | ~$855 | ~$399 | ~$170 |
Let me be direct: a mini PC will not replace an RTX 4090 or a Mac Studio for serious LLM inference. But if you want to run 7B-13B models locally — for private chat, code completion, document summarization — the current generation of AMD APUs makes this genuinely practical in a box that draws 25-65W and fits on your desk.
The landscape shifted in late 2025 when AMD’s Strix Halo APUs started shipping in mini PCs. A Ryzen AI Max+ 395 with 128 GB of unified memory can load a 70B quantized model entirely in addressable memory and generate tokens at usable speeds. That was science fiction two years ago.
This guide covers five mini PCs at five price points, from budget learning tools to a ~$3,040 AI workstation. I’ll be honest about what each can and cannot do.
What Actually Matters for Local AI on a Mini PC
Before the picks, here is the hierarchy of what matters for LLM inference on mini PC hardware. Most buyers get this wrong.
Memory capacity is king. The model has to fit in RAM. A 7B Q4 model needs ~5 GB. A 13B Q4 needs ~10 GB. A 70B Q4 needs ~40 GB. If the model doesn’t fit, it doesn’t run — or it swaps to disk and becomes unusably slow. Buy the most RAM you can afford.
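The arithmetic is simple enough to script before you download anything. A rough sketch, assuming Q4_K_M averages a bit under 5 bits per weight and ~25% runtime overhead (real GGUF files and context sizes vary):

```python
# Back-of-the-envelope check: will a quantized model fit in RAM?
# Rule of thumb: bytes ~= parameters x bits-per-weight / 8, plus overhead
# for the KV cache, activations, and the runtime. All figures are rough.

def model_size_gb(params_billions: float, bits_per_weight: float = 4.8) -> float:
    """Approximate file size of a Q4_K_M-class quantized model, in GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

def fits_in_ram(params_billions: float, ram_gb: float, overhead: float = 1.25) -> bool:
    """True if the model plus ~25% runtime overhead fits in the given RAM."""
    return model_size_gb(params_billions) * overhead < ram_gb

for size, ram in [(7, 16), (13, 16), (13, 32), (70, 32), (70, 96), (70, 128)]:
    print(f"{size}B Q4 in {ram} GB RAM: "
          f"~{model_size_gb(size):.0f} GB file, fits={fits_in_ram(size, ram)}")
```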
Memory bandwidth determines speed. LLM token generation is memory-bandwidth-bound, not compute-bound. DDR5-5600 delivers roughly 90 GB/s in dual-channel, which translates to ~12 tok/s on a 7B Q4 model via CPU inference. LPDDR5X-8000 pushes ~256 GB/s in quad-channel, nearly tripling that. Faster memory means faster tokens.
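You can estimate the ceiling yourself: every generated token has to stream roughly the whole model out of memory, so bandwidth divided by model size gives an upper bound. A quick sketch; real-world throughput lands well below this ceiling:

```python
# Rough upper bound on token generation: each token reads (roughly) every
# model weight once, so peak tok/s ~= memory bandwidth / model size.
# Measured numbers typically come in at 30-70% of this figure.

def peak_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

configs = {
    "DDR5-5600 dual-channel (~90 GB/s)": 90,
    "LPDDR5X-8000 dual-channel (~128 GB/s)": 128,
    "LPDDR5X-8000 quad-channel (~256 GB/s)": 256,
}
for name, bw in configs.items():
    print(f"{name}: 7B Q4 (~5 GB) ceiling ~= {peak_tokens_per_second(bw, 5):.0f} tok/s")
```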
iGPU compute matters more than NPU. The Radeon 890M and 780M integrated GPUs can accelerate inference via Vulkan or ROCm offloading in llama.cpp. Real-world improvement: 50-100% faster token generation compared to CPU-only on the same chip. The NPU, despite AMD and Intel marketing, does not accelerate LLM inference in any mainstream framework as of March 2026.
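If you want to try iGPU offloading, here is a minimal sketch using the llama-cpp-python bindings. It assumes a build with Vulkan (or ROCm) enabled, since the default pip install is CPU-only, and the model path is a placeholder:

```python
# Minimal iGPU-offload sketch with llama-cpp-python (Python bindings for
# llama.cpp). Assumes the package was built with Vulkan or ROCm support;
# a stock pip install runs CPU-only. The model path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,   # offload all layers to the iGPU; set 0 for a CPU-only baseline
    n_ctx=4096,        # context window; larger windows consume more unified memory
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain what memory bandwidth does for LLM inference."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```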
NPU is a future bet, not a current feature. Ollama, llama.cpp, vLLM, and LM Studio do not offload to the NPU. The 50 TOPS NPU in a Ryzen AI 9 HX 370 accelerates Windows Copilot features and video calls. For LLM workloads, it sits idle. This will likely change, but buying based on NPU TOPS today is buying based on promises.
Our Pick: Minisforum MS-S1 Max
The Minisforum MS-S1 Max is the first mini PC I have tested that can run models a GPU server would usually handle. The Ryzen AI Max+ 395 “Strix Halo” APU is in a different class from everything else on this list.
- CPU: AMD Ryzen AI Max+ 395, 16 cores / 32 threads, Zen 5
- RAM: 128 GB LPDDR5X-8000 (quad-channel, soldered)
- iGPU: Radeon 8060S, 40 CUs, RDNA 3.5
- NPU: XDNA 2 — 50 TOPS
- Storage: 2x M.2 PCIe 5.0 NVMe + internal half-height PCIe slot
- Networking: 2x 10GbE RJ-45 + USB4 v2
- TDP: 110–160W (four configurable power modes)
- Price: ~$3,040
The headline number is 128 GB of unified memory. Because Strix Halo uses an APU architecture with shared memory, the Radeon 8060S iGPU can address all 128 GB — not just an 8 or 16 GB VRAM partition. This means you can load a Llama 3 70B model at Q4 quantization (~40 GB) and still have 80+ GB free for the OS, context window, and other workloads. On a 7B Q4 model, the iGPU delivers 15-20 tok/s via Vulkan offloading in llama.cpp. On a 13B model, expect 10-15 tok/s. On 70B Q4, you are looking at 5-8 tok/s — slow by GPU server standards but functional for single-user chat.
The dual 10GbE ports are not just for show. If you run this as a shared inference endpoint on your home network — Ollama with an API exposed to other machines — 10GbE means multiple clients can query simultaneously without network bottleneck. The USB4 v2 ports provide an eGPU path if you eventually want to add a discrete card.
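From a client on the LAN, a shared endpoint is just an HTTP call. A sketch assuming Ollama on the MS-S1 Max is set to listen on the network (OLLAMA_HOST=0.0.0.0) and the model has already been pulled; the IP address is a placeholder:

```python
# Querying a shared Ollama endpoint over the LAN. Assumes Ollama on the
# mini PC listens on the network (e.g. OLLAMA_HOST=0.0.0.0) and the model
# tag below has been pulled. The host address is a placeholder.
import requests

OLLAMA_URL = "http://192.168.1.50:11434/api/generate"  # placeholder LAN address

resp = requests.post(
    OLLAMA_URL,
    json={
        "model": "llama3:8b",   # any pulled model tag
        "prompt": "Draft a short status update about last night's backup job.",
        "stream": False,         # return one JSON object instead of a token stream
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```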
The trade-offs are real. At ~$3,040, this costs more than a capable desktop with an RTX 3060 12GB, which would outperform it on 7B-13B models. The LPDDR5X is soldered, so 128 GB is both the floor and the ceiling. ROCm support for RDNA 3.5 integrated graphics is still being stabilized — most users will get the best results with Vulkan offloading in llama.cpp rather than full ROCm acceleration. And the built-in 320W PSU driving a 160W TDP means fan noise is noticeable under sustained load.
Who should buy it: anyone who wants to run large models (30B-70B) in a compact form factor and values the unified memory architecture over raw GPU compute speed. This is a legitimate AI development workstation in a mini PC chassis. For GPU-first inference, see our best GPU for local LLMs guide instead.
Best Value: Beelink SER9 Pro
The Beelink SER9 Pro is where most home lab builders should start for local AI. At ~$729, it delivers the latest Zen 5 silicon with a genuinely capable iGPU for inference — at a quarter of the MS-S1 Max price.
- CPU: AMD Ryzen AI 9 HX 370, 12 cores / 24 threads, Zen 5
- RAM: 32 GB LPDDR5X-8000 (soldered)
- iGPU: Radeon 890M, 16 CUs, RDNA 3.5
- NPU: XDNA 2 — 50 TOPS
- Storage: 2x M.2 2280 PCIe 4.0
- TDP: 28–65W
- Price: ~$729
The Ryzen AI 9 HX 370 is the sweet spot of AMD’s current AI-focused mobile lineup. Twelve Zen 5 cores provide fast CPU inference, and the Radeon 890M with 16 compute units offers meaningful iGPU acceleration. On Llama 3 8B at Q4_K_M quantization, expect 18-25 tok/s with iGPU offloading — fast enough that responses feel conversational rather than painful.
The 32 GB RAM ceiling is the hard constraint. You can run 7B models with plenty of headroom and 13B models with a tight but workable margin. Anything larger — 30B, 70B — simply will not fit. If you know you need larger models, the UM890 Pro (upgradeable to 96 GB) or the MS-S1 Max (128 GB) are the only paths forward in the mini PC form factor.
LPDDR5X-8000 memory provides strong bandwidth for inference. At ~128 GB/s in dual-channel, it is meaningfully faster than the DDR5-5600 in the UM890 Pro. That bandwidth advantage translates directly to faster token generation — roughly 20-30% faster on identical models and quantization levels.
The SER9 Pro also happens to be an excellent home server — the 12 Zen 5 cores handle Proxmox VMs and Docker containers comfortably alongside occasional AI inference workloads. At 65W max TDP, power consumption stays reasonable for a machine you might run 24/7.
Upgradeable RAM: Minisforum UM890 Pro
The Minisforum UM890 Pro occupies a unique position in this guide: it is the only mini PC here with user-upgradeable RAM. That makes it the cheapest path to running 70B quantized models on a mini PC — if you are willing to accept slow CPU-only inference.
- CPU: AMD Ryzen 9 8945HS, 8 cores / 16 threads, Zen 4
- RAM: 32 GB DDR5-5600 SO-DIMM (two slots, upgradeable to 96 GB)
- iGPU: Radeon 780M, 12 CUs, RDNA 3
- NPU: XDNA 1 — 16 TOPS
- Storage: 2x M.2 2280 PCIe 4.0
- Networking: 1x 2.5GbE + 1x 1GbE + WiFi 6E
- TDP: 45–70W
- Price: ~$855 (32 GB configuration)
Buy the 32 GB configuration for ~$855 and run 7B-13B models at 12-18 tok/s with Radeon 780M iGPU offloading. When you are ready for larger models, swap in two 48 GB SO-DIMMs for 96 GB total and load a 70B Q4 model for pure CPU inference. That CPU inference will run at 2-4 tok/s — genuinely slow, but functional if you are running batch processing, automated summarization, or API queries where latency tolerance is higher.
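This is the kind of latency-tolerant job I mean. A sketch that summarizes a folder of text files through the local Ollama API overnight; the model tag and directory names are placeholders:

```python
# Batch summarization against a local Ollama endpoint. At 2-4 tok/s a 70B
# model is too slow for chat, but an unattended overnight loop does not care.
# The model tag and directory names are placeholders.
from pathlib import Path
import requests

MODEL = "llama3:70b"          # any pulled 70B-class model tag
OUT_DIR = Path("summaries")
OUT_DIR.mkdir(exist_ok=True)

for doc in sorted(Path("documents").glob("*.txt")):
    text = doc.read_text()
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": MODEL,
            "prompt": f"Summarize the following document in five bullet points:\n\n{text}",
            "stream": False,
        },
        timeout=3600,  # generous timeout: slow CPU inference, long documents
    )
    resp.raise_for_status()
    (OUT_DIR / f"{doc.stem}.summary.txt").write_text(resp.json()["response"])
    print(f"done: {doc.name}")
```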
The OCuLink port is worth mentioning for AI use cases. You can connect an external GPU enclosure with a desktop-class GPU — an RTX 3060 12GB or RTX 4060 Ti 16GB — and get real GPU-accelerated inference without building a full desktop. OCuLink delivers PCIe 4.0 x4 bandwidth, which is sufficient for inference workloads (unlike training, which needs more bandwidth). This gives the UM890 Pro the best upgrade path of any machine in this guide.
The Radeon 780M is a generation behind the 890M in the SER9 Pro. With 12 CUs versus 16, and RDNA 3 versus 3.5, it is roughly 30% slower for iGPU-accelerated inference at the same model size. The 16 TOPS NPU (XDNA 1) is the oldest NPU architecture here and is even less likely to gain framework support than the newer XDNA 2.
The UM890 Pro is the right choice if you want flexibility — upgradeable RAM, OCuLink eGPU support, and an upgrade path to 96 GB for future workloads.
Budget Option: Beelink Mini S12 Pro (N100)
The Beelink Mini S12 Pro is not a serious AI inference machine — and it is currently unavailable. I included it because at ~$170 it was the cheapest way to learn the local AI toolchain — and learning the toolchain has value even before you invest in faster hardware.
- CPU: Intel N100, 4 cores / 4 threads, up to 3.4 GHz
- RAM: 16 GB DDR5-4800
- iGPU: Intel UHD (24 execution units)
- NPU: None
- TDP: 6W
- Price: ~$170
Install Ollama, download a 7B Q4 model, and start experimenting with local inference at 6-9 tok/s. That is roughly one word per second — noticeably slow but fast enough to evaluate model quality, test API integrations, and build applications that you will later deploy on faster hardware. The Intel UHD iGPU has no meaningful compute capability for inference. This is CPU-only.
The 16 GB RAM ceiling means 7B models only. A 13B Q4 model needs ~10 GB, leaving only 6 GB for the OS and runtime — it will work but with heavy memory pressure. Anything larger is a non-starter.
At 6W idle, this machine costs under $8/year to run 24/7 at US average electricity rates. If you want a dedicated Ollama endpoint that is always available for quick questions, code completion, or home automation integrations, the N100 earns its place despite the speed limitations.
What About the Beelink ME Mini?
I researched the Beelink ME Mini as a potential mid-range option, but it does not belong in an AI inference guide. The Intel N150 processor has four cores, 12 GB of LPDDR5 RAM, no NPU, and no meaningful iGPU compute. Its primary design is as a compact NAS with six M.2 slots — a storage-focused machine, not a compute-focused one. Performance would be marginally better than the N100 for inference but at a higher price point, making it a poor value for this specific use case.
If you need a compact storage box that can run lightweight 7B models on the side, the ME Mini works. But for dedicated AI inference, the SER9 Pro at ~$729 is a dramatically better investment.
The NPU Question: Honest Assessment
Every mini PC with a recent AMD or Intel processor now ships with an NPU, and marketing departments want you to believe this matters for local AI. Here is the reality as of March 2026.
What the NPU does today: Accelerates Windows Copilot+ features (live captions, image generation in Paint, Recall), video call background blur, and some image processing pipelines. These are real features that work.
What the NPU does not do today: Accelerate LLM token generation in Ollama, llama.cpp, LM Studio, vLLM, or any other mainstream inference framework. The XDNA 2 architecture in the Ryzen AI 9 HX 370 delivers 50 TOPS — impressive on paper — but no framework currently offloads the transformer attention mechanism or matrix multiplications to it.
Will this change? Probably. AMD is pushing the XDNA SDK and has demonstrated LLM inference on NPU hardware at trade shows. Intel is doing the same with AI Boost. But “demonstrated at a trade show” and “works reliably in production” are different things. I would not pay a premium for NPU TOPS today with the expectation of future LLM support.
What actually accelerates LLM inference on a mini PC: The iGPU (Radeon 890M, 780M) via Vulkan or ROCm offloading in llama.cpp, and fast memory bandwidth (LPDDR5X > DDR5 > DDR4). These work today and deliver measurable speedups.
Realistic Performance Expectations
Here is what you can actually expect from each tier, running Ollama with the llama.cpp backend at Q4_K_M quantization:
| Mini PC | Inference Method | 7B Q4 tok/s | 13B Q4 tok/s | 70B Q4 tok/s |
|---|---|---|---|---|
| MS-S1 Max (128 GB) | iGPU offload | 15-20 | 10-15 | 5-8 |
| SER9 Pro (32 GB) | iGPU offload | 18-25 | 10-14 | N/A (RAM) |
| UM890 Pro (96 GB) | CPU + iGPU | 12-18 | 8-12 | 2-4 |
| N100 (16 GB) | CPU only | 6-9 | 4-6* | N/A (RAM) |
*13B on the N100 runs under severe memory pressure and may swap.
For context: an RTX 4090 runs Llama 3 8B Q4 at 100-130 tok/s. A Mac Mini M4 Pro with 48 GB unified memory runs it at 40-60 tok/s. Mini PCs are playing in a different league — the question is whether the league they play in is fast enough for your use case.
For single-user interactive chat, 15+ tok/s feels responsive. For code completion integrations, 10+ tok/s is workable. For batch processing where latency does not matter, even 5 tok/s is acceptable. Below 5 tok/s, you are testing patience.
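To see where your own machine lands in this table, Ollama's non-streaming API response includes the counters needed to compute tok/s. A minimal sketch, assuming a local Ollama install with a pulled model:

```python
# Measure your own generation speed: Ollama's non-streaming /api/generate
# response includes eval_count (tokens generated) and eval_duration
# (nanoseconds spent generating), which is enough to compute tok/s.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3:8b", "prompt": "Explain RAID 5 in two sentences.", "stream": False},
    timeout=300,
).json()

tokens = resp["eval_count"]
seconds = resp["eval_duration"] / 1e9   # eval_duration is reported in nanoseconds
print(f"{tokens} tokens in {seconds:.1f}s -> {tokens / seconds:.1f} tok/s")
```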
My Recommendation
If your budget allows it and you want to run large models in a compact form factor, the Minisforum MS-S1 Max with 128 GB unified memory is the clear pick. Nothing else in the mini PC category can load a 70B model and generate tokens at usable speeds.
For most home lab builders, the Beelink SER9 Pro at ~$729 hits the right balance. It runs 7B-13B models at interactive speeds and doubles as a capable home server. The 32 GB RAM ceiling is a real limitation — but for the 7B-13B models that are practical for personal use, it is sufficient.
If upgradeability matters, the Minisforum UM890 Pro with its SO-DIMM slots and OCuLink port gives you a path to 96 GB RAM and eGPU acceleration. It is the most future-proof option at ~$855.
And if you just want to learn — install Ollama, experiment with prompts, build API integrations — an N100 or N150 mini PC is the cheapest way to start — the Beelink Mini S12 Pro is currently unavailable, but check our N100/N150 roundup for alternatives. Slow inference is better than no inference.
For workloads that genuinely need GPU-class performance — fine-tuning, 70B at production speeds, multiple concurrent users — a mini PC is not the right tool. See our best GPU for local LLMs guide for those use cases.
Minisforum MS-S1 Max
~$3,040
- CPU: AMD Ryzen AI Max+ 395 (16C/32T, Zen 5)
- RAM: 128 GB LPDDR5X-8000 (unified, soldered)
- iGPU: Radeon 8060S (40 CU, RDNA 3.5)
- NPU: XDNA 2 — 50 TOPS
- Storage: 2x M.2 PCIe 5.0 + half-height PCIe slot
- Networking: 2x 10GbE + USB4 v2
- TDP: 110–160W (configurable)
- Price: ~$3,040
The first mini PC that can genuinely run 70B quantized LLMs at usable speeds. 128 GB of unified memory means no VRAM bottleneck — the Radeon 8060S iGPU sees all 128 GB as addressable memory for inference. Dual 10GbE makes it viable as a shared inference server.
Beelink SER9 Pro
~$729
- CPU: AMD Ryzen AI 9 HX 370 (12C/24T, Zen 5)
- RAM: 32 GB LPDDR5X-8000 (soldered)
- iGPU: Radeon 890M (16 CU, RDNA 3.5)
- NPU: XDNA 2 — 50 TOPS
- Storage: 2x M.2 2280 PCIe 4.0
- Networking: WiFi 6 + 2.5GbE
- TDP: 28–65W
- Price: ~$729
The sweet spot for local AI on a budget. The Ryzen AI 9 HX 370 with 12 Zen 5 cores and Radeon 890M iGPU handles 7B models at 18-25 tok/s. 32 GB RAM is the ceiling — enough for 7B-13B quantized models but not larger.
Minisforum UM890 Pro
~$855
- CPU: AMD Ryzen 9 8945HS (8C/16T, Zen 4)
- RAM: 32 GB DDR5-5600 (upgradeable to 96 GB)
- iGPU: Radeon 780M (12 CU, RDNA 3)
- NPU: XDNA 1 — 16 TOPS
- Storage: 2x M.2 2280 PCIe 4.0
- Networking: 2x 2.5GbE + WiFi 6E
- TDP: 45–70W
- Price: ~$855
The only mini PC here with upgradeable RAM — max 96 GB via standard SO-DIMM slots. That means you can load 70B Q4 models into memory for CPU inference, trading speed for model size. The Radeon 780M is older but functional for iGPU offload on smaller models.
Beelink Mini S12 Pro
~$170
- CPU: Intel N100 (4C/4T, up to 3.4 GHz)
- RAM: 16 GB DDR5-4800
- iGPU: Intel UHD (24 EU)
- NPU: None
- Storage: 1x M.2 2280 PCIe 3.0
- Networking: 2x 1GbE + WiFi 6
- TDP: 6W
- Price: ~$170
A $170 test bench for local AI experimentation. The N100 runs 7B Q4 models at 6-9 tok/s via CPU — slow but functional for single-user chat. No NPU, no meaningful iGPU compute. This is about learning the tooling, not production inference.
Frequently Asked Questions
Can a mini PC actually run local LLMs?
Yes. Current AMD APU mini PCs run 7B-13B quantized models at 10-25 tok/s with iGPU offloading, which is enough for single-user chat and code completion. 70B models need 64 GB or more of addressable memory; in this guide only the MS-S1 Max (or a UM890 Pro upgraded to 96 GB, on CPU alone) can load them, at 2-8 tok/s.
Does the NPU help with LLM inference?
Not as of March 2026. Ollama, llama.cpp, LM Studio, and vLLM do not offload to the NPU; it accelerates Windows Copilot+ features instead. The iGPU and memory bandwidth are what actually speed up token generation.
How much RAM do I need for local AI inference?
Roughly 5 GB for a 7B Q4 model, 10 GB for 13B Q4, and 40 GB for 70B Q4, plus headroom for the OS and context window. 16 GB covers 7B only, 32 GB handles 7B-13B comfortably, and 64 GB or more is needed for 70B.
Is DDR5 speed important for LLM inference?
Yes. Token generation is memory-bandwidth-bound, so faster memory means faster tokens: quad-channel LPDDR5X-8000 (~256 GB/s) generates tokens roughly two to three times faster than dual-channel DDR5-5600 (~90 GB/s) on the same model and quantization.
Should I buy a mini PC or a GPU server for local AI?
A mini PC is the right call for private, single-user inference on 7B-13B models at 25-65W. For fine-tuning, 70B at production speeds, or multiple concurrent users, a machine with a discrete GPU is the better tool; see our best GPU for local LLMs guide.