CPU inference, running AI models on a processor rather than a graphics card, is not only possible but practical for a wide range of real tasks. The assumption that AI requires a GPU comes from training, and it does not fully carry over to inference.
Why CPUs Can Work for AI Inference
The reason GPUs dominate AI is parallelism: they can perform thousands of operations simultaneously, which is essential during training, when gradients must be calculated across billions of parameters at once. Inference is a different workload. You are not training; you are running a forward pass through a fixed model, one token at a time.
That workload is memory-bound, not compute-bound. The bottleneck is not how fast your processor can calculate; modern CPUs handle that comfortably. It is how quickly data can move from RAM to the processor. A 7B model at 4-bit quantization sits at around 3.5GB in RAM, and every token generated requires streaming those weights through the CPU. You are always waiting on memory, not computation. GPUs have faster memory and still come out ahead, but for single-stream generation both are bandwidth-limited, so the gap is far narrower than raw compute specs suggest.
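To see the ceiling this implies, divide memory bandwidth by model size. The sketch below does that arithmetic; the bandwidth figures are illustrative assumptions for typical hardware, not measurements.

```python
# Back-of-the-envelope ceiling for single-stream generation:
#   tokens/sec <= memory_bandwidth / bytes_streamed_per_token
# where bytes_streamed_per_token is roughly the quantized model size.
# Bandwidth figures below are illustrative assumptions, not measurements.

model_bytes = 7e9 * 0.5  # 7B parameters at 4-bit quantization ~ 3.5 GB

for name, bandwidth_gb_s in [
    ("dual-channel DDR5 desktop", 80),
    ("8-channel workstation/server", 300),
    ("high-end consumer GPU (GDDR6X)", 1000),
]:
    ceiling = bandwidth_gb_s * 1e9 / model_bytes
    print(f"{name}: <= {ceiling:.0f} tokens/sec")
```

The GPU still wins, but the CPU ceiling sits well above reading speed, and it scales with memory channels rather than core count.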
What You Can Realistically Run
Performance depends on model size, quantization level, and how many CPU cores and how much RAM you have. The table below reflects what a modern 16- to 24-core machine delivers at Q4 quantization.
| Model Size | RAM Needed | Performance | Best For |
|---|---|---|---|
| 1B–3B | 4–8GB | 70–125 tokens/sec | Chat, summarisation, quick tasks |
| 4B–8B | 8–16GB | 40–60 tokens/sec | Reasoning, instruction following |
| 12B–15B | 16–32GB | 23–25 tokens/sec | Complex tasks, longer context |
| 70B+ | 35GB+ | 5–10 tokens/sec | Feasible but slow for interactive use |
The sweet spot is models under 8B parameters. At 40 to 60 tokens per second, output arrives faster than most people read, which makes interactive use (chat, coding assistance, document summarisation) entirely usable. Larger models are better suited to batch processing or async tasks where latency matters less.
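As a rough rule of thumb consistent with the table, a Q4 model's in-memory footprint works out to about 4.5 bits per parameter plus runtime overhead. The constants in this sketch are assumptions, and the table's RAM column adds headroom for the OS and context on top of this.

```python
def estimate_ram_gb(params_billion: float,
                    bits_per_weight: float = 4.5,
                    overhead: float = 1.2) -> float:
    """Rough in-memory footprint of a quantized model.

    bits_per_weight ~4.5 approximates Q4 GGUF (4-bit weights plus
    per-block scales); overhead covers the KV cache and runtime
    buffers. Both constants are assumptions, not exact figures.
    """
    weight_gb = params_billion * bits_per_weight / 8  # billions of params -> GB
    return weight_gb * overhead

for size in [3, 8, 14, 70]:
    print(f"{size}B model: ~{estimate_ram_gb(size):.1f} GB")
```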
The Software That Makes It Work
CPU inference was impractical until recently, not because of hardware limitations but because frameworks treated CPU support as an afterthought. That has changed.
llama.cpp is where CPU inference became mainstream. A C++ implementation optimised for memory efficiency, it introduced the GGUF model format that is now the standard for quantized local models. It runs on any hardware, requires no GPU, and is what tools like Ollama are built on top of.
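You can drive llama.cpp directly from Python via the llama-cpp-python bindings. A minimal sketch follows; the model filename is a placeholder for whatever GGUF you have downloaded, and the thread count should match your physical cores.

```python
# Minimal llama.cpp example via the llama-cpp-python bindings.
# The model path is a placeholder; any Q4 GGUF file works.
from llama_cpp import Llama

llm = Llama(
    model_path="./mistral-7b-instruct-q4_k_m.gguf",  # hypothetical local file
    n_ctx=4096,    # context window
    n_threads=8,   # set to your physical core count
)

out = llm("Summarise the case for CPU inference in one sentence.",
          max_tokens=64)
print(out["choices"][0]["text"])
```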
Ollama wraps llama.cpp in a simple interface: one command to pull a model, one to run it. It is the easiest starting point for anyone who wants CPU inference without configuration overhead.
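In practice that looks like `ollama pull llama3.2` followed by `ollama run llama3.2`. The same model is also reachable programmatically; here is a minimal sketch using the official `ollama` Python package, assuming the server is running locally and the model name is one you have pulled.

```python
# Minimal sketch using the official `ollama` Python package.
# Assumes the Ollama server is running locally and `llama3.2`
# (an example model name) has already been pulled.
import ollama

response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain quantization briefly."}],
)
print(response["message"]["content"])
```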
Where CPUs Fall Short
CPU inference has real limits worth being clear about. Training and fine-tuning remain GPU territory; the backward pass and gradient updates during training benefit from GPU parallelism in ways that inference does not, and CPUs are orders of magnitude slower for that workload.
High-concurrency production serving is also a GPU strength. If you need to handle hundreds of simultaneous queries with sub-100ms latency, a GPU cluster is the right tool. CPUs handle moderate traffic well, but do not scale the same way under load.
Image generation, video, and audio synthesis are GPU-native workloads. CPU implementations exist but are slow enough to be impractical for most use cases.
The Bottom Line
For inference on models up to around 8B parameters, a modern CPU with sufficient RAM is a legitimate option, not a compromise you tolerate, but a deliberate choice that costs less, runs privately on hardware you already own, and works without any GPU at all. The software has caught up to make it practical, and quantization has made the models small enough to fit. The use cases where a GPU is genuinely necessary are real, but they are narrower than the conventional wisdom suggests.
