Out of memory errors in Ollama mean the model you are trying to run requires more VRAM than your GPU currently has available. The fix is almost always one of two things: running a smaller quantized variant of the model, or reducing the context window. This post covers both.
Check What Is Already Using Your VRAM
Before changing anything, confirm the error is actually about total capacity and not a lingering model eating up space in the background.
Run this on the host:
nvidia-smi
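If you only want the numbers, standard nvidia-smi query flags trim the output down to the memory figures and the processes holding them (shown here as a convenience, not a required step):
# Per-GPU memory summary
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv
# Which processes are holding VRAM
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv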
Look at the free memory figure. If it is lower than expected, something else is holding VRAM. Then check what Ollama itself has loaded:
ollama ps
If a model is sitting there from a previous session, stop it:
ollama stop <model_name>
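If several models are loaded, a small loop can stop them all (a sketch that assumes the model name is the first column of ollama ps output, under a one-line header):
for m in $(ollama ps | tail -n +2 | awk '{print $1}'); do ollama stop "$m"; done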
Restart the Ollama service to fully clear any fragmented allocations:
sudo systemctl restart ollama
Run nvidia-smi again. If free memory is now close to your card’s full capacity, try loading the model again before making any other changes. Many OOM errors end here.
Fix 1: Switch to a Smaller Quantization
Quantization reduces model weight precision to shrink the amount of VRAM required. When you pull a model from Ollama’s library, the default tag is usually a Q4 variant, but the size still varies significantly by model family.
If you see CUDA error: out of memory while the model is still loading, the weights themselves do not fit. The fix is to pull a more aggressively quantized variant.
For a sense of how much VRAM each quantization level requires, here is what a 7B and 70B model looks like across the common options:
| Quantization | Approx VRAM (7B) | Approx VRAM (70B) |
|---|---|---|
| FP16 | ~14 GB | ~140 GB |
| Q8_0 | ~7 GB | ~70 GB |
| Q5_K_M | ~5 GB | ~43 GB |
| Q4_K_M | ~4.2 GB | ~36 GB |
| Q3_K_M | ~3.3 GB | ~27 GB |
| Q2_K | ~2.7 GB | ~21 GB |
Q4_K_M is the practical starting point for most consumer GPUs. If that still fails, drop to Q3_K_M. The quality loss between Q4 and Q3 is measurable but acceptable for most tasks. Q2 is a last resort: the degradation is noticeable, but it runs.
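As a rough cross-check against the table, you can estimate the weight footprint yourself: parameters × bits per weight ÷ 8 gives bytes. A minimal sketch, assuming roughly 4.8 effective bits per weight for Q4_K_M (K-quants carry a little overhead above their nominal width):
PARAMS_B=8   # parameters in billions
BITS=4.8     # approx effective bits per weight for Q4_K_M
echo "$PARAMS_B $BITS" | awk '{printf "weights: ~%.1f GB\n", $1 * $2 / 8}'
Leave a gigabyte or two of headroom on top of that figure for the KV cache and CUDA overhead.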
To pull a specific quantization:
ollama pull llama3:8b-instruct-q4_K_M
For models sourced from Hugging Face, look for the GGUF tag and filter by quantization. Creators like bartowski and unsloth publish clean GGUF variants across all quantization levels. The Hugging Face GGUF model list is the fastest way to find them: filter by model name and check the available quantization tags before pulling.
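Ollama can also pull GGUF files straight from Hugging Face using the hf.co/ prefix; the repository and quantization tag below are only an example, so substitute whichever model you found:
ollama run hf.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF:Q4_K_M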
Fix 2: Reduce the Context Window
If the model weights load correctly but the error appears mid-conversation, the KV cache is the likely cause. The KV cache stores attention keys and values for every token in the current context, so the longer the context, the more VRAM it consumes, and the cost becomes substantial at high context lengths.
The error ggml_cuda_device_malloc: out of memory during a conversation points directly at this.
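To see why, a back-of-the-envelope estimate of the KV cache is 2 × layers × context length × KV heads × head dimension × bytes per element. The sketch below assumes a Llama-3-8B-style layout (32 layers, 8 KV heads, head dimension 128) with an FP16 cache, so treat the result as an approximation:
LAYERS=32; KV_HEADS=8; HEAD_DIM=128; BYTES=2   # FP16 cache
NUM_CTX=8192
echo "$LAYERS $KV_HEADS $HEAD_DIM $BYTES $NUM_CTX" | \
  awk '{printf "KV cache: ~%.1f GB\n", 2 * $1 * $5 * $2 * $3 * $4 / 1e9}'
That works out to roughly 1 GB at 8192 tokens and about 17 GB at 131072 tokens for this architecture, which is why trimming the context is such an effective fix.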
To reduce the context window for the current session, run this from inside an interactive chat:
/set parameter num_ctx 4096
To make it permanent, create a Modelfile:
FROM llama3:8b
PARAMETER num_ctx 2048
Build it:
ollama create my-model -f Modelfile
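Then use the new name in place of the original (my-model here is just the example name from the Modelfile above):
ollama run my-model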
For most use cases, such as Q&A, coding assistance, and summarising short documents, a context of 2048 to 4096 tokens is sufficient. A 128k context is only necessary for tasks that require reasoning across very long documents in a single pass, and requesting it on hardware that cannot support it will crash every time.
You can also cap the context server-wide by setting an environment variable before starting Ollama:
OLLAMA_CONTEXT_LENGTH=4096 ollama serve
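To make that cap survive reboots on a systemd-managed install, the usual approach is a service override; the variable name comes from Ollama, while the override mechanism is plain systemd:
sudo systemctl edit ollama
# In the editor that opens, add:
[Service]
Environment="OLLAMA_CONTEXT_LENGTH=4096"
# Then apply it:
sudo systemctl daemon-reload
sudo systemctl restart ollama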
The Takeaway
Out of memory errors in Ollama are caused by model weights or KV cache exceeding available VRAM. Check ollama ps first to rule out a background model holding memory. If the error appears on load, pull a Q4_K_M or Q3_K_M variant instead. If it crashes mid-conversation, reduce num_ctx to 2048 or 4096.
