Out of memory errors in Ollama mean the model you are trying to run requires more VRAM than your GPU currently has available. The fix is almost always one of two things: running a smaller quantized variant of the model, or reducing the context window. This post covers both.
Check What Is Already Using Your VRAM
Before changing anything, confirm the error is actually about total capacity and not a lingering model eating up space in the background.
Run this on the host:
nvidia-smi
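If you only want the numbers, standard nvidia-smi query flags trim the output down to the memory figures and the processes holding them (shown here as a convenience, not a required step):
# Per-GPU memory summary
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv
# Which processes are holding VRAM
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv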
Look at the free memory figure. If it is lower than expected, something else is holding VRAM. Then check what Ollama itself has loaded:
ollama ps
If a model is sitting there from a previous session, stop it:
ollama stop <model_name>
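If several models are loaded, a small loop can stop them all (a sketch that assumes the model name is the first column of ollama ps output, under a one-line header):
for m in $(ollama ps | tail -n +2 | awk '{print $1}'); do ollama stop "$m"; done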
Restart the Ollama service to fully clear any fragmented allocations:
sudo systemctl restart ollama
Run nvidia-smi again. If free memory is now close to your card’s full capacity, try loading the model again before making any other changes. Many OOM errors end here.
Fix 1: Switch to a Smaller Quantization
Quantization reduces model weight precision to shrink the amount of VRAM required. When you pull a model from Ollama’s library, the default tag is usually a Q4 variant, but the size still varies significantly by model family.
If you see CUDA error: out of memory while the model is still loading, the weights themselves do not fit. The fix is to pull a more aggressively quantized variant.
For a sense of how much VRAM each quantization level requires, here is what a 7B and 70B model looks like across the common options:
| Quantization | Approx VRAM (7B) | Approx VRAM (70B) |
|---|---|---|
| FP16 | ~14 GB | ~140 GB |
| Q8_0 | ~7 GB | ~70 GB |
| Q5_K_M | ~5 GB | ~43 GB |
| Q4_K_M | ~4.2 GB | ~36 GB |
| Q3_K_M | ~3.3 GB | ~27 GB |
| Q2_K | ~2.7 GB | ~21 GB |
Q4_K_M is the practical starting point for most consumer GPUs. If that still fails, drop to Q3_K_M. The quality loss between Q4 and Q3 is measurable but acceptable for most tasks. Q2 is a last resort: the degradation is noticeable, but it runs.
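As a rough cross-check against the table, you can estimate the weight footprint yourself: parameters × bits per weight ÷ 8 gives bytes. A minimal sketch, assuming roughly 4.8 effective bits per weight for Q4_K_M (K-quants carry a little overhead above their nominal width):
PARAMS_B=8   # parameters in billions
BITS=4.8     # approx effective bits per weight for Q4_K_M
echo "$PARAMS_B $BITS" | awk '{printf "weights: ~%.1f GB\n", $1 * $2 / 8}'
Leave a gigabyte or two of headroom on top of that figure for the KV cache and CUDA overhead.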
To pull a specific quantization:
ollama pull llama3:8b-instruct-q4_K_M
For models sourced from Hugging Face, look for the GGUF tag and filter by quantization. Creators like bartowski and unsloth publish clean GGUF variants across all quantization levels. The Hugging Face GGUF model list is the fastest way to find them: filter by model name and check the available quantization tags before pulling.
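Ollama can also pull GGUF files straight from Hugging Face using the hf.co/ prefix; the repository and quantization tag below are only an example, so substitute whichever model you found:
ollama run hf.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF:Q4_K_M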
Fix 2: Reduce the Context Window
If the model weights load correctly but the error appears mid-conversation, the KV cache is the likely cause. The KV cache stores attention keys and values for every token in the current context, so the longer the context, the more VRAM it consumes, and the cost becomes substantial at high context lengths.
The error ggml_cuda_device_malloc: out of memory during a conversation points directly at this.
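To see why, a back-of-the-envelope estimate of the KV cache is 2 × layers × context length × KV heads × head dimension × bytes per element. The sketch below assumes a Llama-3-8B-style layout (32 layers, 8 KV heads, head dimension 128) with an FP16 cache, so treat the result as an approximation:
LAYERS=32; KV_HEADS=8; HEAD_DIM=128; BYTES=2   # FP16 cache
NUM_CTX=8192
echo "$LAYERS $KV_HEADS $HEAD_DIM $BYTES $NUM_CTX" | \
  awk '{printf "KV cache: ~%.1f GB\n", 2 * $1 * $5 * $2 * $3 * $4 / 1e9}'
That works out to roughly 1 GB at 8192 tokens and about 17 GB at 131072 tokens for this architecture, which is why trimming the context is such an effective fix.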
To reduce the context window for the current session, run this from inside an interactive chat:
/set parameter num_ctx 4096
To make it permanent, create a Modelfile:
FROM llama3:8b
PARAMETER num_ctx 2048
Build it:
ollama create my-model -f Modelfile
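Then use the new name in place of the original (my-model here is just the example name from the Modelfile above):
ollama run my-model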
For most use cases, such as Q&A, coding assistance, and summarising short documents, a context of 2048 to 4096 tokens is sufficient. A 128k context is only necessary for tasks that require reasoning across very long documents in a single pass, and requesting it on hardware that cannot support it will crash every time.
You can also cap the context server-wide by setting an environment variable before starting Ollama:
OLLAMA_CONTEXT_LENGTH=4096 ollama serve
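To make that cap survive reboots on a systemd-managed install, the usual approach is a service override; the variable name comes from Ollama, while the override mechanism is plain systemd:
sudo systemctl edit ollama
# In the editor that opens, add:
[Service]
Environment="OLLAMA_CONTEXT_LENGTH=4096"
# Then apply it:
sudo systemctl daemon-reload
sudo systemctl restart ollama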
The Takeaway
Out of memory errors in Ollama are caused by model weights or KV cache exceeding available VRAM. Check ollama ps first to rule out a background model holding memory. If the error appears on load, pull a Q4_K_M or Q3_K_M variant instead. If it crashes mid-conversation, reduce num_ctx to 2048 or 4096.
