
Fix CUDA Out of Memory in PyTorch: 10 Proven Solutions


If you’ve trained a model in PyTorch, you’ve hit this error: RuntimeError: CUDA out of memory. It’s the most common GPU error in deep learning. This guide covers 10 proven fixes, ranked from simplest to most advanced, so you can get back to training.

Understanding the Error

RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB. GPU 0 has a total capacity of 23.64 GiB of which 211.00 MiB is free.

This means PyTorch tried to allocate GPU memory for a tensor but the GPU didn’t have enough free VRAM. The causes are usually: batch size too large, model too large, or memory leaks from unreleased tensors.

Before you start fixing: Run nvidia-smi in your terminal to see current GPU memory usage. If another process is using VRAM, kill it first. This is the #1 cause people overlook.
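
You can also check free VRAM from inside Python. A minimal sketch using torch.cuda.mem_get_info, which returns free and total bytes for a device:

import torch

# Free and total device memory in bytes, as reported by the CUDA driver
free_bytes, total_bytes = torch.cuda.mem_get_info(0)
print(f"Free: {free_bytes / 1e9:.2f} GB of {total_bytes / 1e9:.2f} GB")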


1. Reduce Batch Size

Easiest fix | Memory savings: 30-70%

The most straightforward fix. Activation memory scales roughly linearly with batch size, so halving your batch size cuts VRAM usage substantially (model weights and optimizer state stay fixed).

Example

from torch.utils.data import DataLoader

# Before (OOM on a 24 GB GPU)
train_loader = DataLoader(dataset, batch_size=64)

# After (fits in memory)
train_loader = DataLoader(dataset, batch_size=16)

Tip: Start with batch_size=1 to confirm the model fits at all, then increase until you hit OOM. The sweet spot is usually 80-90% of your available VRAM.
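
If you'd rather automate that search, here is a rough sketch that halves the batch size until one forward/backward pass fits. It assumes a classification model and uses cross-entropy as a stand-in loss; torch.cuda.OutOfMemoryError is a RuntimeError subclass available in recent PyTorch releases.

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def find_max_batch_size(model, dataset, start=128, device="cuda"):
    # Halve the batch size until one forward/backward pass fits in VRAM
    batch_size = start
    while batch_size >= 1:
        try:
            model.zero_grad(set_to_none=True)
            loader = DataLoader(dataset, batch_size=batch_size)
            data, target = next(iter(loader))
            loss = F.cross_entropy(model(data.to(device)), target.to(device))
            loss.backward()
            return batch_size
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            batch_size //= 2
    raise RuntimeError("Model does not fit in VRAM even with batch_size=1")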

2. Enable Mixed Precision (AMP)

Most impactful | Memory savings: 30-50%

Automatic Mixed Precision (AMP) trains your model in FP16 where safe while keeping critical operations in FP32. This nearly halves memory usage with minimal accuracy loss. Available in PyTorch 1.6+.

Implementation

from torch.amp import autocast, GradScaler

scaler = GradScaler("cuda")

for data, target in dataloader:
    optimizer.zero_grad()
    with autocast(device_type="cuda"):
        output = model(data)
        loss = criterion(output, target)
    scaler.scale(loss).backward()  # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)         # unscale gradients, then run the optimizer step
    scaler.update()                # adjust the scale factor for the next iteration

Pros
  • 30-50% memory reduction
  • Often faster training (Tensor Cores)
  • Minimal code changes
Requirements
  • NVIDIA GPU with Tensor Cores (RTX 20+)
  • PyTorch 1.6 or later
  • CUDA 10.0+
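
On Ampere-class GPUs and newer (RTX 30-series, A100, etc.), you can autocast to bfloat16 instead of FP16. bf16 has the same exponent range as FP32, so the GradScaler is generally unnecessary. A minimal sketch, reusing the model, dataloader, optimizer, and criterion from the example above:

import torch
from torch.amp import autocast

for data, target in dataloader:
    optimizer.zero_grad()
    with autocast(device_type="cuda", dtype=torch.bfloat16):
        output = model(data)
        loss = criterion(output, target)
    loss.backward()   # no GradScaler needed: bf16 does not underflow like FP16
    optimizer.step()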

3. Gradient Checkpointing

Large models | Memory savings: 40-60%

Gradient checkpointing trades compute time for memory. Instead of storing all intermediate activations for the backward pass, it recomputes them on the fly. Training is ~20-30% slower but uses dramatically less VRAM.

Implementation

import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Wrap memory-heavy layers
class MyModel(nn.Module):
    def forward(self, x):
        # Activations inside heavy_block are recomputed during the backward pass
        x = checkpoint(self.heavy_block, x, use_reentrant=False)
        return self.head(x)

# For Hugging Face Transformers
model.gradient_checkpointing_enable()
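
If your model is an nn.Sequential stack, torch.utils.checkpoint.checkpoint_sequential can split it into segments and checkpoint each one. A minimal sketch with an arbitrary 24-layer stack:

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

layers = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(24)]).cuda()
x = torch.randn(8, 1024, device="cuda", requires_grad=True)

# Split the stack into 4 segments; only segment boundaries keep their activations
out = checkpoint_sequential(layers, segments=4, input=x, use_reentrant=False)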

4. Gradient Accumulation

Best of both worlds | Keeps the effective batch size large

Simulate a large batch size while only loading a small batch into VRAM at a time. Gradients accumulate over multiple forward/backward passes before the weights are updated.

Implementation

# Builds on the AMP autocast and scaler from technique #2
accumulation_steps = 4  # effective batch = 4 x batch_size

for i, (data, target) in enumerate(loader):
    with autocast(device_type="cuda"):
        # Assumes the model returns the loss directly; otherwise apply your criterion
        loss = model(data, target) / accumulation_steps
    scaler.scale(loss).backward()  # gradients accumulate across iterations
    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()


5. Clear the CUDA Cache

Quick fix | For notebooks and interactive sessions

PyTorch’s memory allocator caches GPU memory blocks for reuse. In Jupyter notebooks or interactive sessions, old tensors can linger. Clear them explicitly.

import gc
import torch

# Delete references to unused objects
del model, optimizer, outputs

# Run Python garbage collection
gc.collect()

# Release cached blocks back to CUDA
torch.cuda.empty_cache()

# Verify that memory was freed
print(torch.cuda.memory_summary())

6. Use In-place Operations

In-place operations modify tensors without creating copies, saving memory. Use them for activations where autograd compatibility allows.

import torch.nn as nn
import torch.nn.functional as F

# Instead of
x = F.relu(x)

# Use the in-place version
x = F.relu(x, inplace=True)

# Or in nn.Sequential
nn.ReLU(inplace=True)

Warning: In-place operations can break autograd in some cases. Don’t use them on tensors that require gradients for loss computation. Safe for activations between layers.
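
Tensor methods that end in an underscore are also in-place and avoid allocating a new tensor. A small sketch:

import torch

x = torch.randn(1024, 1024, device="cuda")
x.add_(1.0)        # in-place add, no extra allocation
x.clamp_(min=0.0)  # in-place clamp (here equivalent to ReLU)
x.mul_(0.5)        # in-place scale

The same autograd caveat applies: avoid in-place ops on tensors whose original values are needed for the backward pass.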

7. Offload to CPU

Move tensors you don’t need on GPU back to CPU RAM. This is especially useful for storing intermediate results, logging, or when processing large datasets.

# Move the loss to CPU before storing it for logging
train_losses.append(loss.detach().cpu().item())

# Don't keep prediction tensors on the GPU
predictions = model(batch).detach().cpu()

8. Model Quantization

For inference | Memory savings: 50-75%

Quantization reduces model weights from FP32 to INT8 or INT4, drastically cutting memory. Best suited for inference; for training, use AMP instead.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# bitsandbytes 4-bit (NF4) quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-4-Scout-17B-16E",
    quantization_config=bnb_config,
)
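
If 4-bit costs too much accuracy for your use case, bitsandbytes also supports 8-bit loading, which roughly halves memory relative to FP16. A sketch along the same lines, using the same model ID as above:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-4-Scout-17B-16E",
    quantization_config=bnb_config,
    device_map="auto",  # place layers across available GPUs (and CPU if needed)
)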

9. Fix Memory Fragmentation

Sometimes you have enough total free VRAM, but it’s fragmented into small blocks. PyTorch’s allocator can’t find a single contiguous block large enough. The CUDA allocator config can help.

# Set before running your script
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

# Or in Python, before the first CUDA allocation
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
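
On PyTorch 2.x you can also try the allocator's expandable segments mode, which tends to help when tensor sizes vary a lot between iterations:

import os

# Must be set before the first CUDA allocation in the process
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"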

10. Combine Multiple Techniques

The real power comes from stacking these techniques together. Here's what to combine based on your VRAM; a training-loop sketch that stacks several of them follows the table.

| GPU VRAM | Recommended Stack | What You Can Train |
| --- | --- | --- |
| 8 GB | AMP + small batch + gradient accumulation | ResNets, small transformers, LoRA fine-tuning |
| 12 GB | AMP + gradient checkpointing + accumulation | Vision Transformers, 7B LLM fine-tuning (QLoRA) |
| 16 GB | AMP + checkpointing | Most research models, Stable Diffusion training |
| 24 GB | AMP + standard batch sizes | Large models, 13B fine-tuning, FLUX image gen |
| 32 GB+ | AMP (optional at this level) | Most workloads without memory tricks |
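
As a concrete illustration, here is a hedged sketch of a training loop that stacks AMP, gradient accumulation, and gradient checkpointing. It assumes the model exposes a heavy_block and a head as in technique #3, and that dataloader, optimizer, and criterion already exist:

import torch
from torch.amp import autocast, GradScaler
from torch.utils.checkpoint import checkpoint

scaler = GradScaler("cuda")
accumulation_steps = 4

for i, (data, target) in enumerate(dataloader):
    data, target = data.cuda(), target.cuda()
    with autocast(device_type="cuda"):
        # Recompute the heavy block's activations during the backward pass
        features = checkpoint(model.heavy_block, data, use_reentrant=False)
        loss = criterion(model.head(features), target) / accumulation_steps
    scaler.scale(loss).backward()
    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)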

Diagnostic Cheat Sheet

Memory Debugging Commands

Check GPU memory usage

nvidia-smi
torch.cuda.memory_summary(device=0)

Check allocated vs. cached memory

torch.cuda.memory_allocated() / 1e9  # GB currently in use by tensors
torch.cuda.memory_reserved() / 1e9   # GB cached by the allocator

Monitor during training

watch -n 0.5 nvidia-smi  # live view, refreshes every 0.5 s
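
To see how close a training step gets to the limit, PyTorch tracks per-device high-water marks; a short sketch:

import torch

torch.cuda.reset_peak_memory_stats()
# ... run one training step here ...
print(f"Peak allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
print(f"Peak reserved:  {torch.cuda.max_memory_reserved() / 1e9:.2f} GB")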

Need More VRAM?

If you’re constantly hitting memory limits, it might be time for a GPU upgrade.