GPU & CUDA Errors
Troubleshoot common NVIDIA GPU and CUDA errors in deep learning. Fix CUDA out of memory, driver issues, cuDNN errors, and GPU detection problems in PyTorch and TensorFlow.
Overview
This guide covers common NVIDIA GPU errors and their solutions for deep learning workloads. Most issues fall into these categories:
- NUMA node affinity problems
- GPU detection and availability
- CUDA compatibility issues
NUMA Node Error
Symptoms
- Error message:
successful NUMA node read from SysFS had negative value (-1)
- Performance degradation in multi-GPU systems
Root Cause
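The GPU's numa_node entry in sysfs reads -1, and any value written to it is lost on reboot, so the setting reverts to -1 each time the system restarts. This can cause memory allocation issues.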
Solution
# 1) Identify the PCI ID (with domain) of your GPU
# For example: PCI_ID="0000:81:00.0"
lspci -D | grep NVIDIA
# 2) Add a crontab entry for root
sudo crontab -e
# Add the following line, replacing <PCI_ID> with your own value.
# It sets the NUMA affinity of the GPU device to 0 on every reboot.
@reboot (echo 0 | tee -a "/sys/bus/pci/devices/<PCI_ID>/numa_node")
# Keep in mind that this is only a "shallow" fix, as the NVIDIA driver is unaware of it.
# Your PCI ID will differ from machine to machine, so substitute your own.
# For example, with PCI ID 0000:0b:00.0:
@reboot (echo 0 | tee -a "/sys/bus/pci/devices/0000:0b:00.0/numa_node")
# Verify the fix
nvidia-smi topo -m
:::caution[Persistent Fix Required]
This is a "shallow" fix, as the NVIDIA driver is unaware of it. The crontab entry reapplies the setting after every reboot.
:::
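To confirm which devices are affected before and after applying the crontab entry, the sysfs values can also be read directly. The following is a minimal sketch, not part of the original fix: the function name `nvidia_numa_nodes` and its `sysfs_root` parameter are hypothetical, and it assumes a Linux system where PCI devices expose `vendor` and `numa_node` files under `/sys/bus/pci/devices`.

```python
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA


def nvidia_numa_nodes(sysfs_root="/sys/bus/pci/devices"):
    """Map each NVIDIA PCI device ID to the NUMA node recorded in sysfs.

    Hypothetical helper for illustration. Like `lspci | grep NVIDIA`, it
    matches every NVIDIA function, so a GPU's audio controller may also appear.
    """
    nodes = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return nodes  # not Linux, or sysfs is not mounted here
    for device in sorted(root.iterdir()):
        vendor_file = device / "vendor"
        numa_file = device / "numa_node"
        if not (vendor_file.is_file() and numa_file.is_file()):
            continue
        if vendor_file.read_text().strip().lower() != NVIDIA_VENDOR_ID:
            continue
        nodes[device.name] = int(numa_file.read_text().strip())
    return nodes


if __name__ == "__main__":
    for pci_id, node in nvidia_numa_nodes().items():
        status = "OK" if node >= 0 else "unset (-1)"
        print(f"{pci_id}: numa_node={node} [{status}]")
```

A device reported as unset is a candidate for the crontab entry above. After a reboot with the entry in place, the same device should report node 0.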
Discussion on StackOverflow
GPU Detection Issues
Verify GPU Availability
import torch
# Check if CUDA is available
print(f"CUDA available: {torch.cuda.is_available()}")
# Should return: True
# Get GPU name
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU count: {torch.cuda.device_count()}")
# Example output: GPU: NVIDIA GeForce RTX 4090

import tensorflow as tf
# List physical GPUs
gpus = tf.config.list_physical_devices('GPU')
print(f"GPUs available: {len(gpus)}")
for gpu in gpus:
    print(f" - {gpu}")
Common Causes of GPU Not Detected
- Driver issues - See Driver Installation
- CUDA version mismatch - Verify PyTorch/TensorFlow CUDA compatibility
- Environment problems - Check Environment Setup
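The three causes above can be narrowed down by collecting the relevant facts in one place. The sketch below is one possible approach rather than an official diagnostic: `collect_gpu_diagnostics` is a hypothetical helper, and it assumes `nvidia-smi` is on `PATH` when a driver is installed. Frameworks that are not installed are reported as `None` instead of raising.

```python
import os
import shutil
import subprocess


def collect_gpu_diagnostics():
    """Gather driver, framework, and environment facts about GPU visibility."""
    # Environment: an empty or wrong CUDA_VISIBLE_DEVICES hides GPUs from frameworks.
    info = {"CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES")}

    # Driver: query nvidia-smi if it is installed.
    info["nvidia_smi"] = None
    if shutil.which("nvidia-smi"):
        try:
            result = subprocess.run(
                ["nvidia-smi", "--query-gpu=name,driver_version",
                 "--format=csv,noheader"],
                capture_output=True, text=True, check=False, timeout=30,
            )
            info["nvidia_smi"] = result.stdout.strip() or result.stderr.strip()
        except (OSError, subprocess.TimeoutExpired) as exc:
            info["nvidia_smi"] = f"failed: {exc}"

    # PyTorch: torch.version.cuda is None for CPU-only builds.
    try:
        import torch
        info["torch"] = {
            "version": torch.__version__,
            "built_with_cuda": torch.version.cuda,
            "cuda_available": torch.cuda.is_available(),
            "device_count": torch.cuda.device_count(),
        }
    except ImportError:
        info["torch"] = None

    # TensorFlow: the build info carries a CUDA version only for GPU builds.
    try:
        import tensorflow as tf
        info["tensorflow"] = {
            "version": tf.__version__,
            "built_with_cuda": tf.sysconfig.get_build_info().get("cuda_version"),
            "gpu_count": len(tf.config.list_physical_devices("GPU")),
        }
    except ImportError:
        info["tensorflow"] = None

    return info


if __name__ == "__main__":
    for key, value in collect_gpu_diagnostics().items():
        print(f"{key}: {value}")
```

Reading the output: if `nvidia_smi` is `None` or shows an error, suspect the driver. If the driver reports a GPU but `built_with_cuda` is `None`, the framework is likely a CPU-only build. If both look right but no device is visible, check `CUDA_VISIBLE_DEVICES` and the rest of the environment.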
Related Resources
- Driver Installation Guide - Install or update NVIDIA drivers
- GPU Memory Management - Handle OOM errors
- Multi-GPU Setup - Configure multiple GPUs