GPU & CUDA Errors

Troubleshoot common NVIDIA GPU and CUDA errors in deep learning. Fix CUDA out of memory, driver issues, cuDNN errors, and GPU detection problems in PyTorch and TensorFlow.

Overview

This guide covers common NVIDIA GPU errors and their solutions for deep learning workloads. Most issues fall into these categories:

  • NUMA node affinity problems
  • GPU detection and availability
  • CUDA compatibility issues

NUMA Node Error

Symptoms

  • Error message: successful NUMA node read from SysFS had negative value (-1)
  • Performance degradation in multi-GPU systems

Root Cause

On affected systems, the GPU's sysfs `numa_node` entry resets to -1 on every reboot, so the driver cannot determine the correct NUMA affinity and memory allocation suffers.

Solution

# 1) Identify the PCI ID (with domain) of your GPU
#    Example output: 0000:81:00.0
lspci -D | grep NVIDIA

# 2) Open the root crontab
sudo crontab -e

# 3) Add the following line, replacing <PCI_ID> with your GPU's PCI ID.
#    This sets the NUMA affinity to 0 for the GPU device on every reboot.
@reboot (echo 0 | tee -a "/sys/bus/pci/devices/<PCI_ID>/numa_node")

# For example, with PCI ID 0000:0b:00.0:
@reboot (echo 0 | tee -a "/sys/bus/pci/devices/0000:0b:00.0/numa_node")

# Verify the fix
nvidia-smi topo -m

:::caution[Persistent Fix Required] This is a "shallow" fix, as the NVIDIA driver is unaware of it. The crontab entry ensures the setting persists across reboots. :::

Discussion on StackOverflow


GPU Detection Issues

Verify GPU Availability

PyTorch

import torch

# Check if CUDA is available
print(f"CUDA available: {torch.cuda.is_available()}")  # Should print: True

# Get GPU name and count
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")  # e.g. GPU: NVIDIA GeForce RTX 4090
    print(f"GPU count: {torch.cuda.device_count()}")

TensorFlow

import tensorflow as tf

# List physical GPUs
gpus = tf.config.list_physical_devices('GPU')
print(f"GPUs available: {len(gpus)}")
for gpu in gpus:
    print(f"  - {gpu}")

Common Causes of GPU Not Detected

  1. Driver issues - See Driver Installation
  2. CUDA version mismatch - Verify PyTorch/TensorFlow CUDA compatibility
  3. Environment problems - Check Environment Setup
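For cause 2, the driver's supported CUDA version (shown in the `nvidia-smi` header) generally must be at least the CUDA version the framework was built against (`torch.version.cuda` in PyTorch). A hedged sketch of that comparison (the helper name is illustrative):

```python
def cuda_versions_compatible(driver_cuda: str, framework_cuda: str) -> bool:
    """Driver-supported CUDA version must be >= the framework's build version."""
    def parse(v: str) -> tuple[int, ...]:
        return tuple(int(part) for part in v.split("."))
    return parse(driver_cuda) >= parse(framework_cuda)

# e.g. driver supports CUDA 12.2, PyTorch wheel built for CUDA 12.1 -> OK
print(cuda_versions_compatible("12.2", "12.1"))  # True
# driver supports only 11.8 but the framework needs 12.1 -> mismatch
print(cuda_versions_compatible("11.8", "12.1"))  # False
```

When this check fails, either upgrade the driver or install a framework build targeting an older CUDA version.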