GPU & CUDA Errors

Troubleshoot common NVIDIA GPU and CUDA errors in deep learning. Fix CUDA out of memory, driver issues, cuDNN errors, and GPU detection problems in PyTorch and TensorFlow.

Overview

This guide covers common NVIDIA GPU errors and their solutions for deep learning workloads. Most issues fall into these categories:

  • NUMA node affinity problems
  • GPU detection and availability
  • CUDA compatibility issues

NUMA Node Error

Symptoms

  • Error message: successful NUMA node read from SysFS had negative value (-1)
  • Performance degradation in multi-GPU systems

Root Cause

On affected systems, the GPU's sysfs `numa_node` entry resets to -1 on every reboot, so the driver cannot determine the correct NUMA affinity and memory allocation suffers.

Solution

# 1) Identify the PCI ID (with domain) of your GPU
#    Example output: 0000:81:00.0
lspci -D | grep NVIDIA

# 2) Open the root crontab
sudo crontab -e

# 3) Add the following line, replacing <PCI_ID> with your GPU's PCI ID.
#    This sets the NUMA affinity to 0 for the GPU device on every reboot.
@reboot (echo 0 | tee -a "/sys/bus/pci/devices/<PCI_ID>/numa_node")

# For example, with PCI ID 0000:0b:00.0:
@reboot (echo 0 | tee -a "/sys/bus/pci/devices/0000:0b:00.0/numa_node")

# Verify the fix
nvidia-smi topo -m

:::caution[Persistent Fix Required] This is a "shallow" fix, as the NVIDIA driver is unaware of it. The crontab entry ensures the setting persists across reboots. :::

Discussion on StackOverflow


GPU Detection Issues

Verify GPU Availability

PyTorch

import torch

# Check if CUDA is available
print(f"CUDA available: {torch.cuda.is_available()}")  # Should print: True

# Get GPU name and count
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")  # e.g. GPU: NVIDIA GeForce RTX 4090
    print(f"GPU count: {torch.cuda.device_count()}")

TensorFlow

import tensorflow as tf

# List physical GPUs
gpus = tf.config.list_physical_devices('GPU')
print(f"GPUs available: {len(gpus)}")
for gpu in gpus:
    print(f"  - {gpu}")

Common Causes of GPU Not Detected

  1. Driver issues - See Driver Installation
  2. CUDA version mismatch - Verify PyTorch/TensorFlow CUDA compatibility
  3. Environment problems - Check Environment Setup
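For cause 2, the driver's supported CUDA version (shown in the `nvidia-smi` header) generally must be at least the CUDA version the framework was built against (`torch.version.cuda` in PyTorch). A hedged sketch of that comparison (the helper name is illustrative):

```python
def cuda_versions_compatible(driver_cuda: str, framework_cuda: str) -> bool:
    """Driver-supported CUDA version must be >= the framework's build version."""
    def parse(v: str) -> tuple[int, ...]:
        return tuple(int(part) for part in v.split("."))
    return parse(driver_cuda) >= parse(framework_cuda)

# e.g. driver supports CUDA 12.2, PyTorch wheel built for CUDA 12.1 -> OK
print(cuda_versions_compatible("12.2", "12.1"))  # True
# driver supports only 11.8 but the framework needs 12.1 -> mismatch
print(cuda_versions_compatible("11.8", "12.1"))  # False
```

When this check fails, either upgrade the driver or install a framework build targeting an older CUDA version.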