Monitoring & Maintenance

Keep your AI workstation healthy with monitoring, maintenance schedules, and system health checks. Track GPU temperatures and disk space, catch issues early, and prevent hardware failures in deep learning systems.

Why Monitoring Matters

Your deep learning workstation is a significant investment. Proper monitoring and maintenance:

  • Prevents hardware failures - Catch issues before they cause damage
  • Maintains performance - Avoid gradual degradation
  • Saves money - Early detection prevents costly repairs
  • Ensures reliability - No interrupted training runs

What to Monitor

GPU Health

  • Temperature - Prevent thermal throttling and damage
  • Fan speed - Ensure adequate cooling
  • Power draw - Detect anomalies
  • Memory errors - ECC errors (on professional GPUs)
  • Clock speeds - Check for throttling

System Health

  • CPU temperature - Prevent throttling
  • RAM usage - Detect memory leaks
  • Disk usage - Avoid running out of space
  • Network - For multi-node training
  • PSU health - Power delivery issues

Training Metrics

  • GPU utilization - Is your GPU fully utilized?
  • Training speed - Samples/second throughput
  • Data loading time - Identify bottlenecks
  • Loss curves - Training progress

Monitoring Tools

Real-Time Monitoring

nvidia-smi - Built-in NVIDIA tool

# One-time check
nvidia-smi

# Continuous monitoring
watch -n 1 nvidia-smi

# Log to file
nvidia-smi --query-gpu=timestamp,temperature.gpu,utilization.gpu,memory.used \
  --format=csv --loop=10 > gpu_log.csv

nvtop - Better visualization

# Install
sudo apt install nvtop

# Run
nvtop

htop - CPU and memory

sudo apt install htop
htop

Long-Term Monitoring

Prometheus + Grafana - Industry standard

  • Collect metrics over time
  • Beautiful dashboards
  • Alerting on issues
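
One common way to feed GPU metrics into Prometheus is NVIDIA's DCGM exporter. A minimal sketch, assuming Docker and the NVIDIA Container Toolkit are installed (the image tag is illustrative; check NVIDIA's registry for current versions):

# Run the DCGM exporter, exposing metrics on port 9400
docker run -d --gpus all -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:latest

# Verify metrics, then point a Prometheus scrape job at this endpoint
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_TEMP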

TensorBoard - For ML metrics

  • Built into TensorFlow
  • PyTorch integration available
  • Visualize training progress
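
A minimal way to launch it, assuming your training code writes event files to a runs/ directory (PyTorch's SummaryWriter default):

pip install tensorboard
tensorboard --logdir runs --bind_all  # --bind_all exposes it beyond localhost

# Access at http://workstation-ip:6006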

Weights & Biases - Cloud-based

  • Experiment tracking
  • Hardware monitoring
  • Team collaboration
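
Getting started takes two commands; during runs, W&B logs system metrics (GPU utilization, temperature, power) automatically alongside your training metrics:

pip install wandb
wandb login  # paste the API key from https://wandb.ai/authorize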

Maintenance Schedule

Daily (Automated)

  • GPU temperature check
  • Disk space monitoring
  • Training job status
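
One lightweight way to automate these is a daily cron job that emails a short health report. A sketch, assuming a working mail command (e.g. from mailutils); the address is a placeholder:

# crontab -e: email a health report every morning at 08:00
0 8 * * * { nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,memory.used --format=csv; df -h /; } | mail -s "Daily workstation report" you@example.com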

Weekly

  • Review temperature logs
  • Check for driver updates
  • Clean browser cache/temp files
  • Review system logs for errors

Monthly

  • Physical dust cleaning
  • Check thermal paste (if temps rising)
  • Verify all fans working
  • Update system packages
  • Review power consumption

Quarterly

  • Deep clean (blow out dust thoroughly)
  • Check all cable connections
  • Update firmware (motherboard, GPU)
  • Benchmark and compare to baseline
  • Review and optimize storage

Yearly

  • Replace thermal paste
  • Check PSU health
  • Review warranty status
  • Plan upgrades if needed

Critical Thresholds

GPU Temperature

Ideal:

  • Idle: Below 50°C
  • Training: 60-75°C
  • Max acceptable: 80°C

Action required:

  • 80-85°C: Check cooling, clean dust
  • Above 85°C: Stop training, investigate immediately
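
A quick way to see where each card currently sits relative to these thresholds:

nvidia-smi --query-gpu=index,name,temperature.gpu --format=csv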

GPU Power Draw

Normal:

  • RTX 4090: 350-450W
  • RTX 4080: 250-320W
  • A100: 250-400W

Concerning:

  • Constant max power with low utilization (possible inefficiency)
  • Fluctuating wildly (unstable workload)
  • Lower than expected (throttling)
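
To watch draw against the board's configured limit (refreshes every 5 seconds):

nvidia-smi --query-gpu=power.draw,power.limit --format=csv --loop=5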

Fan Speed

Healthy curve:

  • Below 60°C: 30-40% fan speed
  • 60-70°C: 50-60% fan speed
  • 70-80°C: 70-80% fan speed
  • Above 80°C: 90-100% fan speed

Concerning:

  • Fans at 100% constantly (cooling issue)
  • Fans not spinning up (fan failure or curve issue)
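
To correlate fan speed with temperature over time:

# Note: fan.speed reports [N/A] on passively cooled datacenter cards
nvidia-smi --query-gpu=fan.speed,temperature.gpu --format=csv --loop=5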

Remote Monitoring

SSH Access

Enable SSH server:

sudo apt install openssh-server
sudo systemctl enable ssh
sudo systemctl start ssh

Monitor remotely:

# SSH into machine
ssh user@workstation-ip

# Check GPUs
nvidia-smi

# Check system
htop
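
Password logins work, but key-based authentication is safer for an always-reachable machine. A minimal setup, run from your client machine (hostname is a placeholder):

# Generate a key pair once on your laptop/desktop
ssh-keygen -t ed25519

# Copy the public key to the workstation
ssh-copy-id user@workstation-ip

# Optional hardening: set "PasswordAuthentication no" in /etc/ssh/sshd_config
# on the workstation, then restart with: sudo systemctl restart ssh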

Web Dashboards

Netdata - Real-time monitoring

bash <(curl -Ss https://my-netdata.io/kickstart.sh)

# Access at http://workstation-ip:19999

Glances - Web UI for system stats

pip install 'glances[web]'  # quotes prevent shell glob expansion
glances -w  # Access at http://workstation-ip:61208

Mobile Monitoring

  • Termux (Android) - SSH from phone
  • Blink (iOS) - SSH client

Set up alerts:

# Send email on high temperature
# (Configure with cron)
# Take the hottest GPU on multi-GPU systems
temp=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits | sort -n | tail -1)
if [ "$temp" -gt 85 ]; then
    echo "GPU temp high: ${temp}°C" | mail -s "Alert" you@example.com
fi

Preventive Maintenance

Dust Management

Signs of dust buildup:

  • Rising temperatures over time
  • Louder fans
  • More frequent thermal throttling

Cleaning:

  • Use compressed air
  • Hold fans in place while blowing (over-spinning them can damage bearings)
  • Clean monthly if in dusty environment

Thermal Paste Replacement

When to replace:

  • Every 12-18 months
  • If temps rising 10°C+ from baseline
  • After moving system

For GPUs:

  • Usually voids the warranty
  • Do this only if the card is out of warranty
  • Use quality paste (Thermal Grizzly, Noctua)

Alerts & Automation

Temperature Alerts

Script to monitor and alert:

#!/bin/bash
# save as gpu_temp_alert.sh

THRESHOLD=85
EMAIL="you@example.com"

while true; do
    # Hottest GPU on multi-GPU systems
    temp=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits | sort -n | tail -1)

    if [ "$temp" -gt "$THRESHOLD" ]; then
        echo "GPU temperature critical: ${temp}°C" | mail -s "GPU Alert" "$EMAIL"
        # Optional: shutdown
        # sudo shutdown -h now
    fi

    sleep 60
done

Run on startup:

# Make it executable, then add to crontab
chmod +x /path/to/gpu_temp_alert.sh
crontab -e

# Add line:
@reboot /path/to/gpu_temp_alert.sh &

Disk Space Alerts

# Check disk usage
df -h

# Alert if >90% full (address is a placeholder; run via cron)
EMAIL="you@example.com"
usage=$(df / | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$usage" -gt 90 ]; then
    echo "Disk usage critical: ${usage}%" | mail -s "Disk Alert" "$EMAIL"
fi

Troubleshooting

High Temperatures

Causes:

  1. Dust buildup
  2. Poor airflow
  3. Ambient temperature high
  4. Thermal paste dried out
  5. Fan failure

Solutions:

  1. Clean system
  2. Improve case airflow
  3. Underclock GPU slightly
  4. Replace thermal paste
  5. Replace fans

Performance Degradation

Symptoms:

  • Training slower than before
  • Lower GPU utilization
  • Longer iteration times

Causes:

  1. Thermal throttling
  2. Driver issues
  3. Background processes
  4. Storage degradation

Diagnostics:

# Check throttling
nvidia-smi -q -d PERFORMANCE

# Monitor clocks
nvidia-smi -q -d CLOCK

# Check for background processes
htop

Crashes During Training

Potential causes:

  1. Power delivery issues (PSU)
  2. Overheating
  3. RAM errors
  4. Driver bugs
  5. Overclocking instability

Solutions:

  1. Test with different PSU
  2. Monitor temps closely
  3. Run memtest86+
  4. Update/downgrade drivers
  5. Reset to stock clocks
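
After a crash, the kernel log usually shows what happened; NVIDIA driver faults are reported as Xid errors. A quick check:

# NVIDIA driver errors (Xid codes) since boot
sudo dmesg | grep -i xid

# Kernel and driver messages from the previous boot
journalctl -k -b -1 | grep -i nvrm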

Data Backup

Critical data to backup:

  • Model checkpoints
  • Training configs
  • Custom code
  • Processed datasets (if expensive to regenerate)

Backup strategies:

# Local backup to external drive
rsync -av --progress /path/to/models /mnt/external/backup/

# Cloud backup (rclone to Google Drive, S3, etc.)
# Note: 'sync' makes the destination mirror the source (deletions propagate);
# use 'rclone copy' to only add files
rclone sync /path/to/models remote:backup/

# Automated daily backup
# Add to crontab
0 2 * * * rsync -av /path/to/models /mnt/external/backup/

Monitoring Stack Setup

Recommended setup:

  1. Basic: nvidia-smi + htop
  2. Intermediate: nvtop + Netdata web dashboard
  3. Advanced: Prometheus + Grafana + TensorBoard

Next Steps

  • Monitor GPU health with nvidia-smi and nvtop
  • Set up temperature alerts (see scripts above)
  • Enable remote access via SSH
  • Configure web dashboards for long-term tracking

:::tip[Start Simple]
Begin with basic monitoring (nvidia-smi, htop) and add more sophisticated tools as needed. Don’t over-engineer your monitoring setup initially.
:::