Monitoring & Maintenance
Keep your AI workstation healthy with monitoring, maintenance schedules, and system health checks. Track GPU temperatures and disk space, and prevent hardware failures in deep learning systems.
Why Monitoring Matters
Your deep learning workstation is a significant investment. Proper monitoring and maintenance:
- Prevents hardware failures - Catch issues before they cause damage
- Maintains performance - Avoid gradual degradation
- Saves money - Early detection prevents costly repairs
- Ensures reliability - No interrupted training runs
What to Monitor
GPU Health
- Temperature - Prevent thermal throttling and damage
- Fan speed - Ensure adequate cooling
- Power draw - Detect anomalies
- Memory errors - ECC errors (on professional GPUs)
- Clock speeds - Check for throttling
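All of these fields are exposed through nvidia-smi's query interface, so a quick spot-check is one command away (run nvidia-smi --help-query-gpu to confirm the field names supported by your driver):
# Core GPU health fields in one line
nvidia-smi --query-gpu=temperature.gpu,fan.speed,power.draw,clocks.sm --format=csv
# ECC error counters (professional GPUs only)
nvidia-smi -q -d ECC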
System Health
- CPU temperature - Prevent throttling
- RAM usage - Detect memory leaks
- Disk usage - Avoid running out of space
- Network - For multi-node training
- PSU health - Power delivery issues
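A few quick shell checks cover most of the above on Ubuntu/Debian systems (CPU temperatures assume the lm-sensors package is installed and configured):
# CPU and motherboard temperatures (sudo apt install lm-sensors, then run sensors-detect once)
sensors
# RAM usage
free -h
# Disk usage per filesystem
df -h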
Training Metrics
- GPU utilization - Is your GPU fully utilized?
- Training speed - Samples/second throughput
- Data loading time - Identify bottlenecks
- Loss curves - Training progress
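GPU utilization is the easiest of these to watch from the shell; for example, nvidia-smi's device-monitoring mode streams utilization and memory stats while a job runs:
# Stream utilization and memory usage once per second
nvidia-smi dmon -s um -d 1
# Or log utilization to CSV for later comparison against training throughput
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory --format=csv --loop=5 > util_log.csv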
Monitoring Tools
Real-Time Monitoring
nvidia-smi - Built-in NVIDIA tool
# One-time check
nvidia-smi
# Continuous monitoring
watch -n 1 nvidia-smi
# Log to file
nvidia-smi --query-gpu=timestamp,temperature.gpu,utilization.gpu,memory.used \
--format=csv --loop=10 > gpu_log.csv
nvtop - Better visualization
# Install
sudo apt install nvtop
# Run
nvtop
htop - CPU and memory
sudo apt install htop
htop
Long-Term Monitoring
Prometheus + Grafana - Industry standard
- Collect metrics over time
- Beautiful dashboards
- Alerting on issues
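A common way to feed GPU metrics into Prometheus is NVIDIA's dcgm-exporter, run as a container and scraped on port 9400. A rough sketch only; the image tag below is a placeholder, so check NVIDIA's NGC catalog for a current version:
# Requires Docker and the NVIDIA Container Toolkit; pin a real image tag
docker run -d --gpus all -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:<tag>
# Verify metrics are exposed, then add localhost:9400 as a scrape target in prometheus.yml
curl localhost:9400/metrics | head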
TensorBoard - For ML metrics
- Built into TensorFlow
- PyTorch integration available
- Visualize training progress
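If your training code already writes TensorBoard logs (e.g. via torch.utils.tensorboard), the dashboard can be served from the workstation and viewed from any machine on the network; the log directory below is just an example:
# Serve TensorBoard on all interfaces (assumes logs in ./runs)
tensorboard --logdir runs --host 0.0.0.0 --port 6006
# Browse to http://workstation-ip:6006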
Weights & Biases - Cloud-based
- Experiment tracking
- Hardware monitoring
- Team collaboration
Maintenance Schedule
Daily (Automated)
- GPU temperature check
- Disk space monitoring
- Training job status
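These daily checks are easy to automate with cron; a minimal sketch that appends to log files you skim later (paths and times are just examples):
# crontab -e
# 08:00 - log GPU temperature and utilization
0 8 * * * nvidia-smi --query-gpu=timestamp,temperature.gpu,utilization.gpu --format=csv,noheader >> $HOME/logs/gpu_daily.csv
# 08:05 - record disk usage
5 8 * * * df -h >> $HOME/logs/disk_daily.log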
Weekly
- Review temperature logs
- Check for driver updates
- Clean browser cache/temp files
- Review system logs for errors
Monthly
- Physical dust cleaning
- Check thermal paste (if temps rising)
- Verify all fans working
- Update system packages
- Review power consumption
Quarterly
- Deep clean (blow out dust thoroughly)
- Check all cable connections
- Update firmware (motherboard, GPU)
- Benchmark and compare to baseline
- Review and optimize storage
Yearly
- Replace thermal paste
- Check PSU health
- Review warranty status
- Plan upgrades if needed
Critical Thresholds
GPU Temperature
Ideal:
- Idle: Below 50°C
- Training: 60-75°C
- Max acceptable: 80°C
Action required:
- 80-85°C: Check cooling, clean dust
- Above 85°C: Stop training, investigate immediately
GPU Power Draw
Normal:
- RTX 4090: 350-450W
- RTX 4080: 250-320W
- A100: 250-400W
Concerning:
- Constant max power (possible inefficiency)
- Fluctuating wildly (unstable workload)
- Lower than expected (throttling)
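To see how close the card runs to its configured limit, poll the draw and the enforced limit together:
# Power draw vs. power limit, refreshed every 5 seconds
nvidia-smi --query-gpu=power.draw,power.limit --format=csv --loop=5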
Fan Speed
Healthy curve:
- Below 60°C: 30-40% fan speed
- 60-70°C: 50-60% fan speed
- 70-80°C: 70-80% fan speed
- Above 80°C: 90-100% fan speed
Concerning:
- Fans at 100% constantly (cooling issue)
- Fans not spinning up (fan failure or curve issue)
Remote Monitoring
SSH Access
Enable SSH server:
sudo apt install openssh-server
sudo systemctl enable ssh
sudo systemctl start ssh
Monitor remotely:
# SSH into machine
ssh user@workstation-ip
# Check GPUs
nvidia-smi
# Check system
htop
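Once SSH works, checks can also run non-interactively from your laptop, which is handy for quick status pings or scripting:
# One-shot remote GPU check without opening a shell
ssh user@workstation-ip 'nvidia-smi --query-gpu=temperature.gpu,utilization.gpu --format=csv,noheader'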
Web Dashboards
Netdata - Real-time monitoring
bash <(curl -Ss https://my-netdata.io/kickstart.sh)
# Access at http://workstation-ip:19999
Glances - Web UI for system stats
pip install 'glances[web]'  # quotes keep shells like zsh from expanding the brackets
glances -w # Access at http://workstation-ip:61208
Mobile Monitoring
Termux (Android) - SSH from your phone
Blink (iOS) - SSH client for iPhone/iPad
Set up alerts:
# Send email on high temperature
# (schedule with cron; requires a configured mail command, e.g. mailutils)
temp=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits | head -n1)
if [ "$temp" -gt 85 ]; then
echo "GPU temp high: ${temp}°C" | mail -s "Alert" [email protected]
fi
Preventive Maintenance
Dust Management
Signs of dust buildup:
- Rising temperatures over time
- Louder fans
- More frequent thermal throttling
Cleaning:
- Use compressed air
- Hold fans still while blowing (free-spinning can damage bearings)
- Clean monthly if in dusty environment
Thermal Paste Replacement
When to replace:
- Every 12-18 months
- If temps rising 10°C+ from baseline
- After moving system
For GPUs:
- Usually voids the manufacturer warranty
- Best done only on out-of-warranty cards
- Use quality paste (Thermal Grizzly, Noctua)
Alerts & Automation
Temperature Alerts
Script to monitor and alert:
#!/bin/bash
# save as gpu_temp_alert.sh
THRESHOLD=85
EMAIL="[email protected]"
while true; do
# Take the hottest GPU so the check also works on multi-GPU systems
temp=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits | sort -n | tail -1)
if [ "$temp" -gt "$THRESHOLD" ]; then
echo "GPU temperature critical: ${temp}°C" | mail -s "GPU Alert" "$EMAIL"
# Optional: shutdown
# sudo shutdown -h now
fi
sleep 60
done
Run on startup:
# Make the script executable, then add it to crontab
chmod +x /path/to/gpu_temp_alert.sh
crontab -e
# Add line:
@reboot /path/to/gpu_temp_alert.sh &
Disk Space Alerts
# Check disk usage
df -h
# Alert if the root filesystem is more than 90% full
EMAIL="[email protected]"
usage=$(df / | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$usage" -gt 90 ]; then
echo "Disk usage critical: ${usage}%" | mail -s "Disk Alert" "$EMAIL"
fi
Troubleshooting
High Temperatures
Causes:
- Dust buildup
- Poor airflow
- Ambient temperature high
- Thermal paste dried out
- Fan failure
Solutions:
- Clean system
- Improve case airflow
- Underclock GPU slightly
- Replace thermal paste
- Replace fans
Performance Degradation
Symptoms:
- Training slower than before
- Lower GPU utilization
- Longer iteration times
Causes:
- Thermal throttling
- Driver issues
- Background processes
- Storage degradation
Diagnostics:
# Check throttling
nvidia-smi -q -d PERFORMANCE
# Monitor clocks
nvidia-smi -q -d CLOCK
# Check for background processes
htop
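Recent drivers also report why clocks are being held back; querying the throttle-reason fields is a useful first check (confirm exact field names with nvidia-smi --help-query-gpu):
# Active throttle reasons (thermal slowdown, power cap, etc.)
nvidia-smi --query-gpu=clocks_throttle_reasons.active,clocks_throttle_reasons.sw_thermal_slowdown,clocks_throttle_reasons.hw_slowdown --format=csv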
Crashes During Training
Potential causes:
- Power delivery issues (PSU)
- Overheating
- RAM errors
- Driver bugs
- Overclocking instability
Solutions:
- Test with different PSU
- Monitor temps closely
- Run memtest86+
- Update/downgrade drivers
- Reset to stock clocks
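Before swapping hardware, check the kernel log for NVIDIA Xid errors; they often indicate whether a crash came from the driver, power delivery, or memory:
# Look for NVIDIA Xid error reports in the kernel log
sudo dmesg | grep -i xid
# Same check via the systemd journal (current boot)
journalctl -k | grep -i xid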
Data Backup
Critical data to backup:
- Model checkpoints
- Training configs
- Custom code
- Processed datasets (if expensive to regenerate)
Backup strategies:
# Local backup to external drive
rsync -av --progress /path/to/models /mnt/external/backup/
# Cloud backup (rclone to Google Drive, S3, etc.)
rclone sync /path/to/models remote:backup/
# Automated daily backup
# Add to crontab
0 2 * * * rsync -av /path/to/models /mnt/external/backup/
Monitoring Stack Setup
Recommended setup:
- Basic: nvidia-smi + htop
- Intermediate: nvtop + Netdata web dashboard
- Advanced: Prometheus + Grafana + TensorBoard
Next Steps
- Monitor GPU health with nvidia-smi and nvtop
- Set up temperature alerts (see scripts above)
- Enable remote access via SSH
- Configure web dashboards for long-term tracking
:::tip[Start Simple] Begin with basic monitoring (nvidia-smi, htop) and add more sophisticated tools as needed. Don’t over-engineer your monitoring setup initially. :::