GWDG HPC Resources
Access GWDG HPC cluster resources for deep learning. Learn about GPU nodes, job scheduling, available courses, and compute allocations for AI research at University of Göttingen.
Overview
The GWDG (Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen) is a joint data and IT service center for the University of Göttingen and the Max Planck Society. It provides:
- High-performance computing (HPC) clusters
- GPU resources for deep learning
- Training and courses
- Research collaboration
Getting Started
Access Requirements
- Account Registration - Apply through your institution
- SSH Key Setup - Required for cluster access
- Course Completion - Recommended for efficient usage
Cluster Access
# Using SSH key authentication
ssh -i $HOME/.ssh/id_rsa_nhr -l username glogin.hlrn.de
# Example with specific key and user
ssh -i $HOME/.ssh/id_ed25519 u10000@glogin9.hlrn.de

Replace:
- id_ed25519 with your key filename
- u10000 with your actual username
- glogin9.hlrn.de with your assigned login node
# Add to ~/.ssh/config for easier access
Host gwdg
HostName glogin9.hlrn.de
User u10000
IdentityFile ~/.ssh/id_ed25519
ServerAliveInterval 60
ServerAliveCountMax 3
# Now just use: ssh gwdg

SSH Key Generation

# Generate ed25519 key (recommended)
ssh-keygen -t ed25519 -f ~/.ssh/id_gwdg -C "your_email@example.com"
# Or RSA key
ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa_gwdg
# Copy public key to GWDG
# (Submit through GWDG portal or email)
cat ~/.ssh/id_gwdg.pub

GWDG Training Courses
Deep Learning with GPUs
Comprehensive course on GPU computing for deep learning on HPC clusters.
Topics Covered:
- GPU architecture and CUDA basics
- Deep learning frameworks (PyTorch, TensorFlow)
- Batch job submission with GPUs
- Multi-GPU training strategies (see the sketch below)
- Performance optimization
Deep Learning with GPUs Course
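For a flavor of the multi-GPU material, here is a minimal PyTorch DistributedDataParallel sketch. The script name, toy model, and sizes are purely illustrative; you would launch it with torchrun --nproc_per_node=NUM_GPUS ddp_train.py:

# ddp_train.py - minimal DistributedDataParallel sketch (toy model, illustrative)
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # torchrun supplies rank and world size
    local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun, one process per GPU
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(10):
        x = torch.randn(32, 10, device=local_rank)
        y = torch.randn(32, 1, device=local_rank)
        loss = torch.nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()                            # DDP all-reduces gradients here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()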
Scientific Computing on Clusters
Practical introduction to HPC cluster usage.
Topics Covered:
- Cluster architecture
- SLURM job scheduler
- Resource allocation
- Module system
- Data management
Scientific Computing Cluster Course
:::tip[Highly Recommended]
Complete these courses before starting intensive GPU workloads. They cover best practices specific to GWDG infrastructure and will save you time troubleshooting.
:::
Common GWDG Workflows
Submit GPU Job
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
#SBATCH --mem=32G
#SBATCH --cpus-per-task=4
#SBATCH --job-name=my_training
#SBATCH --output=logs/%x_%j.out
# Load modules
module load python/3.10
module load cuda/12.1
# Activate environment
source $HOME/venvs/ml/bin/activate
# Run training
python train.py --epochs 100 --batch-size 64
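Submit the script with sbatch (for example sbatch train_gpu.sh, if that is the filename you saved it under) and note the job ID it prints. One caveat: SLURM does not create the output directory for you, so run mkdir -p logs once before the first submission or the job will fail to write its log.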
Interactive GPU Session
# Request interactive session with GPU
srun --partition=gpu --gres=gpu:1 --mem=16G --cpus-per-task=4 --time=2:00:00 --pty bash
# Check GPU availability
nvidia-smi
# Run interactive work
python
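Inside the session, a quick sanity check that PyTorch (assuming it is installed in your environment) actually sees the allocated GPU:

import torch
print(torch.cuda.is_available())      # True if a GPU is visible to PyTorch
print(torch.cuda.device_count())      # number of GPUs in this allocation
print(torch.cuda.get_device_name(0))  # model of the first allocated GPU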
Check Job Status
# Your jobs
squeue -u $USER
# Specific job details
scontrol show job JOBID
# Job history
sacct -u $USER --format=JobID,JobName,State,Elapsed,MaxRSS
Data Management
Transfer Data to GWDG
# Upload dataset
rsync -avhP --progress \
/local/datasets/imagenet \
u10000@glogin9.hlrn.de:/work/datasets/
# Download results
rsync -avhP --progress \
u10000@glogin9.hlrn.de:/work/experiments/checkpoints \
./local_backups/

# Upload file
scp -i ~/.ssh/id_gwdg large_dataset.tar.gz \
u10000@glogin9.hlrn.de:/work/datasets/
# Download file
scp -i ~/.ssh/id_gwdg \
u10000@glogin9.hlrn.de:/work/results/model.pth \
./local/

# Connect with sftp
sftp -i ~/.ssh/id_gwdg u10000@glogin9.hlrn.de
# SFTP commands
put local_file.txt # Upload
get remote_file.txt # Download
ls # List remote files
lcd /local/path # Change local directory

Storage Locations
# Home directory (limited space)
$HOME
# Work directory (larger quota)
/work/$USER
# Scratch for temporary files
/scratch/$USER
# Check quotas
quota -s
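Scratch file systems are typically purged automatically after a retention period, so treat /scratch as strictly temporary: stage data there for fast I/O during a job, and copy anything worth keeping back to /work or $HOME.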
Module System
# List available modules
module avail
# Search for specific module
module avail cuda
# Load modules
module load python/3.10 cuda/12.1 cudnn/8.9
# List loaded modules
module list
# Unload module
module unload cuda
# Purge all modules
module purge
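Module loads only apply to the current shell session, which is why the batch script earlier loads its modules explicitly. Pinning exact versions (python/3.10 rather than just python) also keeps batch jobs reproducible when cluster defaults change.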
Best Practices
- Test Locally First - Debug on small datasets before cluster submission
- Use Batch Jobs - Don’t run long jobs on login nodes
- Monitor Resources - Use seff JOBID to check efficiency
- Clean Up - Remove old data from scratch regularly
- Checkpointing - Save progress frequently for long jobs (see the sketch after this list)
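For the checkpointing item, a minimal PyTorch save/resume pattern; the path and dictionary keys are illustrative, not a GWDG convention:

import os
import torch

CKPT_PATH = "checkpoints/latest.pt"   # illustrative path, e.g. under /work

def save_checkpoint(model, optimizer, epoch):
    os.makedirs(os.path.dirname(CKPT_PATH), exist_ok=True)
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch}, CKPT_PATH)

def load_checkpoint(model, optimizer):
    if not os.path.exists(CKPT_PATH):
        return 0                                  # no checkpoint: start at epoch 0
    state = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1                     # resume from the next epoch

Calling save_checkpoint at the end of each epoch lets a job that hits its time limit be resubmitted and resume where it stopped.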
:::caution[Login Nodes]
Login nodes are shared resources. Never run computationally intensive tasks on login nodes. Always submit batch jobs or request interactive sessions.
:::
Troubleshooting
Job Won’t Start
# Check queue
squeue -p gpu
# Check job priority
sprio -j JOBID
# Explain why job is pending
squeue -j JOBID --start
Out of Memory Errors
# Check actual memory usage
seff JOBID
# Request more memory in job script
#SBATCH --mem=64G
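Keep in mind that --mem (and the MaxRSS that seff reports) is host RAM. If the error is a CUDA out-of-memory instead, the GPU is the bottleneck; one common remedy is to shrink the per-step batch and accumulate gradients, sketched here with a toy model and data:

import torch

model = torch.nn.Linear(10, 1).cuda()                  # toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
batches = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(16)]

accum_steps = 4                                        # effective batch = 4 * 8
optimizer.zero_grad()
for i, (x, y) in enumerate(batches):
    loss = torch.nn.functional.mse_loss(model(x.cuda()), y.cuda())
    (loss / accum_steps).backward()                    # gradients accumulate in .grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()                               # one update per accum_steps micro-batches
        optimizer.zero_grad()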
Connection Issues
# Test connection
ssh -v u10000@glogin9.hlrn.de
# Check key permissions
chmod 600 ~/.ssh/id_gwdg
chmod 644 ~/.ssh/id_gwdg.pub
Additional Resources
- GWDG Official Documentation
- SLURM Guide - General HPC usage
- Multi-GPU Training - Distributed training
- Backup & Sync - Data transfer strategies
:::note[Institution-Specific]
This guide focuses on GWDG infrastructure. Other HPC centers may have different configurations, but general principles apply.
:::