GWDG HPC Resources
Access GWDG HPC cluster resources for deep learning. Learn about GPU nodes, job scheduling, available courses, and compute allocations for AI research at University of Göttingen.
Overview
The GWDG (Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen) is a joint data and IT service center for the University of Göttingen and the Max Planck Society. It provides:
- High-performance computing (HPC) clusters
- GPU resources for deep learning
- Training and courses
- Research collaboration
Getting Started
Access Requirements
- Account Registration - Apply through your institution
- SSH Key Setup - Required for cluster access
- Course Completion - Recommended for efficient usage
Cluster Access
# Using SSH key authentication
ssh -i $HOME/.ssh/id_rsa_nhr -l username glogin.hlrn.de
# Example with specific key and user
ssh -i $HOME/.ssh/id_ed25519 u10000@glogin9.hlrn.de

Replace:
- id_ed25519 with your key filename
- u10000 with your actual username
- glogin9.hlrn.de with your assigned login node
# Add to ~/.ssh/config for easier access
Host gwdg
HostName glogin9.hlrn.de
User u10000
IdentityFile ~/.ssh/id_ed25519
ServerAliveInterval 60
ServerAliveCountMax 3
# Now just use: ssh gwdg

SSH Key Generation

# Generate ed25519 key (recommended)
ssh-keygen -t ed25519 -f ~/.ssh/id_gwdg -C "your_email@example.com"
# Or RSA key
ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa_gwdg
# Copy public key to GWDG
# (Submit through GWDG portal or email)
cat ~/.ssh/id_gwdg.pub

GWDG Training Courses
Deep Learning with GPUs
Comprehensive course on GPU computing for deep learning on HPC clusters.
Topics Covered:
- GPU architecture and CUDA basics
- Deep learning frameworks (PyTorch, TensorFlow)
- Batch job submission with GPUs
- Multi-GPU training strategies (see the sketch below)
- Performance optimization
Deep Learning with GPUs Course
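For a flavor of the multi-GPU material, here is a minimal PyTorch DistributedDataParallel sketch. The script name, toy model, and sizes are purely illustrative; you would launch it with torchrun --nproc_per_node=NUM_GPUS ddp_train.py:

# ddp_train.py - minimal DistributedDataParallel sketch (toy model, illustrative)
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # torchrun supplies rank and world size
    local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun, one process per GPU
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(10):
        x = torch.randn(32, 10, device=local_rank)
        y = torch.randn(32, 1, device=local_rank)
        loss = torch.nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()                            # DDP all-reduces gradients here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()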
Scientific Computing on Clusters
Practical introduction to HPC cluster usage.
Topics Covered:
- Cluster architecture
- SLURM job scheduler
- Resource allocation
- Module system
- Data management
Scientific Computing Cluster Course
:::tip[Highly Recommended]
Complete these courses before starting intensive GPU workloads. They cover best practices specific to GWDG infrastructure and will save you time troubleshooting.
:::
Common GWDG Workflows
Submit GPU Job
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
#SBATCH --mem=32G
#SBATCH --cpus-per-task=4
#SBATCH --job-name=my_training
#SBATCH --output=logs/%x_%j.out
# Load modules
module load python/3.10
module load cuda/12.1
# Activate environment
source $HOME/venvs/ml/bin/activate
# Run training
python train.py --epochs 100 --batch-size 64
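Submit the script with sbatch (for example sbatch train_gpu.sh, if that is the filename you saved it under) and note the job ID it prints. One caveat: SLURM does not create the output directory for you, so run mkdir -p logs once before the first submission or the job will fail to write its log.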
Interactive GPU Session
# Request interactive session with GPU
srun --partition=gpu --gres=gpu:1 --mem=16G --cpus-per-task=4 --time=2:00:00 --pty bash
# Check GPU availability
nvidia-smi
# Run interactive work
python
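Inside the session, a quick sanity check that PyTorch (assuming it is installed in your environment) actually sees the allocated GPU:

import torch
print(torch.cuda.is_available())      # True if a GPU is visible to PyTorch
print(torch.cuda.device_count())      # number of GPUs in this allocation
print(torch.cuda.get_device_name(0))  # model of the first allocated GPU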
Check Job Status
# Your jobs
squeue -u $USER
# Specific job details
scontrol show job JOBID
# Job history
sacct -u $USER --format=JobID,JobName,State,Elapsed,MaxRSS
Data Management
Transfer Data to GWDG
# Upload dataset
rsync -avhP --progress \
/local/datasets/imagenet \
u10000@glogin9.hlrn.de:/work/datasets/
# Download results
rsync -avhP --progress \
u10000@glogin9.hlrn.de:/work/experiments/checkpoints \
./local_backups/

# Upload file
scp -i ~/.ssh/id_gwdg large_dataset.tar.gz \
u10000@glogin9.hlrn.de:/work/datasets/
# Download file
scp -i ~/.ssh/id_gwdg \
u10000@glogin9.hlrn.de:/work/results/model.pth \
./local/

# Connect with sftp
sftp -i ~/.ssh/id_gwdg u10000@glogin9.hlrn.de
# SFTP commands
put local_file.txt # Upload
get remote_file.txt # Download
ls # List remote files
lcd /local/path # Change local directory

Storage Locations
# Home directory (limited space)
$HOME
# Work directory (larger quota)
/work/$USER
# Scratch for temporary files
/scratch/$USER
# Check quotas
quota -s
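Scratch file systems are typically purged automatically after a retention period, so treat /scratch as strictly temporary: stage data there for fast I/O during a job, and copy anything worth keeping back to /work or $HOME.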
Module System
# List available modules
module avail
# Search for specific module
module avail cuda
# Load modules
module load python/3.10 cuda/12.1 cudnn/8.9
# List loaded modules
module list
# Unload module
module unload cuda
# Purge all modules
module purge
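Module loads only apply to the current shell session, which is why the batch script earlier loads its modules explicitly. Pinning exact versions (python/3.10 rather than just python) also keeps batch jobs reproducible when cluster defaults change.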
Best Practices
- Test Locally First - Debug on small datasets before cluster submission
- Use Batch Jobs - Don’t run long jobs on login nodes
- Monitor Resources - Use seff JOBID to check efficiency
- Clean Up - Remove old data from scratch regularly
- Checkpointing - Save progress frequently for long jobs (see the sketch after this list)
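For the checkpointing item, a minimal PyTorch save/resume pattern; the path and dictionary keys are illustrative, not a GWDG convention:

import os
import torch

CKPT_PATH = "checkpoints/latest.pt"   # illustrative path, e.g. under /work

def save_checkpoint(model, optimizer, epoch):
    os.makedirs(os.path.dirname(CKPT_PATH), exist_ok=True)
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch}, CKPT_PATH)

def load_checkpoint(model, optimizer):
    if not os.path.exists(CKPT_PATH):
        return 0                                  # no checkpoint: start at epoch 0
    state = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1                     # resume from the next epoch

Calling save_checkpoint at the end of each epoch lets a job that hits its time limit be resubmitted and resume where it stopped.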
:::caution[Login Nodes]
Login nodes are shared resources. Never run computationally intensive tasks on login nodes. Always submit batch jobs or request interactive sessions.
:::
Troubleshooting
Job Won’t Start
# Check queue
squeue -p gpu
# Check job priority
sprio -j JOBID
# Explain why job is pending
squeue -j JOBID --start
Out of Memory Errors
# Check actual memory usage
seff JOBID
# Request more memory in job script
#SBATCH --mem=64G
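Keep in mind that --mem (and the MaxRSS that seff reports) is host RAM. If the error is a CUDA out-of-memory instead, the GPU is the bottleneck; one common remedy is to shrink the per-step batch and accumulate gradients, sketched here with a toy model and data:

import torch

model = torch.nn.Linear(10, 1).cuda()                  # toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
batches = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(16)]

accum_steps = 4                                        # effective batch = 4 * 8
optimizer.zero_grad()
for i, (x, y) in enumerate(batches):
    loss = torch.nn.functional.mse_loss(model(x.cuda()), y.cuda())
    (loss / accum_steps).backward()                    # gradients accumulate in .grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()                               # one update per accum_steps micro-batches
        optimizer.zero_grad()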
Connection Issues
# Test connection
ssh -v u10000@glogin9.hlrn.de
# Check key permissions
chmod 600 ~/.ssh/id_gwdg
chmod 644 ~/.ssh/id_gwdg.pub
Additional Resources
- GWDG Official Documentation
- SLURM Guide - General HPC usage
- Multi-GPU Training - Distributed training
- Backup & Sync - Data transfer strategies
:::note[Institution-Specific]
This guide focuses on GWDG infrastructure. Other HPC centers may have different configurations, but general principles apply.
:::