SLURM & HPC Clusters
Use the SLURM workload manager on HPC clusters for deep learning: submit GPU jobs, manage resources, monitor queue status, and optimize job scheduling for AI training.
Overview
SLURM (Simple Linux Utility for Resource Management) is the most widely used open-source job scheduler for HPC clusters. It manages:
- Job queuing and scheduling
- Resource allocation (CPUs, GPUs, memory)
- Job monitoring and accounting
- Fair-share scheduling
This guide covers SLURM usage for deep learning workloads on HPC clusters.
SLURM Basics
Job Submission
#!/bin/bash
#SBATCH --job-name=my_training
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err
#SBATCH --time=24:00:00
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
# Your commands here
module load python/3.10 cuda/12.1
source ~/venvs/ml/bin/activate
python train.py --epochs 100

Submit with: sbatch submit_job.sh
# Request interactive session
srun --partition=gpu --gres=gpu:1 --mem=16G --cpus-per-task=4 \
--time=2:00:00 --pty bash
# Or with salloc
salloc --partition=gpu --gres=gpu:1 --time=2:00:00

# Submit a simple command directly
sbatch --partition=gpu --gres=gpu:1 --wrap="python train.py"

Common SBATCH Directives
| Directive | Description | Example |
|---|---|---|
| --job-name | Job name | --job-name=training |
| --output | stdout file | --output=logs/%x_%j.out |
| --error | stderr file | --error=logs/%x_%j.err |
| --time | Time limit | --time=24:00:00 (24h) |
| --partition | Queue/partition | --partition=gpu |
| --gres | Generic resources | --gres=gpu:2 (2 GPUs) |
| --cpus-per-task | CPU cores | --cpus-per-task=8 |
| --mem | Memory | --mem=64G |
| --nodes | Number of nodes | --nodes=2 |
| --ntasks | Number of tasks | --ntasks=4 |
:::tip[%x and %j Placeholders]
- %x = job name
- %j = job ID
- %u = username
- %N = node name
Example: logs/%x_%j.out → logs/training_12345.out
:::
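One gotcha with the `--output`/`--error` paths above: SLURM does not create missing parent directories, so a job writing to `logs/%x_%j.out` can fail with no log file at all if `logs/` does not exist. A minimal guard before submitting (the script name is the one used above):

```shell
# SLURM will not create parent directories for --output/--error;
# create them before submitting, or the job may die with no log file.
mkdir -p logs
# sbatch submit_job.sh   # now safe: logs/%x_%j.out can be written
```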
GPU Job Examples
Single GPU Training
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=48:00:00
#SBATCH --job-name=single_gpu_train
module load cuda/12.1 cudnn/8.9
source ~/venvs/pytorch/bin/activate
python train.py \
--model resnet50 \
--batch-size 128 \
--epochs 100
Multi-GPU (Single Node)
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:4 # Request 4 GPUs
#SBATCH --cpus-per-task=16 # 4 CPUs per GPU
#SBATCH --mem=128G
#SBATCH --time=72:00:00
#SBATCH --job-name=multi_gpu_ddp
module load cuda/12.1
source ~/venvs/pytorch/bin/activate
# PyTorch DistributedDataParallel
torchrun --standalone --nnodes=1 --nproc_per_node=4 \
train.py --distributed
Multi-Node Multi-GPU
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --nodes=2 # 2 nodes
#SBATCH --ntasks-per-node=1 # 1 task per node
#SBATCH --gres=gpu:4 # 4 GPUs per node
#SBATCH --cpus-per-task=16
#SBATCH --mem=128G
#SBATCH --time=96:00:00
# Get master node address
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500
# Launch distributed training
srun torchrun \
--nnodes=$SLURM_NNODES \
--nproc_per_node=4 \
--node_rank=$SLURM_NODEID \
--master_addr=$MASTER_ADDR \
--master_port=$MASTER_PORT \
train.py --distributed
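Inside `train.py`, the process group is typically initialized from the environment variables `torchrun` exports for each worker (`RANK`, `WORLD_SIZE`, `LOCAL_RANK`). A minimal sketch of reading them; the actual `init_process_group` call is shown commented since it needs torch and a live cluster:

```python
import os

def dist_env():
    """Read the rank/world-size variables torchrun exports for each worker."""
    return {
        "rank": int(os.environ.get("RANK", 0)),
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),
    }

# In real training code you would then do (requires torch + the cluster):
# torch.distributed.init_process_group(backend="nccl")
# torch.cuda.set_device(dist_env()["local_rank"])
```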
Job Management
Monitor Jobs
# List your jobs
squeue -u $USER
# With more details
squeue -u $USER -o "%.18i %.9P %.30j %.8T %.10M %.6D %R"
# Watch in real-time
watch -n 1 squeue -u $USER

# Job details
scontrol show job JOBID
# Job efficiency (after completion)
seff JOBID
# Job steps
sacct -j JOBID --format=JobID,JobName,Partition,State,Elapsed,MaxRSS
# Real-time job stats
sstat -j JOBID --format=JobID,MaxRSS,AveCPU

# GPU partition status
sinfo -p gpu
# GPU usage across cluster
squeue -p gpu -o "%.18i %.9P %.8u %.2t %.10M %.6D %R %b"
# Available GPUs
sinfo -p gpu -o "%n %G %C"

Control Jobs
# Cancel job
scancel JOBID
# Cancel all your jobs
scancel -u $USER
# Cancel jobs by name
scancel --name=training
# Hold job (prevent from starting)
scontrol hold JOBID
# Release held job
scontrol release JOBID
# Update job (before it starts)
scontrol update JobId=JOBID TimeLimit=48:00:00
Job Arrays
Run multiple similar jobs efficiently:
#!/bin/bash
#SBATCH --array=0-4              # 5 jobs: one per learning rate
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
#SBATCH --job-name=sweep
#SBATCH --output=logs/sweep_%A_%a.out
# Learning rates to test
LRS=(0.1 0.01 0.001 0.0001 0.00001)
# Get learning rate for this array task
LR=${LRS[$SLURM_ARRAY_TASK_ID]}
# Run training with this learning rate
python train.py --lr $LR --output_dir results/lr_$LR
Submit: sbatch job_array.sh
Manage array:
# Check array jobs
squeue -u $USER -r
# Cancel specific array task
scancel JOBID_3
# Cancel entire array
scancel JOBID
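A flat array index can also be mapped onto a multi-dimensional parameter grid with integer arithmetic, so one array covers every combination. A sketch (the parameter values and helper name are illustrative):

```shell
# Map a flat SLURM_ARRAY_TASK_ID onto a (learning-rate, batch-size) grid.
# With 3 learning rates and 2 batch sizes, submit with #SBATCH --array=0-5.
map_params() {
    local idx=$1
    local lrs=(0.1 0.01 0.001)
    local bss=(64 128)
    local lr=${lrs[$(( idx / ${#bss[@]} ))]}
    local bs=${bss[$(( idx % ${#bss[@]} ))]}
    echo "$lr $bs"
}

# In the job script:
# read LR BS <<< "$(map_params "$SLURM_ARRAY_TASK_ID")"
# python train.py --lr "$LR" --batch-size "$BS"
```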
Advanced Features
Job Dependencies
# Job 1: Preprocess data
JOB1=$(sbatch --parsable preprocess.sh)
# Job 2: Train (waits for Job 1)
JOB2=$(sbatch --dependency=afterok:$JOB1 train.sh)
# Job 3: Evaluate (waits for Job 2)
sbatch --dependency=afterok:$JOB2 evaluate.sh

# Launch multiple training jobs
JOB1=$(sbatch --parsable train_fold1.sh)
JOB2=$(sbatch --parsable train_fold2.sh)
JOB3=$(sbatch --parsable train_fold3.sh)
# Merge results after all complete
sbatch --dependency=afterok:$JOB1:$JOB2:$JOB3 merge_results.sh

Email Notifications
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]
Checkpoint and Resume
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
#SBATCH --signal=B:USR1@600 # Send signal 10 min before timeout
# Checkpoint handler: forward the signal to the training process
checkpoint() {
    echo "Caught USR1 -- checkpointing..."
    kill -USR1 "$TRAIN_PID" 2>/dev/null
    touch checkpoint_signal
}
trap checkpoint USR1
# Run training in the background so the shell can handle the signal
python train.py --resume_if_exists &
TRAIN_PID=$!
wait "$TRAIN_PID"
# If time runs out, automatically resubmit
if [ -f checkpoint_signal ]; then
sbatch $0 # Resubmit this script
fi
Resource Optimization
Check Job Efficiency
# After job completes
seff JOBID
Example output:
Job ID: 123456
Cluster: mycluster
User/Group: user/group
State: COMPLETED (exit code 0)
Cores: 4
CPU Utilized: 23:45:30
CPU Efficiency: 98.52% of 24:06:00 core-walltime
Memory Utilized: 28.5 GB
Memory Efficiency: 89.06% of 32.0 GB
Right-Size Resources
# Start with conservative estimate
#SBATCH --time=4:00:00
#SBATCH --mem=16G
# Check actual usage with seff
# Adjust for production run

# SSH to the compute node running your job
squeue -u $USER        # Get node name
ssh compute-node-01
# Check resources on the node
nvidia-smi
htop

Best Practices
- Test with Short Jobs - Debug with --time=1:00:00 first
- Request Exact GPUs - Use --gres=gpu:a100:2 for specific GPU types
- Use Job Arrays - For parameter sweeps instead of many separate jobs
- Checkpoint Frequently - Save progress every epoch or hour
- Monitor Efficiency - Use seff to optimize resource requests
- Clean Up - Remove old output files and checkpoints
:::caution[Fair Usage]
- Don’t submit hundreds of jobs at once
- Use appropriate time limits (don’t request 7 days if you need 4 hours)
- Don’t hog all GPUs - leave some for others
- Clean up scratch space regularly
:::
Troubleshooting
Job Pending Forever
# Why is job pending?
squeue -j JOBID --start
# Check partition limits
scontrol show partition gpu
# Check your limits
sacctmgr show assoc where user=$USER format=user,account,partition,maxjobs,maxsubmit
Out of Memory
# Check actual memory usage
seff JOBID
# Increase memory in job script
#SBATCH --mem=64G
# Or memory per CPU
#SBATCH --mem-per-cpu=4G
Job Killed Without Error
# Check job output
cat slurm-JOBID.out
# Check system logs
sacct -j JOBID --format=JobID,State,ExitCode,DerivedExitCode
# Common causes:
# - Out of memory (OOM)
# - Time limit exceeded
# - Node failure
Useful Commands Reference
# Submit job
sbatch script.sh
# List jobs
squeue -u $USER
# Cancel job
scancel JOBID
# Job details
scontrol show job JOBID
# Job efficiency
seff JOBID
# Interactive session
srun --pty bash
# Cluster info
sinfo
# Your account info
sacctmgr show user $USER
# Job history
sacct -u $USER --starttime=2025-01-01
Additional Resources
- Official SLURM Documentation
- GWDG Cluster Guide - Institution-specific info
- Multi-GPU Training - Distributed training
- Training Utilities - Job management scripts
- Backup & Sync - Data transfer to/from clusters
:::tip[Quick Start Checklist]
- ✅ Get cluster account
- ✅ Set up SSH keys
- ✅ Test with a small interactive job (srun)
- ✅ Write batch script for your workload
- ✅ Submit and monitor with squeue
- ✅ Check efficiency with seff
- ✅ Scale up to production runs
:::