
SLURM & HPC Clusters

Use the SLURM workload manager on HPC clusters for deep learning: submit GPU jobs, manage resources, monitor queue status, and optimize job scheduling for AI training.

Overview

SLURM (Simple Linux Utility for Resource Management) is one of the most widely used open-source job schedulers for HPC clusters. It manages:

  • Job queuing and scheduling
  • Resource allocation (CPUs, GPUs, memory)
  • Job monitoring and accounting
  • Fair-share scheduling

This guide covers SLURM usage for deep learning workloads on HPC clusters.
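
Each of these functions maps onto a small set of commands you will meet throughout this guide. A quick, illustrative orientation (the script name and job ID are placeholders):

sbatch train_job.sh        # queue a batch job
squeue -u $USER            # watch it in the queue
scontrol show job 12345    # inspect its allocation while it runs
sacct -j 12345             # accounting details after it finishes
sshare -u $USER            # your fair-share standing on the cluster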


SLURM Basics

Job Submission

#!/bin/bash
#SBATCH --job-name=my_training
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err
#SBATCH --time=24:00:00
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G

# Your commands here
module load python/3.10 cuda/12.1
source ~/venvs/ml/bin/activate
python train.py --epochs 100

Save the script (for example as submit_job.sh) and submit it with: sbatch submit_job.sh

Interactive Sessions

# Request an interactive session
srun --partition=gpu --gres=gpu:1 --mem=16G --cpus-per-task=4 \
     --time=2:00:00 --pty bash

# Or with salloc
salloc --partition=gpu --gres=gpu:1 --time=2:00:00

# Submit a simple command directly
sbatch --partition=gpu --gres=gpu:1 --wrap="python train.py"

Common SBATCH Directives

| Directive | Description | Example |
|---|---|---|
| --job-name | Job name | --job-name=training |
| --output | stdout file | --output=logs/%x_%j.out |
| --error | stderr file | --error=logs/%x_%j.err |
| --time | Time limit | --time=24:00:00 (24h) |
| --partition | Queue/partition | --partition=gpu |
| --gres | Generic resources | --gres=gpu:2 (2 GPUs) |
| --cpus-per-task | CPU cores | --cpus-per-task=8 |
| --mem | Memory | --mem=64G |
| --nodes | Number of nodes | --nodes=2 |
| --ntasks | Number of tasks | --ntasks=4 |

:::tip[%x and %j Placeholders]

  • %x = job name
  • %j = job ID
  • %u = username
  • %N = node name

Example: logs/%x_%j.out expands to logs/training_12345.out :::


GPU Job Examples

Single GPU Training

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=48:00:00
#SBATCH --job-name=single_gpu_train

module load cuda/12.1 cudnn/8.9
source ~/venvs/pytorch/bin/activate

python train.py \
    --model resnet50 \
    --batch-size 128 \
    --epochs 100

Multi-GPU (Single Node)

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:4          # Request 4 GPUs
#SBATCH --cpus-per-task=16     # 4 CPUs per GPU
#SBATCH --mem=128G
#SBATCH --time=72:00:00
#SBATCH --job-name=multi_gpu_ddp

module load cuda/12.1
source ~/venvs/pytorch/bin/activate

# PyTorch DistributedDataParallel
torchrun --standalone --nnodes=1 --nproc_per_node=4 \
    train.py --distributed

Multi-Node Multi-GPU

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --nodes=2              # 2 nodes
#SBATCH --ntasks-per-node=1    # 1 task per node
#SBATCH --gres=gpu:4           # 4 GPUs per node
#SBATCH --cpus-per-task=16
#SBATCH --mem=128G
#SBATCH --time=96:00:00

# Get master node address
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500

# Launch distributed training
srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=4 \
    --node_rank=$SLURM_NODEID \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    train.py --distributed
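
If a multi-node launch hangs during rendezvous, turning on NCCL's logging in the batch script (before the srun line) usually reveals which node or network interface is the problem:

# Print NCCL initialization and communication details to the job log
export NCCL_DEBUG=INFO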

Job Management

Monitor Jobs

# List your jobs
squeue -u $USER

# With more details
squeue -u $USER -o "%.18i %.9P %.30j %.8T %.10M %.6D %R"

# Watch in real-time
watch -n 1 squeue -u $USER
# Job details
scontrol show job JOBID

# Job efficiency (after completion)
seff JOBID

# Job steps
sacct -j JOBID --format=JobID,JobName,Partition,State,Elapsed,MaxRSS

# Real-time job stats
sstat -j JOBID --format=JobID,MaxRSS,AveCPU

Cluster and GPU Status

# GPU partition status
sinfo -p gpu

# GPU usage across cluster
squeue -p gpu -o "%.18i %.9P %.8u %.2t %.10M %.6D %R %b"

# Available GPUs
sinfo -p gpu -o "%n %G %C"

Control Jobs

# Cancel job
scancel JOBID

# Cancel all your jobs
scancel -u $USER

# Cancel jobs by name
scancel --name=training

# Hold job (prevent from starting)
scontrol hold JOBID

# Release held job
scontrol release JOBID

# Update job (before it starts)
scontrol update JobId=JOBID TimeLimit=48:00:00

Job Arrays

Run multiple similar jobs efficiently:

#!/bin/bash
#SBATCH --array=0-9           # 10 jobs: indices 0-9
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
#SBATCH --job-name=sweep
#SBATCH --output=logs/sweep_%A_%a.out

# Learning rates to test
LRS=(0.1 0.01 0.001 0.0001 0.00001 0.1 0.01 0.001 0.0001 0.00001)

# Get learning rate for this array task
LR=${LRS[$SLURM_ARRAY_TASK_ID]}

# Run training with this learning rate
python train.py --lr $LR --output_dir results/lr_$LR

Submit: sbatch job_array.sh

Manage array:

# Check array jobs
squeue -u $USER -r

# Cancel specific array task
scancel JOBID_3

# Cancel entire array
scancel JOBID
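
For large sweeps, SLURM's throttle syntax keeps you from flooding the partition; only the given number of array tasks run at once:

# 100 tasks total, but at most 10 running at any one time
#SBATCH --array=0-99%10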

Advanced Features

Job Dependencies

# Job 1: Preprocess data
JOB1=$(sbatch --parsable preprocess.sh)

# Job 2: Train (waits for Job 1)
JOB2=$(sbatch --dependency=afterok:$JOB1 train.sh)

# Job 3: Evaluate (waits for Job 2)
sbatch --dependency=afterok:$JOB2 evaluate.sh

# Launch multiple training jobs
JOB1=$(sbatch --parsable train_fold1.sh)
JOB2=$(sbatch --parsable train_fold2.sh)
JOB3=$(sbatch --parsable train_fold3.sh)

# Merge results after all complete
sbatch --dependency=afterok:$JOB1:$JOB2:$JOB3 merge_results.sh
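
Besides afterok, other dependency types are worth knowing; two illustrative uses (script names are placeholders):

# Run cleanup whether the training job succeeded or failed
sbatch --dependency=afterany:$JOB1 cleanup.sh

# Allow only one job with this name (per user) to run at a time
sbatch --dependency=singleton --job-name=nightly_train nightly_train.sh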

Email Notifications

#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]

Checkpoint and Resume

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
#SBATCH --signal=B:USR1@600  # Send signal 10 min before timeout

# Checkpoint handler: runs when SLURM sends USR1 shortly before the time limit
checkpoint() {
    echo "Checkpointing..."
    # Your checkpoint save code
    touch checkpoint_signal
}

trap checkpoint USR1

# Run training in the background and wait, so bash can handle the USR1 trap
# (a foreground process would delay the trap until training exits)
python train.py --resume_if_exists &
wait

# If the time limit was approaching, resubmit before the job is killed
if [ -f checkpoint_signal ]; then
    rm checkpoint_signal
    sbatch "$0"  # Resubmit this script
fi
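
An alternative to resubmitting from inside the script is to let SLURM requeue the job itself; a minimal sketch, assuming your cluster permits requeueing:

#SBATCH --requeue
#SBATCH --signal=B:USR1@600

# On the timeout warning, ask SLURM to put this job back in the queue;
# training resumes from the last checkpoint when it starts again
trap 'scontrol requeue $SLURM_JOB_ID' USR1

python train.py --resume_if_exists &
wait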

Resource Optimization

Check Job Efficiency

# After job completes
seff JOBID

Example output:

Job ID: 123456
Cluster: mycluster
User/Group: user/group
State: COMPLETED (exit code 0)
Cores: 4
CPU Utilized: 23:45:30
CPU Efficiency: 98.52% of 24:06:00 core-walltime
Memory Utilized: 28.5 GB
Memory Efficiency: 89.06% of 32.0 GB

Right-Size Resources

# Start with conservative estimate
#SBATCH --time=4:00:00
#SBATCH --mem=16G

# Check actual usage with seff
# Adjust for production run

Monitor on the Node

# SSH to the compute node running your job (if your cluster allows it)
squeue -u $USER  # Get node name
ssh compute-node-01

# Check resources
nvidia-smi
htop
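
If direct SSH to compute nodes is not allowed, most reasonably recent SLURM versions let you attach a shell to your own running job's allocation instead:

# Open a shell inside the resources of an already-running job
srun --jobid=JOBID --overlap --pty bash

# Then inspect GPU and CPU usage from inside the allocation
nvidia-smi
htop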

Best Practices

  1. Test with Short Jobs - Debug with --time=1:00:00 first
  2. Request Exact GPUs - Use --gres=gpu:a100:2 for specific GPU types (see the sketch after this list)
  3. Use Job Arrays - For parameter sweeps instead of many separate jobs
  4. Checkpoint Frequently - Save progress every epoch or hour
  5. Monitor Efficiency - Use seff to optimize resource requests
  6. Clean Up - Remove old output files and checkpoints
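
Points 1 and 2 combine naturally into a quick smoke test before a long run; a sketch (the a100 type label and script name are assumptions, check sinfo -o "%G" for your cluster's GPU names):

# Short, single-GPU debug run on a specific GPU type before scaling up
sbatch --partition=gpu --gres=gpu:a100:1 --time=1:00:00 \
       --wrap="python train.py --epochs 1"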

:::caution[Fair Usage]

  • Don’t submit hundreds of jobs at once
  • Use appropriate time limits (don’t request 7 days if you need 4 hours)
  • Don’t hog all GPUs - leave some for others
  • Clean up scratch space regularly :::

Troubleshooting

Job Pending Forever

# Why is job pending?
squeue -j JOBID --start

# Check partition limits
scontrol show partition gpu

# Check your limits
sacctmgr show assoc where user=$USER format=user,account,partition,maxjobs,maxsubmit
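
The reason a job is stuck appears in squeue's Reason field; printing just that column makes it easy to spot causes like Priority, Resources, or QOS limits:

# Show state and pending reason for one job
squeue -j JOBID -o "%.18i %.8T %r"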

Out of Memory

# Check actual memory usage
seff JOBID

# Increase memory in job script
#SBATCH --mem=64G

# Or memory per CPU
#SBATCH --mem-per-cpu=4G

Job Killed Without Error

# Check job output
cat slurm-JOBID.out

# Check system logs
sacct -j JOBID --format=JobID,State,ExitCode,DerivedExitCode

# Common causes:
# - Out of memory (OOM)
# - Time limit exceeded
# - Node failure

Useful Commands Reference

# Submit job
sbatch script.sh

# List jobs
squeue -u $USER

# Cancel job
scancel JOBID

# Job details
scontrol show job JOBID

# Job efficiency
seff JOBID

# Interactive session
srun --pty bash

# Cluster info
sinfo

# Your account info
sacctmgr show user $USER

# Job history
sacct -u $USER --starttime=2025-01-01

Additional Resources

Official SLURM Documentation: https://slurm.schedmd.com/

:::tip[Quick Start Checklist]

  1. ✅ Get cluster account
  2. ✅ Set up SSH keys
  3. ✅ Test with small interactive job (srun)
  4. ✅ Write batch script for your workload
  5. ✅ Submit and monitor with squeue
  6. ✅ Check efficiency with seff
  7. ✅ Scale up to production runs :::