Backup & Sync Scripts
Backup scripts for deep learning datasets, trained models, and checkpoints. Sync files across systems with rsync, manage cloud backups, and automate data protection workflows.
Overview
Essential for protecting your work:
- Dataset backups - Protect valuable preprocessed data
- Model checkpoints - Save training progress
- Code synchronization - Keep multiple machines in sync
- Remote backups - Off-site redundancy
Dataset Backup Scripts
Basic rsync Backup
#!/bin/bash
# Backup dataset to external drive or network storage
SOURCE="/data/datasets/imagenet"
DEST="/mnt/backup/datasets/imagenet"
# Create backup with progress
rsync -avh --progress "$SOURCE/" "$DEST/"
# With in-transit compression (-z): helps over slow network links,
# but adds CPU cost and does not shrink the files at the destination
rsync -avhz --progress "$SOURCE/" "$DEST/"
# Exclude certain files
rsync -avh --progress \
--exclude='*.tmp' \
--exclude='.cache' \
"$SOURCE/" "$DEST/"
Incremental Backup
#!/bin/bash
# Faster backups - only sync changes
SOURCE="/data/datasets"
DEST="/mnt/backup/datasets"
LOG="/var/log/dataset_backup.log"
# Incremental with deletion of removed files
rsync -avh --progress \
--delete \
--log-file="$LOG" \
"$SOURCE/" "$DEST/"
echo "Backup completed: $(date)" >> "$LOG"
Compressed Archive Backup
#!/bin/bash
# Create compressed backup archive
DATASET="/data/datasets/coco"
BACKUP_DIR="/mnt/backup"
DATE=$(date +%Y%m%d)
# Create compressed tar archive
tar -czf "$BACKUP_DIR/coco_${DATE}.tar.gz" \
-C /data/datasets coco
# With progress (requires pv); pv sits before gzip so the size estimate is accurate
tar -cf - -C /data/datasets coco | \
    pv -s "$(du -sb /data/datasets/coco | awk '{print $1}')" | \
    gzip > "$BACKUP_DIR/coco_${DATE}.tar.gz"

Split Archive Backup

#!/bin/bash
# Split backup into multiple files (useful for cloud upload limits)
DATASET="/data/datasets/imagenet"
BACKUP_DIR="/mnt/backup"
DATE=$(date +%Y%m%d)
SPLIT_SIZE="10G" # 10GB chunks
# Create and split archive
tar -czf - "$DATASET" | \
split -b "$SPLIT_SIZE" - \
"$BACKUP_DIR/imagenet_${DATE}.tar.gz.part"
# Restore with:
# cat imagenet_20250101.tar.gz.part* | tar -xzf -

Model Checkpoint Management
Automatic Checkpoint Backup
#!/bin/bash
# Continuously backup training checkpoints
SOURCE="/home/user/experiments/model_v1/checkpoints"
DEST="/mnt/backup/checkpoints/model_v1"
# Watch and sync checkpoints as they're created
while true; do
# Include dirs so rsync recurses, then only checkpoint files; exclude the rest
rsync -avh --progress --prune-empty-dirs \
    --include='*/' \
    --include='*.pt' \
    --include='*.pth' \
    --include='*.ckpt' \
    --exclude='*' \
"$SOURCE/" "$DEST/"
# Check every 5 minutes
sleep 300
done
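Polling works anywhere, but if `inotify-tools` is installed you can sync event-driven instead, reacting the moment a checkpoint finishes writing. A sketch under that assumption, using the same paths as above:

```bash
#!/bin/bash
# Event-driven variant: requires inotify-tools (sudo apt install inotify-tools)
SOURCE="/home/user/experiments/model_v1/checkpoints"
DEST="/mnt/backup/checkpoints/model_v1"

# Fire a sync each time a file finishes writing in the checkpoint directory
inotifywait -m -e close_write --format '%f' "$SOURCE" | \
while IFS= read -r file; do
    case "$file" in
        *.pt|*.pth|*.ckpt)
            rsync -avh "$SOURCE/$file" "$DEST/"
            ;;
    esac
done
```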
Keep Only Best Checkpoints
#!/bin/bash
# Keep only the 5 most recent checkpoints to save space
CHECKPOINT_DIR="/home/user/experiments/checkpoints"
KEEP_COUNT=5
# Find and keep only N most recent .pt files
cd "$CHECKPOINT_DIR"
ls -t *.pt | tail -n +$((KEEP_COUNT + 1)) | xargs -r rm
echo "Kept $KEEP_COUNT most recent checkpoints"
Checkpoint Sync Script
#!/usr/bin/env python3
"""
Sync only best checkpoints based on metrics
"""
import os
import shutil
import torch
from pathlib import Path
def sync_best_checkpoints(source_dir, dest_dir, metric='val_loss', keep_best=5):
"""
Copy only the best N checkpoints based on metric
"""
checkpoints = []
# Find all checkpoints
for ckpt_path in Path(source_dir).glob('*.pt'):
try:
# Load checkpoint metadata
ckpt = torch.load(ckpt_path, map_location='cpu')
if metric in ckpt:
checkpoints.append((ckpt_path, ckpt[metric]))
        except Exception:
            # Skip checkpoints that fail to load (corrupt or incompatible)
            continue
# Sort by metric (assuming lower is better for loss)
checkpoints.sort(key=lambda x: x[1])
# Copy best N checkpoints
Path(dest_dir).mkdir(parents=True, exist_ok=True)
for ckpt_path, metric_val in checkpoints[:keep_best]:
dest_path = Path(dest_dir) / ckpt_path.name
shutil.copy2(ckpt_path, dest_path)
print(f"Copied {ckpt_path.name} ({metric}={metric_val:.4f})")
if __name__ == "__main__":
sync_best_checkpoints(
source_dir="/experiments/checkpoints",
dest_dir="/backup/checkpoints",
metric="val_loss",
keep_best=5
)
Remote Synchronization
Sync to Remote Server
#!/bin/bash
# Sync local files to remote server via SSH
LOCAL_DIR="/home/user/datasets"
REMOTE_USER="username"
REMOTE_HOST="remote.server.com"
REMOTE_DIR="/data/datasets"
# Sync to remote
rsync -avhz --progress \
-e "ssh -p 22" \
"$LOCAL_DIR/" \
"${REMOTE_USER}@${REMOTE_HOST}:${REMOTE_DIR}/"
# With bandwidth limit (useful for slow connections);
# --bwlimit is in KiB/s, so 10000 is roughly 10 MB/s
rsync -avhz --progress \
    --bwlimit=10000 \
-e "ssh -p 22" \
"$LOCAL_DIR/" \
"${REMOTE_USER}@${REMOTE_HOST}:${REMOTE_DIR}/" #!/bin/bash
Bidirectional Sync with Unison

#!/bin/bash
# Bidirectional sync using unison
LOCAL_DIR="/home/user/code"
REMOTE="ssh://[email protected]//home/user/code"
# Install unison first: sudo apt install unison
# Sync both directions
unison "$LOCAL_DIR" "$REMOTE" \
-auto \
-batch \
-ignore 'Path .git' \
-ignore 'Name __pycache__' \
-ignore 'Name *.pyc'

Cloud Backup (AWS S3)
#!/bin/bash
# Sync datasets to AWS S3 for cloud backup
# Install AWS CLI: pip install awscli
LOCAL_DIR="/data/datasets/imagenet"
S3_BUCKET="s3://my-ml-backups/datasets/imagenet"
# Sync to S3 (only upload new/changed files);
# GLACIER is cheap for archives, but objects must be restored before download
aws s3 sync "$LOCAL_DIR" "$S3_BUCKET" \
    --storage-class GLACIER \
--exclude "*.tmp" \
--exclude ".cache/*"
# Download from S3
# aws s3 sync "$S3_BUCKET" "$LOCAL_DIR"
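To check what actually landed in the bucket, and because GLACIER objects need an explicit restore before `aws s3 sync` can fetch them again, something like the following helps (the object key shown is hypothetical):

```bash
# Inventory the backup prefix
aws s3 ls "$S3_BUCKET" --recursive --summarize --human-readable

# Restore a GLACIER object before downloading it, e.g.:
# aws s3api restore-object --bucket my-ml-backups \
#     --key datasets/imagenet/some_object.tar \
#     --restore-request Days=7
```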
Automated Backup Workflows
Cron Job Setup
# Edit crontab
crontab -e
# Add these lines for automated backups:
# Daily dataset backup at 2 AM
0 2 * * * /home/user/scripts/backup_dataset.sh >> /var/log/backup.log 2>&1
# Hourly checkpoint sync
0 * * * * /home/user/scripts/backup_checkpoints.sh >> /var/log/checkpoints.log 2>&1
# Weekly remote sync on Sundays at 3 AM
0 3 * * 0 /home/user/scripts/sync_to_remote.sh >> /var/log/remote_sync.log 2>&1
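If a backup can run longer than its cron interval, overlapping runs can step on each other. `flock` (standard in util-linux) makes an entry skip itself while the previous run still holds the lock:

```bash
# Same daily job, but skip this run if the previous one is still going
0 2 * * * flock -n /var/lock/backup_dataset.lock /home/user/scripts/backup_dataset.sh >> /var/log/backup.log 2>&1
```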
Complete Backup Script
#!/bin/bash
# Complete backup solution with logging and notifications
# Configuration
DATASETS="/data/datasets"
CHECKPOINTS="/home/user/experiments"
CODE="/home/user/code"
BACKUP_ROOT="/mnt/backup"
DATE=$(date +%Y%m%d_%H%M%S)
LOG_FILE="/var/log/backup_${DATE}.log"
# Email for notifications (requires mail configured)
NOTIFY_EMAIL="[email protected]"
# Logging function
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
# Start backup
log "Starting backup process..."
# 1. Backup datasets
log "Backing up datasets..."
rsync -avh --progress \
"$DATASETS/" "$BACKUP_ROOT/datasets/" \
>> "$LOG_FILE" 2>&1
# 2. Backup checkpoints
log "Backing up checkpoints..."
# Without a final --exclude='*', the include rules would be no-ops
rsync -avh --progress --prune-empty-dirs \
    --include='*/' \
    --include='*.pt' --include='*.pth' --include='*.ckpt' \
    --exclude='*' \
    "$CHECKPOINTS/" "$BACKUP_ROOT/checkpoints/" \
    >> "$LOG_FILE" 2>&1
# 3. Backup code
log "Backing up code..."
rsync -avh --progress \
--exclude='.git' --exclude='__pycache__' \
"$CODE/" "$BACKUP_ROOT/code/" \
>> "$LOG_FILE" 2>&1
# 4. Create compressed archive of critical files
log "Creating compressed archive..."
tar -czf "$BACKUP_ROOT/archives/critical_${DATE}.tar.gz" \
"$CODE" "$CHECKPOINTS" \
>> "$LOG_FILE" 2>&1
# Check whether the archive step succeeded ($? reflects only the last command)
if [ $? -eq 0 ]; then
log "Backup completed successfully!"
echo "Backup successful: $DATE" | mail -s "Backup Success" "$NOTIFY_EMAIL"
else
log "ERROR: Backup failed!"
echo "Backup FAILED: $DATE. Check $LOG_FILE" | mail -s "Backup FAILED" "$NOTIFY_EMAIL"
exit 1
fi
# Cleanup old backups (keep last 7 days)
log "Cleaning up old backups..."
find "$BACKUP_ROOT/archives" -name "critical_*.tar.gz" -mtime +7 -delete
log "Backup process complete."
Dataset Versioning
Create Dataset Snapshots
#!/bin/bash
# Create versioned snapshots of datasets
DATASET="/data/datasets/my_dataset"
SNAPSHOT_DIR="/data/snapshots"
VERSION=$(date +%Y%m%d)
# Create hard-link snapshot (fast, space-efficient)
cp -al "$DATASET" "$SNAPSHOT_DIR/my_dataset_v${VERSION}"
echo "Created snapshot: my_dataset_v${VERSION}"
Git-LFS for Code + Small Data
# For projects with code + small datasets
git lfs install
# Track large files
git lfs track "*.pth"
git lfs track "*.h5"
git lfs track "data/*.csv"
# Commit and push
git add .gitattributes
git commit -m "Setup Git LFS"
git push
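After pushing, you can confirm which files went through LFS rather than plain Git:

```bash
# List files stored as LFS pointers in the current checkout
git lfs ls-files

# Show LFS file status for the working tree
git lfs status
```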
Monitoring & Verification
Verify Backup Integrity
#!/bin/bash
# Verify backup matches source
SOURCE="/data/datasets/imagenet"
BACKUP="/mnt/backup/datasets/imagenet"
# Dry run with itemized changes: prints one line per difference, nothing when in sync
# (the verbose form always prints headers/summaries, which would trigger false alarms)
rsync -ani --delete "$SOURCE/" "$BACKUP/" > /tmp/backup_diff.txt
if [ -s /tmp/backup_diff.txt ]; then
echo "WARNING: Backup differs from source!"
cat /tmp/backup_diff.txt
exit 1
else
echo "Backup verified: matches source."
exit 0
fi
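The rsync dry run compares size and attributes quickly; for a slower but content-level check, build a checksum manifest on the source and verify it against the backup. A sketch assuming GNU coreutils:

```bash
#!/bin/bash
SOURCE="/data/datasets/imagenet"
BACKUP="/mnt/backup/datasets/imagenet"

# Hash every file in the source, then re-check the same paths in the backup
(cd "$SOURCE" && find . -type f -exec sha256sum {} + > /tmp/source.sha256)
(cd "$BACKUP" && sha256sum --check --quiet /tmp/source.sha256) \
    && echo "All checksums match."
```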
Check Backup Age
#!/bin/bash
# Alert if last backup is older than N days
BACKUP_DIR="/mnt/backup/datasets"
MAX_AGE_DAYS=2
LAST_BACKUP=$(find "$BACKUP_DIR" -type f -printf '%T@ %p\n' | sort -n | tail -1 | cut -f2- -d" ")
if [ -z "$LAST_BACKUP" ]; then
    echo "WARNING: no backup files found in $BACKUP_DIR"
    exit 1
fi
BACKUP_AGE_DAYS=$(( ($(date +%s) - $(stat -c %Y "$LAST_BACKUP")) / 86400 ))
if [ $BACKUP_AGE_DAYS -gt $MAX_AGE_DAYS ]; then
echo "WARNING: Last backup is $BACKUP_AGE_DAYS days old (threshold: $MAX_AGE_DAYS)"
exit 1
else
echo "Backup is up to date ($BACKUP_AGE_DAYS days old)"
exit 0
fi
Best Practices
- 3-2-1 Rule: 3 copies, 2 different media, 1 offsite
- Automate: Use cron for regular backups
- Verify: Test restoring from backups periodically
- Monitor: Set up alerts for backup failures
- Compress: Use compression for long-term storage
- Version: Keep multiple versions of critical data
:::tip[Test Your Backups]
A backup you haven’t tested restoring from is not a backup! Regularly practice restoring data from your backups.
:::
:::caution[Storage Space]
Monitor backup storage usage. Compressed datasets and checkpoint cleanup scripts help manage space.
:::
Related Resources
- File Permissions - Set correct backup permissions
- Dataset Scripts - Download and manage datasets
- Training Utilities - Checkpoint management during training
- HPC Storage - Backup strategies for clusters