
Dataset Download Scripts

Download popular deep learning datasets with ready-to-use bash scripts. Get ImageNet, COCO, CIFAR, MNIST, and other common datasets with automated download and extraction.

Overview

This page provides optimized scripts for downloading popular datasets. These scripts include:

  • Parallel downloads for faster completion
  • Automatic extraction and cleanup
  • Progress tracking
  • Error handling
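
To harden any of the scripts below, a common first step (a minimal sketch) is bash's fail-fast mode:

#!/bin/bash
set -euo pipefail   # abort on errors, unset variables, and failed pipeline stages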

COCO Dataset

The Microsoft COCO (Common Objects in Context) dataset is widely used for object detection, segmentation, and captioning tasks.

Dataset Size: ~18GB (train), ~1GB (val), ~6GB (test)
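
Before starting, check that the target filesystem has roughly twice the archive size free, since the zips and the extracted images coexist briefly:

# Show free space on the current filesystem
df -h .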

#!/bin/bash
# Download COCO 2017 dataset with all splits and annotations

# Create directory structure
mkdir -p coco/images
cd coco/images

# Download images (in parallel for speed)
echo "Downloading images..."
wget -q --show-progress http://images.cocodataset.org/zips/train2017.zip &
wget -q --show-progress http://images.cocodataset.org/zips/val2017.zip &
wget -q --show-progress http://images.cocodataset.org/zips/test2017.zip &
wait

# Extract images
echo "Extracting images..."
unzip -q train2017.zip &
unzip -q val2017.zip &
unzip -q test2017.zip &
wait

# Clean up zips
rm train2017.zip val2017.zip test2017.zip

# Download annotations
cd ../
echo "Downloading annotations..."
wget -q --show-progress http://images.cocodataset.org/annotations/annotations_trainval2017.zip
wget -q --show-progress http://images.cocodataset.org/annotations/stuff_annotations_trainval2017.zip
wget -q --show-progress http://images.cocodataset.org/annotations/image_info_test2017.zip

# Extract annotations
echo "Extracting annotations..."
unzip -q annotations_trainval2017.zip
unzip -q stuff_annotations_trainval2017.zip
unzip -q image_info_test2017.zip

# Clean up
rm annotations_trainval2017.zip stuff_annotations_trainval2017.zip image_info_test2017.zip

echo "COCO dataset download complete!"
echo "Location: $(pwd)"

If you don't need the test split, this lighter variant downloads train and val only:

#!/bin/bash
# Download COCO train and val only (~19GB)

mkdir -p coco/images
cd coco/images

# Download train and val only
echo "Downloading train and val images..."
wget -q --show-progress http://images.cocodataset.org/zips/train2017.zip &
wget -q --show-progress http://images.cocodataset.org/zips/val2017.zip &
wait

# Extract
echo "Extracting..."
unzip -q train2017.zip &
unzip -q val2017.zip &
wait
rm train2017.zip val2017.zip

# Download annotations
cd ../
wget -q --show-progress http://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip -q annotations_trainval2017.zip
rm annotations_trainval2017.zip

echo "COCO train+val download complete!"

ImageNet

ImageNet requires registration. Use these scripts after obtaining download credentials.

:::caution[Registration Required]
ImageNet requires accepting the terms of use and obtaining download credentials from image-net.org.
:::

#!/bin/bash
# Requires: username and access key from ImageNet

# Set your credentials
USERNAME="your_username"
ACCESS_KEY="your_access_key"

# Download training data (138GB)
wget --user="$USERNAME" --password="$ACCESS_KEY" \
  https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_train.tar

# Download validation data (6.3GB)
wget --user="$USERNAME" --password="$ACCESS_KEY" \
  https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar

# Extract training data
mkdir -p train && tar -xf ILSVRC2012_img_train.tar -C train/
cd train
for f in *.tar; do
  d=$(basename "$f" .tar)
  mkdir -p "$d"
  tar -xf "$f" -C "$d"
  rm "$f"
done
cd ..

# Extract validation data
mkdir -p val && tar -xf ILSVRC2012_img_val.tar -C val/
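# Note: the val tar extracts as a flat directory of JPEGs with no class
# subfolders; if your pipeline expects an ImageFolder layout, sort them
# with a label-mapping script (e.g., the widely used valprep.sh)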

echo "ImageNet download complete!"

CIFAR-10 / CIFAR-100

These small datasets (~170MB each) download quickly and are usually fetched directly through PyTorch or TensorFlow.

With PyTorch (torchvision):

import torchvision
import torchvision.transforms as transforms

# CIFAR-10 (60,000 images, ~170MB)
trainset = torchvision.datasets.CIFAR10(
    root='./data',
    train=True,
    download=True,
    transform=transforms.ToTensor()
)

# CIFAR-100 (60,000 images, ~170MB)
trainset = torchvision.datasets.CIFAR100(
    root='./data',
    train=True,
    download=True,
    transform=transforms.ToTensor()
)

Or with TensorFlow/Keras:

import tensorflow as tf

# CIFAR-10
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

# CIFAR-100
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar100.load_data()
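
If you prefer the raw archives without a framework, the originals are hosted on the CIFAR page (URLs assumed current; verify there if they fail):

#!/bin/bash
# Download and extract the original CIFAR python archives
wget -c https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
wget -c https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz
tar -xzf cifar-10-python.tar.gz
tar -xzf cifar-100-python.tar.gz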

Generic Download Script Template

Use this template for custom datasets:

#!/bin/bash
# Template for downloading any dataset

# Configuration
DATASET_NAME="my_dataset"
DATASET_URL="https://example.com/dataset.zip"
TARGET_DIR="./datasets/${DATASET_NAME}"

# Create directory
mkdir -p "$TARGET_DIR"
cd "$TARGET_DIR"

# Download with progress bar and resume capability
wget -c --show-progress "$DATASET_URL"

# Extract based on file type
FILENAME=$(basename "$DATASET_URL")
case "$FILENAME" in
  *.zip)
    echo "Extracting ZIP..."
    unzip -q "$FILENAME"
    ;;
  *.tar.gz|*.tgz)
    echo "Extracting TAR.GZ..."
    tar -xzf "$FILENAME"
    ;;
  *.tar)
    echo "Extracting TAR..."
    tar -xf "$FILENAME"
    ;;
  *)
    echo "Unrecognized archive type: $FILENAME" >&2
    exit 1
    ;;
esac

# Clean up archive
rm "$FILENAME"

echo "Dataset downloaded to: $TARGET_DIR"

Tips for Large Downloads

Using aria2 for Faster Downloads

# Install aria2 (multi-connection downloads, often much faster than wget)
sudo apt install aria2

# Download with multiple connections
aria2c -x 16 -s 16 http://images.cocodataset.org/zips/train2017.zip

Resume Interrupted Downloads

# wget automatically resumes with -c flag
wget -c http://example.com/large-dataset.zip

# aria2 resumes automatically
aria2c -x 16 http://example.com/large-dataset.zip

Check Downloaded File Integrity

# If dataset provides checksums
md5sum downloaded_file.zip
sha256sum downloaded_file.zip

# Compare with the published checksum (note the two spaces between hash and filename)
echo "expected_checksum  downloaded_file.zip" | md5sum -c -

img2dataset: Large-Scale Image Dataset Creation

img2dataset is a powerful tool for downloading large-scale image datasets from URLs. It can download, resize, and package 100M+ images efficiently.

Key Features

  • Fast: Download 100M URLs in ~20 hours on one machine
  • Formats: WebDataset (recommended), files, parquet, tfrecord
  • Distributed: Multi-processing and multi-node support
  • Resume: Incremental downloads if interrupted
  • Compliant: Respects robots.txt and no-AI headers

Installation

pip install img2dataset

Basic Usage

# Create a CSV with columns: url, caption (optional)
# Format: url,caption
# https://example.com/image1.jpg,A beautiful sunset
# https://example.com/image2.jpg,Mountain landscape

img2dataset \
  --url_list=urls.csv \
  --output_folder=dataset \
  --thread_count=64 \
  --image_size=256 \
  --output_format=webdataset \
  --input_format=csv \
  --url_col=url \
  --caption_col=caption

Example: CC3M

# Download CC3M dataset (~1 hour)
wget https://storage.googleapis.com/conceptual_12m/cc3m.tsv

img2dataset \
  --url_list=cc3m.tsv \
  --input_format=tsv \
  --url_col=url \
  --caption_col=caption \
  --output_folder=cc3m \
  --processes_count=16 \
  --thread_count=64 \
  --image_size=256 \
  --output_format=webdataset

Multi-node example

# On each node, set a different partition
# Node 1: --distributor=multiprocessing --subjob_size=1000 --processes_count=1 --partition_id=0
# Node 2: --distributor=multiprocessing --subjob_size=1000 --processes_count=1 --partition_id=1

img2dataset \
  --url_list=large_dataset.parquet \
  --output_folder=output \
  --processes_count=1 \
  --thread_count=256 \
  --image_size=384 \
  --distributor=multiprocessing \
  --partition_id=0 \
  --partitions_number=10

Common Options

# Image processing
--image_size=256              # Resize images
--resize_mode=border          # border, center_crop, keep_ratio, etc.
--resize_only_if_bigger=True  # Don't upscale small images

# Performance
--processes_count=16          # Number of processes
--thread_count=64             # Threads per process
--distributor=multiprocessing

# Output format
--output_format=webdataset    # webdataset, tfrecord, parquet, files
--output_folder=dataset

# Resume interrupted downloads
--incremental_mode=incremental  # Resume from where it stopped

# Filtering
--min_image_size=200          # Skip images smaller than 200px
--max_image_area=262144       # Skip very large images (512*512)

# Monitoring
--enable_wandb=True           # Track progress with Weights & Biases

Example: LAION-400M Subset

# Download metadata
wget https://the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta/part-00000-5b54c5d5-bbcf-484d-a2ce-0d6f73df1a36-c000.snappy.parquet

# Download images
img2dataset \
  --url_list=part-00000-5b54c5d5-bbcf-484d-a2ce-0d6f73df1a36-c000.snappy.parquet \
  --input_format=parquet \
  --url_col=URL \
  --caption_col=TEXT \
  --output_format=webdataset \
  --output_folder=laion400m \
  --processes_count=16 \
  --thread_count=128 \
  --image_size=384 \
  --resize_mode=keep_ratio \
  --resize_only_if_bigger=True \
  --enable_wandb=True

Loading Downloaded Dataset

import webdataset as wds
from torch.utils.data import DataLoader
from torchvision import transforms

# Create dataset
dataset = (
    wds.WebDataset("dataset/{00000..00099}.tar")
    .decode("pil")
    .to_tuple("jpg;png", "txt")
    .map_tuple(transforms.ToTensor(), lambda x: x)
)

# Create dataloader
dataloader = DataLoader(dataset, batch_size=32, num_workers=4)

for images, captions in dataloader:
    # Train your model
    pass

To feed the same shards into TensorFlow, wrap the WebDataset in a generator. Decoding with "rgb8" yields uint8 numpy arrays that from_generator can convert to tensors (PIL images cannot be converted directly):

import tensorflow as tf
import webdataset as wds

dataset = wds.WebDataset("dataset/{00000..00099}.tar")
dataset = dataset.decode("rgb8").to_tuple("jpg;png", "txt")

# Convert to TensorFlow dataset
tf_dataset = tf.data.Dataset.from_generator(
    lambda: dataset,
    output_signature=(
        tf.TensorSpec(shape=(None, None, 3), dtype=tf.uint8),
        tf.TensorSpec(shape=(), dtype=tf.string)
    )
)

Common Use Cases

1. Create Custom Dataset from URLs

# Your own list of image URLs
img2dataset --url_list=my_urls.txt --output_folder=my_dataset

2. Download Public Datasets

  • CC3M: 3M image-text pairs (~1 hour)
  • CC12M: 12M pairs (~5 hours)
  • LAION-400M: 400M pairs (distributed)
  • LAION-5B: 5B pairs (distributed cluster)

3. Web Scraping Results

Convert web scraping results into training datasets, as sketched below.
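
A minimal sketch of that conversion (requires jq, and assumes the scraper emitted JSON-lines with hypothetical image_url and alt_text fields; match these to your scraper's output):

# Convert scraped JSON-lines into the url,caption CSV img2dataset consumes
echo "url,caption" > urls.csv
jq -r '[.image_url, (.alt_text // "")] | @csv' scraped.jsonl >> urls.csv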

Tips

:::tip[Performance Optimization]

  • Use --thread_count=128 or higher for network-bound downloads
  • Use --processes_count=16-32 based on your CPU cores
  • Use --distributor=multiprocessing on a single machine
  • For clusters, use --distributor=pyspark

:::

:::caution[Respect robots.txt]
img2dataset respects robots.txt by default and skips images served with noai, noindex, or noimageai headers. Keep these defaults enabled to be a good internet citizen.
:::

Monitoring Progress

# Enable Weights & Biases tracking
img2dataset --url_list=urls.csv --enable_wandb=True --wandb_project=my_dataset

# Or check output folder
ls -lh dataset/
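
img2dataset also writes one *_stats.json file per shard; summing their "successes" fields gives a quick tally (field name assumed from the shard stats format; requires jq):

# Sum successful downloads across all shard stats files
cat dataset/*_stats.json | jq -s 'map(.successes) | add'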
