
Dataset Download Scripts

Download popular deep learning datasets with ready-to-use bash scripts. Get ImageNet, COCO, CIFAR, MNIST, and other common datasets with automated download and extraction.

Overview

This page provides optimized scripts for downloading popular datasets. These scripts include:

  • Parallel downloads for faster completion
  • Automatic extraction and cleanup
  • Progress tracking
  • Error handling
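
To harden any of the scripts below, a common first step (a minimal sketch) is bash's fail-fast mode:

#!/bin/bash
set -euo pipefail   # abort on errors, unset variables, and failed pipeline stages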

COCO Dataset

The Microsoft COCO (Common Objects in Context) dataset is widely used for object detection, segmentation, and captioning tasks.

Dataset Size: ~18GB (train), ~1GB (val), ~6GB (test)
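
Before starting, check that the target filesystem has roughly twice the archive size free, since the zips and the extracted images coexist briefly:

# Show free space on the current filesystem
df -h .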

#!/bin/bash
# Download COCO 2017 dataset with all splits and annotations

# Create directory structure
mkdir -p coco/images
cd coco/images

# Download images (in parallel for speed)
echo "Downloading images..."
wget -q --show-progress http://images.cocodataset.org/zips/train2017.zip &
wget -q --show-progress http://images.cocodataset.org/zips/val2017.zip &
wget -q --show-progress http://images.cocodataset.org/zips/test2017.zip &
wait

# Extract images
echo "Extracting images..."
unzip -q train2017.zip &
unzip -q val2017.zip &
unzip -q test2017.zip &
wait

# Clean up zips
rm train2017.zip val2017.zip test2017.zip

# Download annotations
cd ../
echo "Downloading annotations..."
wget -q --show-progress http://images.cocodataset.org/annotations/annotations_trainval2017.zip
wget -q --show-progress http://images.cocodataset.org/annotations/stuff_annotations_trainval2017.zip
wget -q --show-progress http://images.cocodataset.org/annotations/image_info_test2017.zip

# Extract annotations
echo "Extracting annotations..."
unzip -q annotations_trainval2017.zip
unzip -q stuff_annotations_trainval2017.zip
unzip -q image_info_test2017.zip

# Clean up
rm annotations_trainval2017.zip stuff_annotations_trainval2017.zip image_info_test2017.zip

echo "COCO dataset download complete!"
echo "Location: $(pwd)"

If you don't need the test split, this lighter variant downloads train and val only:

#!/bin/bash
# Download COCO train and val only (~19GB)

mkdir -p coco/images
cd coco/images

# Download train and val only
echo "Downloading train and val images..."
wget -q --show-progress http://images.cocodataset.org/zips/train2017.zip &
wget -q --show-progress http://images.cocodataset.org/zips/val2017.zip &
wait

# Extract
echo "Extracting..."
unzip -q train2017.zip &
unzip -q val2017.zip &
wait
rm train2017.zip val2017.zip

# Download annotations
cd ../
wget -q --show-progress http://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip -q annotations_trainval2017.zip
rm annotations_trainval2017.zip

echo "COCO train+val download complete!"

ImageNet

ImageNet requires registration. Use these scripts after obtaining download credentials.

:::caution[Registration Required]
ImageNet requires accepting the terms of use and obtaining download credentials from image-net.org.
:::

#!/bin/bash
# Requires: username and access key from ImageNet

# Set your credentials
USERNAME="your_username"
ACCESS_KEY="your_access_key"

# Download training data (138GB)
wget --user="$USERNAME" --password="$ACCESS_KEY" \
  https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_train.tar

# Download validation data (6.3GB)
wget --user="$USERNAME" --password="$ACCESS_KEY" \
  https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar

# Extract training data
mkdir -p train && tar -xf ILSVRC2012_img_train.tar -C train/
cd train
for f in *.tar; do
  d=$(basename "$f" .tar)
  mkdir -p "$d"
  tar -xf "$f" -C "$d"
  rm "$f"
done
cd ..

# Extract validation data
mkdir -p val && tar -xf ILSVRC2012_img_val.tar -C val/
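# Note: the val tar extracts as a flat directory of JPEGs with no class
# subfolders; if your pipeline expects an ImageFolder layout, sort them
# with a label-mapping script (e.g., the widely used valprep.sh)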

echo "ImageNet download complete!"

CIFAR-10 / CIFAR-100

These small datasets (~170MB each) download quickly and are usually fetched directly through PyTorch or TensorFlow.

With PyTorch (torchvision):

import torchvision
import torchvision.transforms as transforms

# CIFAR-10 (60,000 images, ~170MB)
trainset = torchvision.datasets.CIFAR10(
    root='./data',
    train=True,
    download=True,
    transform=transforms.ToTensor()
)

# CIFAR-100 (60,000 images, ~170MB)
trainset = torchvision.datasets.CIFAR100(
    root='./data',
    train=True,
    download=True,
    transform=transforms.ToTensor()
)

Or with TensorFlow/Keras:

import tensorflow as tf

# CIFAR-10
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

# CIFAR-100
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar100.load_data()
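
If you prefer the raw archives without a framework, the originals are hosted on the CIFAR page (URLs assumed current; verify there if they fail):

#!/bin/bash
# Download and extract the original CIFAR python archives
wget -c https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
wget -c https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz
tar -xzf cifar-10-python.tar.gz
tar -xzf cifar-100-python.tar.gz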

Generic Download Script Template

Use this template for custom datasets:

#!/bin/bash
# Template for downloading any dataset

# Configuration
DATASET_NAME="my_dataset"
DATASET_URL="https://example.com/dataset.zip"
TARGET_DIR="./datasets/${DATASET_NAME}"

# Create directory
mkdir -p "$TARGET_DIR"
cd "$TARGET_DIR"

# Download with progress bar and resume capability
wget -c --show-progress "$DATASET_URL"

# Extract based on file type
FILENAME=$(basename "$DATASET_URL")
case "$FILENAME" in
  *.zip)
    echo "Extracting ZIP..."
    unzip -q "$FILENAME"
    ;;
  *.tar.gz|*.tgz)
    echo "Extracting TAR.GZ..."
    tar -xzf "$FILENAME"
    ;;
  *.tar)
    echo "Extracting TAR..."
    tar -xf "$FILENAME"
    ;;
  *)
    echo "Unrecognized archive type: $FILENAME" >&2
    exit 1
    ;;
esac

# Clean up archive
rm "$FILENAME"

echo "Dataset downloaded to: $TARGET_DIR"

Tips for Large Downloads

Using aria2 for Faster Downloads

# Install aria2 (multi-connection downloads, often much faster than wget)
sudo apt install aria2

# Download with multiple connections
aria2c -x 16 -s 16 http://images.cocodataset.org/zips/train2017.zip

Resume Interrupted Downloads

# wget automatically resumes with -c flag
wget -c http://example.com/large-dataset.zip

# aria2 resumes automatically
aria2c -x 16 http://example.com/large-dataset.zip

Check Downloaded File Integrity

# If dataset provides checksums
md5sum downloaded_file.zip
sha256sum downloaded_file.zip

# Compare with the published checksum (note the two spaces between hash and filename)
echo "expected_checksum  downloaded_file.zip" | md5sum -c -

img2dataset: Large-Scale Image Dataset Creation

img2dataset is a powerful tool for downloading large-scale image datasets from URLs. It can download, resize, and package 100M+ images efficiently.

Key Features

  • Fast: Download 100M URLs in ~20 hours on one machine
  • Formats: WebDataset (recommended), files, parquet, tfrecord
  • Distributed: Multi-processing and multi-node support
  • Resume: Incremental downloads if interrupted
  • Compliant: Respects robots.txt and no-AI headers

Installation

pip install img2dataset

Basic Usage

# Create a CSV with columns: url, caption (optional)
# Format: url,caption
# https://example.com/image1.jpg,A beautiful sunset
# https://example.com/image2.jpg,Mountain landscape

img2dataset \
  --url_list=urls.csv \
  --output_folder=dataset \
  --thread_count=64 \
  --image_size=256 \
  --output_format=webdataset \
  --input_format=csv \
  --url_col=url \
  --caption_col=caption

Example: CC3M

# Download CC3M dataset (~1 hour)
wget https://storage.googleapis.com/conceptual_12m/cc3m.tsv

img2dataset \
  --url_list=cc3m.tsv \
  --input_format=tsv \
  --url_col=url \
  --caption_col=caption \
  --output_folder=cc3m \
  --processes_count=16 \
  --thread_count=64 \
  --image_size=256 \
  --output_format=webdataset

Multi-node example

# On each node, set a different partition
# Node 1: --distributor=multiprocessing --subjob_size=1000 --processes_count=1 --partition_id=0
# Node 2: --distributor=multiprocessing --subjob_size=1000 --processes_count=1 --partition_id=1

img2dataset \
  --url_list=large_dataset.parquet \
  --output_folder=output \
  --processes_count=1 \
  --thread_count=256 \
  --image_size=384 \
  --distributor=multiprocessing \
  --partition_id=0 \
  --partitions_number=10

Common Options

# Image processing
--image_size=256              # Resize images
--resize_mode=border          # border, center_crop, keep_ratio, etc.
--resize_only_if_bigger=True  # Don't upscale small images

# Performance
--processes_count=16          # Number of processes
--thread_count=64             # Threads per process
--distributor=multiprocessing

# Output format
--output_format=webdataset    # webdataset, tfrecord, parquet, files
--output_folder=dataset

# Resume interrupted downloads
--incremental_mode=incremental  # Resume from where it stopped

# Filtering
--min_image_size=200          # Skip images smaller than 200px
--max_image_area=262144       # Skip very large images (512*512)

# Monitoring
--enable_wandb=True           # Track progress with Weights & Biases

Example: LAION-400M Subset

# Download metadata
wget https://the-eye.eu/public/AI/cah/laion400m-met-release/laion400m-meta/part-00000-5b54c5d5-bbcf-484d-a2ce-0d6f73df1a36-c000.snappy.parquet

# Download images
img2dataset \
  --url_list=part-00000-5b54c5d5-bbcf-484d-a2ce-0d6f73df1a36-c000.snappy.parquet \
  --input_format=parquet \
  --url_col=URL \
  --caption_col=TEXT \
  --output_format=webdataset \
  --output_folder=laion400m \
  --processes_count=16 \
  --thread_count=128 \
  --image_size=384 \
  --resize_mode=keep_ratio \
  --resize_only_if_bigger=True \
  --enable_wandb=True

Loading Downloaded Dataset

import webdataset as wds
from torch.utils.data import DataLoader
from torchvision import transforms

# Create dataset
dataset = (
    wds.WebDataset("dataset/{00000..00099}.tar")
    .decode("pil")
    .to_tuple("jpg;png", "txt")
    .map_tuple(transforms.ToTensor(), lambda x: x)
)

# Create dataloader
dataloader = DataLoader(dataset, batch_size=32, num_workers=4)

for images, captions in dataloader:
    # Train your model
    pass

To feed the same shards into TensorFlow, wrap the WebDataset in a generator. Decoding with "rgb8" yields uint8 numpy arrays that from_generator can convert to tensors (PIL images cannot be converted directly):

import tensorflow as tf
import webdataset as wds

dataset = wds.WebDataset("dataset/{00000..00099}.tar")
dataset = dataset.decode("rgb8").to_tuple("jpg;png", "txt")

# Convert to TensorFlow dataset
tf_dataset = tf.data.Dataset.from_generator(
    lambda: dataset,
    output_signature=(
        tf.TensorSpec(shape=(None, None, 3), dtype=tf.uint8),
        tf.TensorSpec(shape=(), dtype=tf.string)
    )
)

Common Use Cases

1. Create Custom Dataset from URLs

# Your own list of image URLs
img2dataset --url_list=my_urls.txt --output_folder=my_dataset

2. Download Public Datasets

  • CC3M: 3M image-text pairs (~1 hour)
  • CC12M: 12M pairs (~5 hours)
  • LAION-400M: 400M pairs (distributed)
  • LAION-5B: 5B pairs (distributed cluster)

3. Web Scraping Results

Convert web scraping results into training datasets, as sketched below.
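
A minimal sketch of that conversion (requires jq, and assumes the scraper emitted JSON-lines with hypothetical image_url and alt_text fields; match these to your scraper's output):

# Convert scraped JSON-lines into the url,caption CSV img2dataset consumes
echo "url,caption" > urls.csv
jq -r '[.image_url, (.alt_text // "")] | @csv' scraped.jsonl >> urls.csv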

Tips

:::tip[Performance Optimization]

  • Use --thread_count=128 or higher for network-bound downloads
  • Use --processes_count=16-32 based on your CPU cores
  • Use --distributor=multiprocessing on a single machine
  • For clusters, use --distributor=pyspark

:::

:::caution[Respect robots.txt]
img2dataset respects robots.txt by default and skips images served with noai, noindex, or noimageai headers. Keep these defaults enabled to be a good internet citizen.
:::

Monitoring Progress

# Enable Weights & Biases tracking
img2dataset --url_list=urls.csv --enable_wandb=True --wandb_project=my_dataset

# Or check output folder
ls -lh dataset/
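
img2dataset also writes one *_stats.json file per shard; summing their "successes" fields gives a quick tally (field name assumed from the shard stats format; requires jq):

# Sum successful downloads across all shard stats files
cat dataset/*_stats.json | jq -s 'map(.successes) | add'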
