Cluster Distributed Runs
This page shows supported patterns for running DeepForest across multiple GPUs and multiple nodes on a Slurm-managed cluster (for example HiPerGator).
Slurm: sbatch and srun
sbatch submits the batch script and requests the allocation (nodes, GPUs, tasks, memory, time). srun inside that batch script starts a job step within the same allocation. It does not submit a second job, and the allocation is not charged twice.
Match #SBATCH --ntasks-per-node to devices (one Slurm task per GPU) and #SBATCH --nodes to num_nodes. For multi-GPU DDP, launch with srun so that Slurm starts one process per task. For a single GPU, the cluster train script runs the command directly in the batch step.
Example launchers live under src/deepforest/scripts/HPC/.
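The matching rule above can be checked before any GPU time is spent. The following is a minimal sketch, not part of the DeepForest launchers: check_layout is a hypothetical helper that compares the devices and num_nodes values you plan to pass against SLURM_NTASKS_PER_NODE and SLURM_JOB_NUM_NODES, which Slurm exports inside an allocation (the former only when --ntasks-per-node was requested).

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; not part of the DeepForest launchers.
# Usage: check_layout <devices> <num_nodes>
check_layout() {
    local devices="$1"
    local num_nodes="$2"
    # Slurm exports these inside an allocation; fall back to 1 when unset.
    local tasks_per_node="${SLURM_NTASKS_PER_NODE:-1}"
    local nodes="${SLURM_JOB_NUM_NODES:-1}"

    if [ "$devices" -ne "$tasks_per_node" ]; then
        echo "devices=$devices but --ntasks-per-node=$tasks_per_node" >&2
        return 1
    fi
    if [ "$num_nodes" -ne "$nodes" ]; then
        echo "num_nodes=$num_nodes but --nodes=$nodes" >&2
        return 1
    fi
    echo "layout ok: $nodes node(s) x $tasks_per_node task(s)"
}
```

Calling check_layout 2 2 near the top of a batch script, before the srun line, stops a mismatched job early instead of letting DDP hang while it waits for processes that were never started.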
Environment
ml conda
eval "$(conda shell.bash hook)"
conda activate predict
cd /path/to/DeepForest
mkdir -p slurm_logs
Train
Use src/deepforest/scripts/HPC/run_cluster_train.sbatch for both production training and smoke tests. The accompanying launcher script is run_cluster_train.sh.
Production training (single GPU)
The defaults are TRAIN_MODE=train and CONFIG_NAME=bird. Submit from the repo root:
sbatch src/deepforest/scripts/HPC/run_cluster_train.sbatch
Hydra overrides and resume:
export COMET_EXPERIMENT_NAME="exp_lr_0.0005"
sbatch src/deepforest/scripts/HPC/run_cluster_train.sbatch train.lr=0.0005 train.epochs=80
RESUME_CKPT=/path/to/last.ckpt sbatch src/deepforest/scripts/HPC/run_cluster_train.sbatch
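The override pattern above extends to a small sweep, with one job per value. This is a sketch under assumptions: submit_sweep and DRY_RUN are illustrative names rather than DeepForest features, and the dry-run default only prints the commands so they can be inspected before anything is submitted.

```shell
#!/usr/bin/env bash
# Sketch: submit one training job per learning rate.
# submit_sweep and DRY_RUN are illustrative names, not DeepForest features.
SBATCH_FILE="src/deepforest/scripts/HPC/run_cluster_train.sbatch"

submit_sweep() {
    local lr
    for lr in "$@"; do
        # Give each job its own Comet experiment name, as in the example above.
        export COMET_EXPERIMENT_NAME="exp_lr_${lr}"
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Print the command instead of submitting it.
            echo "COMET_EXPERIMENT_NAME=${COMET_EXPERIMENT_NAME} sbatch ${SBATCH_FILE} train.lr=${lr}"
        else
            sbatch "${SBATCH_FILE}" "train.lr=${lr}"
        fi
    done
}
```

Running submit_sweep 0.0005 0.001 prints two commands. Setting DRY_RUN=0 submits them, relying on sbatch's default behavior of propagating the exported environment to the job.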
For multi-GPU or multi-node training, set the Slurm resources at submit time and pass matching Hydra overrides (devices, num_nodes) if needed. The script infers SCENARIO from the allocation.
sbatch --nodes=2 --ntasks-per-node=2 --gpus-per-node=2 --cpus-per-task=8 --mem=128G --time=15:00:00 \
src/deepforest/scripts/HPC/run_cluster_train.sbatch \
--strategy ddp devices=2 num_nodes=2
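To make the idea of inferring SCENARIO from the allocation concrete, here is one way a launcher could do it. This is illustrative only: the actual logic lives in run_cluster_train.sh and may differ. The function name infer_scenario is hypothetical. The scenario names are the ones used for the smoke tests on this page.

```shell
#!/usr/bin/env bash
# Illustrative only: map a Slurm allocation to a scenario name.
# The real inference in run_cluster_train.sh may use different rules.
infer_scenario() {
    # Slurm exports these inside an allocation; fall back to 1 when unset.
    local nodes="${SLURM_JOB_NUM_NODES:-1}"
    local tasks_per_node="${SLURM_NTASKS_PER_NODE:-1}"

    if [ "$nodes" -gt 1 ]; then
        echo "multinode"
    elif [ "$tasks_per_node" -gt 1 ]; then
        echo "multigpu"
    else
        echo "1gpu"
    fi
}
```

With the allocation requested above (two nodes, two tasks per node), this sketch would report multinode.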
Smoke tests
Smoke tests use bundled OSBS sample data (TRAIN_MODE=smoke, CONFIG_NAME=smoke, 1 epoch). Set SCENARIO and request matching Slurm resources on the sbatch command line:
# 1 GPU
TRAIN_MODE=smoke SCENARIO=1gpu sbatch --nodes=1 --ntasks-per-node=1 --gpus-per-node=1 \
--cpus-per-task=8 --mem=32G --time=00:30:00 \
src/deepforest/scripts/HPC/run_cluster_train.sbatch
# Multi-GPU (one node)
TRAIN_MODE=smoke SCENARIO=multigpu GPUS_PER_NODE=2 sbatch --nodes=1 --ntasks-per-node=2 --gpus-per-node=2 \
--cpus-per-task=8 --mem=64G --time=00:45:00 \
src/deepforest/scripts/HPC/run_cluster_train.sbatch
# Multi-node
TRAIN_MODE=smoke SCENARIO=multinode GPUS_PER_NODE=2 NNODES=2 sbatch --nodes=2 --ntasks-per-node=2 --gpus-per-node=2 \
--cpus-per-task=8 --mem=64G --time=01:00:00 \
src/deepforest/scripts/HPC/run_cluster_train.sbatch
Optionally, export COMET_EXPERIMENT_NAME="my-smoke-run" before calling sbatch to name the run. To disable Comet logging, set USE_COMET=0.
Train directly in a batch script
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --gpus-per-node=2
srun uv run deepforest train \
--strategy ddp \
accelerator=gpu \
devices=2 \
num_nodes=2 \
train.csv_file=/path/to/train.csv \
train.root_dir=/path/to/train_images \
validation.csv_file=/path/to/val.csv \
validation.root_dir=/path/to/val_images
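In the script above, devices and num_nodes repeat numbers already given in the #SBATCH header, so the two can drift apart when one is edited. A minimal sketch of deriving the overrides from Slurm's own environment variables follows. The helper name ddp_overrides is hypothetical, and it assumes one task per GPU as recommended earlier on this page.

```shell
#!/usr/bin/env bash
# Sketch: build the device overrides from the allocation itself.
# ddp_overrides is a hypothetical helper, not a DeepForest command.
ddp_overrides() {
    # One Slurm task per GPU, so tasks per node doubles as the device count.
    local devices="${SLURM_NTASKS_PER_NODE:-1}"
    local num_nodes="${SLURM_JOB_NUM_NODES:-1}"
    echo "accelerator=gpu devices=${devices} num_nodes=${num_nodes}"
}

# Inside the batch script the output can be spliced into the command, e.g.:
#   srun uv run deepforest train --strategy ddp $(ddp_overrides) \
#       train.csv_file=/path/to/train.csv
```

The command substitution is left unquoted on purpose so that the three overrides are passed as separate arguments.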
Evaluate
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --gpus-per-node=2
srun uv run deepforest evaluate \
/path/to/ground_truth.csv \
--root-dir /path/to/images \
--save-predictions eval_preds.csv \
-o eval_metrics.csv \
--strategy ddp \
accelerator=gpu \
devices=2 \
num_nodes=2
Predict From CSV
To run the cluster regression test, which also serves as an example launcher, submit from the repo root:
sbatch src/deepforest/scripts/HPC/run_cluster_predict_test.sbatch
To run your own CSV prediction job directly:
srun uv run deepforest predict \
/path/to/images.csv \
--mode csv \
--root-dir /path/to/images \
-o predictions.csv \
--strategy ddp \
accelerator=gpu \
devices=2 \
num_nodes=2
Predict A Large Tile
For large rasters on a cluster, prefer predict_tile(..., dataloader_strategy="window").
The ready-to-run test launcher is:
sbatch src/deepforest/scripts/HPC/run_cluster_predict_tile_test.sbatch
To run a tiled prediction job directly:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --gpus-per-node=2
srun uv run python tests/cluster_predict_tile_driver.py \
--input-path /path/to/tile.tif \
--output-path tile_predictions.csv \
--model-name weecology/everglades-bird-species-detector \
--patch-size 1500 \
--patch-overlap 0 \
--dataloader-strategy window \
--devices 2 \
--num-nodes 2
See also the multi-GPU and multi-node guide.