library(petrographer)
library(dplyr)
library(ggplot2)
library(fs)
This vignette walks through training custom object detection models for petrographic analysis. It covers both local training (accessible to all users) and HPC training (specific to University of Florida’s HiPerGator, but adaptable to other SLURM-based systems).
Prerequisites
Before training, ensure you have:
- Dataset in COCO format with train/valid splits
- Python environment with detectron2, sahi, torch installed
- GPU access (recommended) - training on CPU is slow but possible
Dataset Preparation
Validate Your Dataset
Use validate_dataset() to check your dataset structure and COCO format:
# Validate dataset structure
validate_dataset("data/processed/my_dataset")
Your dataset should have this structure:
my_dataset/
├── train/
│   ├── _annotations.coco.json
│   └── *.jpg
└── valid/
    ├── _annotations.coco.json
    └── *.jpg
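Before running the package validator, a quick look at the folder layout can catch obvious mistakes. The base-R sketch below is only an illustration: `check_layout()` is a hypothetical helper, not a petrographer function, and it checks file presence only, not the COCO JSON contents.

```r
# Hypothetical helper (not part of petrographer): report which of the
# expected folders and files exist under a dataset directory.
check_layout <- function(data_dir) {
  splits <- c("train", "valid")
  split_dirs <- file.path(data_dir, splits)
  # Count .jpg files in each split (0 if the folder is missing)
  n_jpg <- vapply(
    split_dirs,
    function(d) length(list.files(d, pattern = "\\.jpg$", ignore.case = TRUE)),
    integer(1)
  )
  data.frame(
    split = splits,
    has_dir = dir.exists(split_dirs),
    has_annotations = file.exists(file.path(split_dirs, "_annotations.coco.json")),
    n_images = unname(n_jpg)
  )
}

# Demo on a throwaway directory that mimics the tree above
demo_dir <- file.path(tempdir(), "my_dataset")
for (s in c("train", "valid")) {
  dir.create(file.path(demo_dir, s), recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(demo_dir, s, "_annotations.coco.json"))
  file.create(file.path(demo_dir, s, "img_001.jpg"))
}
layout <- check_layout(demo_dir)
print(layout)
```

A row with `has_annotations = FALSE` or `n_images = 0` points at the split that needs attention.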
Optional: Dataset Slicing
For large images, or datasets with heterogeneous image sizes, use SAHI slicing to create uniform tiles:
# Slice large images into 1024x1024 tiles
slice_dataset(
  input_dir = "data/raw/large_images",
  output_dir = "data/processed/my_dataset_sliced",
  slice_size = 1024,
  overlap = 0.2
)
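To estimate how many tiles a given image will produce, the tile grid can be worked out by hand. The sketch below assumes the common sliding-window scheme (step = slice size minus the overlap in pixels, with the last tile shifted back to stay inside the image). `tile_starts()` is a hypothetical helper, and whether `slice_dataset()` places tiles exactly this way is an assumption.

```r
# Hypothetical helper: start offsets of tiles along one image axis,
# assuming a fixed step and a final tile clamped to the image edge.
tile_starts <- function(length_px, slice_size = 1024, overlap = 0.2) {
  step <- slice_size - floor(slice_size * overlap)  # 1024 - 204 = 820
  starts <- numeric(0)
  start <- 0
  repeat {
    if (start + slice_size >= length_px) {
      # Last tile: shift it back so it ends at the image edge
      starts <- c(starts, max(0, length_px - slice_size))
      break
    }
    starts <- c(starts, start)
    start <- start + step
  }
  starts
}

# Example: a 4000 x 3000 pixel image (dimensions chosen for illustration)
x_starts <- tile_starts(4000)
y_starts <- tile_starts(3000)
n_tiles <- length(x_starts) * length(y_starts)
print(x_starts)  # 0 820 1640 2460 2976
print(n_tiles)   # 20
```

Images smaller than the slice size yield a single tile under this scheme.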
Optional: Pin Your Dataset
Pin datasets for versioning and easy reference:
# Pin dataset to local board
pin_dataset(
  data_dir = "data/processed/my_dataset",
  dataset_id = "my_dataset_v1"
)
# List pinned datasets
list_datasets()
Local Training
Local training runs on your machine (CPU or GPU). This is the simplest way to train a model.
Basic Training
# Train a model locally
model_id <- train_model(
  data_dir = "data/processed/my_dataset",
  model_id = "my_detector_v1",
  num_classes = 3,   # Number of object classes
  max_iter = 12000,  # Training iterations
  device = "cuda"    # "cpu", "cuda", or "mps"
)
The model will be automatically pinned to your local board (.petrographer/) with full metadata.
Key Parameters Explained
Dataset specification:
# Option 1: Use a directory path
train_model(data_dir = "data/processed/my_dataset", ...)
# Option 2: Use a pinned dataset ID
train_model(dataset_id = "my_dataset_v1", ...)
Model architecture:
# Backbone options: "resnet50" (default), "resnet101", "resnext101"
backbone = "resnet50"
# Freeze early layers for faster training (0-5)
# Higher = more frozen = faster training, less adaptation
freeze_at = 2 # Freeze stem + res2 (recommended)
Training schedule:
# Iterations (depends on dataset size)
max_iter = 12000 # ~1-2 hours on modern GPU
# Learning rate (NULL = auto-scaling based on batch size and freeze_at)
learning_rate = NULL # Let petrographer compute optimal rate
# Batch size (NA = auto: 2 images per GPU)
ims_per_batch = NA
# Validation frequency
eval_period = 1000 # Evaluate every 1000 iterations
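Iteration counts are easier to reason about when converted to passes over the data. The arithmetic below is a rough sketch: the training-set size is a made-up number for illustration, and the batch size assumes the auto default of 2 images on a single GPU.

```r
# Illustrative numbers only: n_train_images is invented for this example
n_train_images <- 500
ims_per_batch <- 2     # auto default on one GPU (2 images per GPU)
max_iter <- 12000
eval_period <- 1000

# Each iteration processes one batch
images_seen <- max_iter * ims_per_batch
approx_epochs <- images_seen / n_train_images
n_evaluations <- max_iter %/% eval_period

print(approx_epochs)  # 48
print(n_evaluations)  # 12
```

If the epoch count looks very high or very low for your dataset, adjust max_iter accordingly.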
What Happens During Training
- Dataset validation - Checks COCO format and class consistency
- Config generation - Computes batch size, learning rate, workers
- Training - Calls Detectron2 via Python with:
  - WarmupCosineLR schedule
  - Differential learning rates (0.1x backbone, 1.0x head)
  - Data augmentations (flips, rotations, color jitter)
- Auto-pinning - Saves model to .petrographer/ with metadata
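The shape of a warmup-plus-cosine schedule can be sketched in a few lines of base R. This is an illustration of the general technique (linear warmup multiplied by a half-cosine decay), not the code petrographer runs. The `warmup_iters` and `warmup_factor` values are assumptions, and `warmup_cosine_lr()` is a hypothetical helper.

```r
# Hypothetical helper: learning rate at a given iteration under
# linear warmup followed by cosine decay to zero at max_iter.
warmup_cosine_lr <- function(iter, base_lr, max_iter,
                             warmup_iters = 1000,    # assumed value
                             warmup_factor = 0.001)  # assumed value
{
  # Warmup multiplier ramps linearly from warmup_factor to 1
  alpha <- pmin(iter / warmup_iters, 1)
  warmup <- warmup_factor * (1 - alpha) + alpha
  # Cosine multiplier decays from 1 (iter = 0) to 0 (iter = max_iter)
  cosine <- 0.5 * (1 + cos(pi * iter / max_iter))
  base_lr * warmup * cosine
}

iters <- c(0, 500, 1000, 6000, 12000)
lr <- warmup_cosine_lr(iters, base_lr = 0.001, max_iter = 12000)
print(data.frame(iteration = iters, lr = signif(lr, 3)))
```

The rate starts near zero, peaks around the end of warmup, and falls back toward zero by the final iteration.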
Loading and Testing Trained Models
Load a Trained Model
# Load from local board (checks .petrographer/ first, then hub)
model <- from_pretrained("my_detector_v1", device = "cuda")
# Alternatively, force local board
model <- from_pretrained("my_detector_v1", board = "local", device = "cuda")
Test on Validation Images
# Get validation images
val_images <- dir_ls("data/processed/my_dataset/valid", glob = "*.jpg")
# Test on a single image
results <- predict(model, val_images[1])
# View detections
results |>
  select(image_id, category_name, confidence, area, perimeter)
# Test on multiple images
batch_results <- predict_images(
  image_dir = "data/processed/my_dataset/valid",
  model = model,
  output_dir = "results/validation_test"
)
Evaluating Training Results
Parse Training Metrics
# Load metrics from training
eval_result <- evaluate_training(
  model_dir = path(".petrographer", model_id, "current")
)
# Summary statistics
eval_result$summary
Plot Training Curves
# Loss curves
eval_result$training_data |>
  select(iteration, contains("loss")) |>
  tidyr::pivot_longer(-iteration, names_to = "loss_type", values_to = "loss") |>
  filter(!is.na(loss)) |>
  ggplot(aes(iteration, loss, color = loss_type)) +
  geom_line() +
  facet_wrap(~loss_type, scales = "free_y") +
  labs(title = "Training Loss") +
  theme_minimal()
# Validation metrics
eval_result$validation_data |>
  tidyr::pivot_longer(-iteration, names_to = "metric", values_to = "value") |>
  filter(!is.na(value)) |>
  ggplot(aes(iteration, value)) +
  geom_line() +
  geom_point(size = 0.5) +
  facet_wrap(~metric, scales = "free_y") +
  labs(title = "Validation Metrics") +
  theme_minimal()
Evaluate on Validation Set
# Run full COCO evaluation with SAHI
sahi_eval <- evaluate_model_sahi(
  model = model,
  annotation_json = "data/processed/my_dataset/valid/_annotations.coco.json",
  image_dir = "data/processed/my_dataset/valid",
  use_slicing = FALSE
)
# COCO metrics
sahi_eval$summary |>
  filter(metric %in% c("AP", "AP50", "AP75", "AR@100"))
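The AP50 and AP75 metrics differ only in how much overlap a detection needs with a ground-truth object to count as a match: an intersection-over-union (IoU) of at least 0.5 or 0.75 respectively, while AP averages over thresholds from 0.50 to 0.95. The sketch below computes IoU for axis-aligned boxes to make that concrete; mask-based evaluation uses the same ratio over pixels. `box_iou()` is a hypothetical helper and the coordinates are made up.

```r
# Hypothetical helper: IoU of two boxes given as c(xmin, ymin, xmax, ymax)
box_iou <- function(a, b) {
  inter_w <- max(0, min(a[3], b[3]) - max(a[1], b[1]))
  inter_h <- max(0, min(a[4], b[4]) - max(a[2], b[2]))
  inter_area <- inter_w * inter_h
  area_a <- (a[3] - a[1]) * (a[4] - a[2])
  area_b <- (b[3] - b[1]) * (b[4] - b[2])
  inter_area / (area_a + area_b - inter_area)
}

truth <- c(30, 30, 130, 130)       # made-up ground-truth box
close_pred <- c(20, 20, 120, 120)  # shifted by 10 px
loose_pred <- c(10, 10, 110, 110)  # shifted by 20 px

iou_close <- box_iou(close_pred, truth)  # about 0.68
iou_loose <- box_iou(loose_pred, truth)  # about 0.47

# Which thresholds does each prediction pass?
print(c(AP50 = iou_close >= 0.5, AP75 = iou_close >= 0.75))
print(c(AP50 = iou_loose >= 0.5, AP75 = iou_loose >= 0.75))
```

Here the closer prediction counts as a match at the 0.5 threshold but not at 0.75, and the looser one matches at neither, which is why AP75 is typically lower than AP50.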
HPC Training (Advanced)
Note
This section is specific to users of University of Florida’s HiPerGator. However, the code demonstrates patterns that can be adapted to other SLURM-based HPC systems by modifying the hipergator package integration.
HPC training is useful for:
- Large datasets requiring long training times
- Multi-GPU training
- Running multiple experiments in parallel
One-Time Setup
Configure your HPC connection once per session:
library(hipergator)
# Configure HiPerGator connection
hpg_configure(
  host = "hpg",
  base_dir = "/blue/mygroup/myusername/petrographer"
)
You can also set these via .Renviron:
PETROGRAPHER_HPC_HOST=hpg.rc.ufl.edu
PETROGRAPHER_HPC_BASE_DIR=/blue/mygroup/myusername/petrographer
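To confirm that R actually sees these variables after a restart, you can read them back with `Sys.getenv()`. The sketch below is a generic base-R check, not how petrographer reads its configuration internally. `get_hpc_setting()` is a hypothetical helper, and the `Sys.setenv()` call only stands in for a populated .Renviron.

```r
# Hypothetical helper: read a setting from the environment and fail
# with a clear message if it is missing or empty.
get_hpc_setting <- function(name) {
  value <- Sys.getenv(name, unset = NA_character_)
  if (is.na(value) || !nzchar(value)) {
    stop("Environment variable ", name,
         " is not set; add it to .Renviron and restart R.")
  }
  value
}

# For demonstration only: simulate values that .Renviron would provide
Sys.setenv(
  PETROGRAPHER_HPC_HOST = "hpg.rc.ufl.edu",
  PETROGRAPHER_HPC_BASE_DIR = "/blue/mygroup/myusername/petrographer"
)

host <- get_hpc_setting("PETROGRAPHER_HPC_HOST")
base_dir <- get_hpc_setting("PETROGRAPHER_HPC_BASE_DIR")
print(c(host = host, base_dir = base_dir))
```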
Run HPC Training
Training mode is auto-detected based on hipergator configuration. Use the same train_model() call:
# Same call as local training - HPC mode auto-detected!
model_id <- train_model(
  data_dir = "data/processed/my_dataset",
  model_id = "my_detector_hpc_v1",
  num_classes = 3,
  max_iter = 12000
  # No hpc_* params needed - uses hipergator config!
)
Resource Configuration
Customize HPC resources for your job:
model_id <- train_model(
  data_dir = "data/processed/my_dataset",
  model_id = "my_large_model",
  num_classes = 5,
  max_iter = 20000,
  # HPC resource hints
  gpus = 2,                # Multi-GPU training
  hpc_cpus_per_task = 28,  # CPUs (default: 14 per GPU)
  hpc_mem = "48gb",        # Memory (default: 24gb per GPU)
  # Batch size must be divisible by GPU count
  ims_per_batch = 16       # 8 per GPU
)
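The per-GPU defaults and the divisibility rule in the comments above can be checked before submitting a job. The sketch below reproduces that arithmetic only. `plan_hpc_resources()` is a hypothetical helper, not part of petrographer or hipergator.

```r
# Hypothetical helper: derive resource values from the GPU count using the
# per-GPU defaults quoted above (14 CPUs and 24 GB per GPU).
plan_hpc_resources <- function(gpus, ims_per_batch,
                               cpus_per_gpu = 14, mem_gb_per_gpu = 24) {
  if (ims_per_batch %% gpus != 0) {
    stop("ims_per_batch (", ims_per_batch,
         ") must be divisible by the GPU count (", gpus, ").")
  }
  list(
    cpus_per_task = cpus_per_gpu * gpus,
    mem = paste0(mem_gb_per_gpu * gpus, "gb"),
    ims_per_gpu = ims_per_batch / gpus
  )
}

plan <- plan_hpc_resources(gpus = 2, ims_per_batch = 16)
str(plan)  # cpus_per_task 28, mem "48gb", ims_per_gpu 8
```

A batch size such as 15 on 2 GPUs fails the check, which is cheaper to find locally than after the job has queued.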
How HPC Training Works
Behind the scenes, train_model():
- Uploads artifacts to HPC:
  - Dataset → base_dir/datasets/{dataset_id}/
  - Training script → base_dir/scripts/train.py
  - Both use rsync - unchanged files are skipped!
- Submits SLURM job with:
  - Working directory: base_dir
  - Command uses relative paths (clean, portable)
  - GPU allocation via hipergator resource spec
- Monitors job until completion
- Downloads results:
  - model_best.pth (weights)
  - config.yaml (model config)
  - metadata.json (class names)
  - metrics.json (training history)
- Pins model to local board for easy loading
Shared Directory Structure
HPC uses an efficient shared structure to avoid re-uploading:
/blue/base_dir/
├── datasets/
│   └── my_dataset_v1/          # Shared across all models
│       ├── train/
│       └── valid/
├── scripts/
│   └── train.py                # Shared script
└── models/
    └── my_detector_hpc_v1/
        └── 20250113120000/     # Run ID (timestamp)
            └── output/
                ├── model_best.pth
                ├── config.yaml
                └── metadata.json
This means:
- First training uploads everything
- Subsequent trainings only upload changed files
- Multiple models can share datasets and scripts
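When looking for a particular run on the cluster, it helps to know how these paths are composed. The sketch below rebuilds them from the layout shown in the tree. `remote_layout()` is a hypothetical helper, and the use of UTC for the timestamp-style run ID is an assumption.

```r
# Hypothetical helper: compose remote paths following the tree above.
remote_layout <- function(base_dir, dataset_id, model_id,
                          run_time = Sys.time()) {
  # Run ID as a 14-digit timestamp, e.g. 20250113120000 (UTC is assumed)
  run_id <- format(run_time, "%Y%m%d%H%M%S", tz = "UTC")
  list(
    dataset = file.path(base_dir, "datasets", dataset_id),
    script = file.path(base_dir, "scripts", "train.py"),
    output = file.path(base_dir, "models", model_id, run_id, "output")
  )
}

paths <- remote_layout(
  base_dir = "/blue/mygroup/myusername/petrographer",
  dataset_id = "my_dataset_v1",
  model_id = "my_detector_hpc_v1",
  run_time = as.POSIXct("2025-01-13 12:00:00", tz = "UTC")
)
print(paths$output)
```

Because the dataset and script paths do not include the run ID, repeated runs reuse them, while each run writes to its own output folder.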
Monitoring and Troubleshooting
# Check HPC connection
hipergator::hpg_check_connection()
# View your base directory
hipergator::hpg_show_base_dir()
# List running jobs (if you have access)
# hipergator provides job monitoring functions
If training fails:
1. Check error messages in the R console
2. Verify base_dir exists on HPC
3. Ensure dataset paths are correct
4. Check GPU availability on your partition
Advanced Topics
Custom Learning Rates
Override auto-scaling for fine control:
train_model(
  data_dir = "data/processed/my_dataset",
  model_id = "my_model",
  num_classes = 3,
  learning_rate = 0.0005,  # Manual LR (head rate; backbone gets 0.1x)
  max_iter = 12000
)
The default auto-scaling uses:
- Batch size (linear scaling)
- freeze_at (more frozen = higher LR)
- Base rate: 0.001 for freeze_at = 2
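The linear scaling rule and the 0.1x backbone multiplier can be illustrated with a small calculation. This is a sketch under stated assumptions, not petrographer's actual formula: the reference batch size of 2 is assumed for illustration, the freeze_at adjustment is left out because its formula is not given here, and `scale_lr()` is a hypothetical helper.

```r
# Hypothetical helper: linear LR scaling with batch size, plus the
# 0.1x backbone rate. reference_batch = 2 is an assumption.
scale_lr <- function(ims_per_batch, base_lr = 0.001, reference_batch = 2) {
  head_lr <- base_lr * ims_per_batch / reference_batch
  c(head = head_lr, backbone = 0.1 * head_lr)
}

lr_small <- scale_lr(ims_per_batch = 2)   # head 0.001, backbone 0.0001
lr_large <- scale_lr(ims_per_batch = 16)  # head 0.008, backbone 0.0008
print(rbind(batch_2 = lr_small, batch_16 = lr_large))
```

The point of linear scaling is that an 8x larger batch takes an 8x larger step, so both rates grow by the same factor.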
Backbone Selection
Choose a backbone based on your needs:
# ResNet50 (default) - Fast, good for most tasks
backbone = "resnet50"
# ResNet101 - Deeper, better accuracy, slower
backbone = "resnet101"
# ResNeXt101 - Best accuracy, slowest, needs more memory
backbone = "resnext101"
Freeze Strategies
Control which layers to train:
# freeze_at = 0: Train entire backbone (slow, best adaptation)
# freeze_at = 1: Freeze stem only (good balance)
# freeze_at = 2: Freeze stem + res2 (default, faster)
# freeze_at = 3: Freeze stem + res2 + res3
# freeze_at = 4: Freeze stem + res2 + res3 + res4
# freeze_at = 5: Freeze entire backbone (fastest, least adaptation)
train_model(
  data_dir = "data/processed/my_dataset",
  model_id = "my_model",
  num_classes = 3,
  freeze_at = 1,        # Less frozen = slower but better domain adaptation
  learning_rate = NULL  # Auto-scale adjusts for freeze_at
)
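The freeze_at table above amounts to "freeze the first N backbone stages". The sketch below encodes that mapping so you can list what stays trainable for a given setting. `frozen_stages()` is a hypothetical helper, and the name res5 for the last stage is inferred from the pattern in the comments.

```r
# Hypothetical helper: which backbone stages are frozen for a given freeze_at
backbone_stages <- c("stem", "res2", "res3", "res4", "res5")

frozen_stages <- function(freeze_at) {
  stopifnot(length(freeze_at) == 1, freeze_at %in% 0:5)
  backbone_stages[seq_len(freeze_at)]
}

frozen <- frozen_stages(2)                      # "stem" "res2"
trainable <- setdiff(backbone_stages, frozen)   # "res3" "res4" "res5"
print(list(frozen = frozen, trainable = trainable))
```

With freeze_at = 0 nothing is frozen, and with freeze_at = 5 only the detection heads are left to train.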
Using Pinned Datasets
Pin datasets once, reuse many times:
# Pin dataset
pin_dataset("data/processed/my_dataset", "my_dataset_v1")
# Train multiple models with same dataset
train_model(dataset_id = "my_dataset_v1", model_id = "model_a", ...)
train_model(dataset_id = "my_dataset_v1", model_id = "model_b", ...)
Publishing to Hub (Maintainers)
To share models with the community, maintainers can publish to the public hub:
# Create hub board (maintainers only)
hub_board <- pins::board_folder(
  here::here("pkgdown/assets/pins"),
  versioned = TRUE
)
# Pin model to hub
pin_model(
  model_dir = ".petrographer/my_detector_v1/current",
  model_id = "my_detector_v1",
  board = hub_board,
  metadata = list(
    description = "Shell detector optimized for XPL images",
    backbone = "resnet50",
    num_classes = 5
  )
)
# Update manifest
pins::write_board_manifest(hub_board)
# Deploy via pkgdown
# pkgdown::build_site()
# git commit & push
See data-raw/publish-to-hub.R in the package source for the complete workflow.
Next Steps
- Prediction workflow: See vignette("whole-slide-basics") for inference
- Model library: See vignette("model-library") for available pretrained models
- Function reference: See ?train_model, ?from_pretrained, ?evaluate_model_sahi
Troubleshooting
Training fails with CUDA out of memory:
- Reduce ims_per_batch (e.g., try 4 or 2)
- Use a smaller backbone (resnet50 instead of resnext101)
- Increase freeze_at to train fewer parameters
Training is very slow:
- Ensure device = "cuda" (not "cpu")
- Check GPU availability: reticulate::py_run_string("import torch; print(torch.cuda.is_available())")
- Increase eval_period (evaluate less frequently)
Class names not showing in predictions:
- Ensure your COCO JSON has a categories list with name fields
- Check metadata.json was created: list.files(".petrographer/my_model/current")
- The model should load category_mapping automatically
HPC job fails:
- Verify base_dir exists: hipergator::hpg_show_base_dir()
- Check dataset paths are correct relative to data_dir
- Ensure GPU partition has availability