Model Management and Resource Control

This guide explains how to manage deep learning models and control GPU/CPU resources in HiTMicTools.

Overview

HiTMicTools uses multiple neural network models for different analysis tasks:

Focus Restoration: NAFNet models for brightfield and fluorescence channels
Segmentation: MonaiUnet or RT-DETR for cell/object detection
Classification: ResNet-based classifiers for cell type and quality
PI Classification: Models for propidium iodide staining detection
Out-of-Focus Detection: Quality control models

1. Model Collections (Recommended Approach)

Model collections are ZIP bundles that contain all required models and configurations for a complete analysis pipeline.

Advantages of Model Collections

Simplified Deployment: Single file contains all models
Version Consistency: All models are compatible and tested together
Tracking Support: Includes btrack configuration files
Easy Distribution: Share one file instead of multiple directories
Reproducibility: Ensures everyone uses the same model versions

Available Model Collections

Collection	Pipeline	Tracking	Description
`model_collection_tracking_20250529.zip`	ASCT_focusrestore	Yes	Full pipeline with tracking
`model_collection_scsegm_20251106.zip`	ASCT_scsegm	Optional	RT-DETR instance segmentation
`model_collection_oof_20251014.zip`	Various	No	With out-of-focus detection

Using Model Collections

Basic Configuration

models:
  model_collection: "./models/model_collection_tracking_20250529.zip"

That’s it! The pipeline automatically extracts and loads all required models.

What’s Inside a Model Collection?

model_collection_tracking_20250529.zip
├── bf_focus_restorer/
│   ├── model.pth                    # Brightfield focus model weights
│   └── model_metadata.json          # Model architecture config
├── fl_focus_restorer/
│   ├── model.pth                    # Fluorescence focus model weights
│   └── model_metadata.json
├── segmentation/
│   ├── model.pth                    # Segmentation model weights
│   └── model_metadata.json
├── cell_classifier/
│   ├── model.onnx                   # Cell classifier (ONNX format)
│   └── model_metadata.json
├── pi_classifier/
│   ├── model.joblib                 # PI classifier (scikit-learn)
│   └── model_metadata.json
└── tracking/
    └── tracking_config.json         # btrack configuration

Creating Custom Model Collections

Using the CLI (Recommended)

Create model bundles directly from the command line:

# 1. Create a configuration file describing your models
cat > models_info.yml << EOF
bf_focus:
  model_path: "./models/bf_focus/model.pth"
  model_metadata: "./models/bf_focus/config.json"
  inferer_args:
    scale_method: "range01"
    patch_size: 256

fl_focus:
  model_path: "./models/fl_focus/model.pth"
  model_metadata: "./models/fl_focus/config.json"

segmentation:
  model_path: "./models/segmentation/model.pth"
  model_metadata: "./models/segmentation/config.json"

cell_classifier:
  model_path: "./models/classifier/model.onnx"
  model_metadata: "./models/classifier/config.json"

pi_classification:
  model_path: "./models/pi_classifier/model.joblib"

tracker:
  config_path: "./tracking/config.json"  # Optional
EOF

# 2. Create the bundle (date will be auto-inserted)
hitmictools bundle -i models_info.yml -o my_custom_collection.zip
# Creates: my_custom_collection_20251218.zip

# Disable auto-dating if you want exact filename
hitmictools bundle -i models_info.yml -o my_bundle.zip --no-auto-date

The bundle will include:

All model files with standardized naming
Metadata JSON files for each model
Internal config.yml with creation timestamp
Optional tracker configuration

Using the Standalone Script (Legacy)

For backward compatibility, the script interface is still available:

# Explicit output path (no auto-dating)
python scripts/create_model_bundle.py create \
    -i models_info.yml \
    -o my_custom_collection.zip

# Auto-dated output
python scripts/create_model_bundle.py create-mbundle \
    -i models_info.yml \
    -d ./output_directory/
# Creates: ./output_directory/model_collection_20251218.zip

2. Individual Model Specification (Advanced)

For development, testing, or custom pipelines, you can specify each model individually.

Complete Individual Model Configuration

# Do NOT include model_collection when using individual models

bf_focus:
  model_path: "/path/to/models/bf_focus/model.pth"
  model_metadata: "/path/to/models/bf_focus/config.json"
  inferer_args:
    scale_method: "range01"          # Scaling: "range01", "fixed_range", "none"
    patch_size: 256                  # Must be power of 2
    overlap_ratio: 0.25              # 0.0-0.5, higher = smoother but slower
    half_precision: true             # Use FP16 for faster inference
  scaler_args:
    pmin: 1.0                        # Percentile min for range01
    pmax: 99.8                       # Percentile max for range01

fl_focus:
  model_path: "/path/to/models/fl_focus/model.pth"
  model_metadata: "/path/to/models/fl_focus/config.json"
  inferer_args:
    scale_method: "fixed_range"
    patch_size: 256
    overlap_ratio: 0.25
    half_precision: true
  scaler_args:
    bit_depth: 12                    # For fixed_range scaling

segmentation:
  model_path: "/path/to/models/segmentation/model.pth"
  model_metadata: "/path/to/models/segmentation/config.json"
  inferer_args:
    scale_method: "none"             # Already normalized
    patch_size: 512
    overlap_ratio: 0.25
    half_precision: true

cell_classifier:
  model_path: "/path/to/models/cell_classifier/model.onnx"
  model_metadata: "/path/to/models/cell_classifier/config.json"
  model_args:
    batch_size: 512                  # Adjust based on GPU memory
    min_size: 128                    # Minimum object size to classify
  classes:
    0: "single-cell"
    1: "clump"
    2: "noise"
    3: "off-focus"
    4: "joint-cell"

pi_classification:
  pi_classifier_path: "/path/to/models/pi_classifier/model.joblib"
  # scikit-learn model, no additional config needed

# Optional: Out-of-focus detector
oof_detector:
  model_path: "/path/to/models/oof_detector/model.pth"
  model_metadata: "/path/to/models/oof_detector/config.json"

Model Loading Details

The pipeline loads models using load_model_fromdict() which supports:

Valid model keys:

bf_focus - Brightfield focus restoration
fl_focus - Fluorescence focus restoration
segmentation - Cell segmentation
cell_classifier - Cell type/quality classification
pi_classification - PI staining classification
oof_detector - Out-of-focus detection
sc_segmenter - Single-cell instance segmentation (RT-DETR)

Supported model formats:

PyTorch (.pth, .pt) - Most models
ONNX (.onnx) - Cross-platform inference
scikit-learn (.joblib) - Traditional ML models

3. Model Architectures

Focus Restoration Models

NAFNet (Nonlinear Activation Free Network)

Architecture: U-Net style with NAF blocks
Input: Single-channel grayscale (brightfield or fluorescence)
Output: Focus-restored image
Typical size: ~10-50 MB
Inference: ~0.5-2 seconds per frame (GPU)

MonaiUnet

Architecture: MONAI U-Net
Alternative to NAFNet
Similar performance, different training approach

Segmentation Models

MonaiUnet for Segmentation

Architecture: MONAI U-Net with instance segmentation head
Input: Single-channel brightfield
Output: Instance segmentation masks
Typical size: ~50-100 MB

RT-DETR (Real-Time Detection Transformer)

Architecture: Transformer-based object detection
Used in ASCT_scsegm pipeline
Input: Single-channel brightfield
Output: Bounding boxes + masks
Typical size: ~100-200 MB
Better for crowded/overlapping cells

Classification Models

FlexResNet

Architecture: Custom ResNet variant
Input: Cropped cell images (typically 128x128)
Output: Class probabilities
Classes: single-cell, clump, noise, off-focus, joint-cell
Typical size: ~20-50 MB

PI Classifier

Architecture: Random Forest or Logistic Regression
Input: Intensity features from FL channel
Output: PI positive/negative
Typical size: <1 MB

4. Resource Management

HiTMicTools includes sophisticated GPU/CPU memory management for multi-process environments.

The ReserveResource System

ReserveResource is a context manager that prevents GPU memory over-subscription:

from HiTMicTools.resource_management.reserveresource import ReserveResource
import torch

# Reserve 8 GB of GPU memory
with ReserveResource(torch.device("cuda:0"), required_gb=8.0, logger=logger):
    # Run your analysis here
    # Other processes will queue if insufficient memory
    run_pipeline()

How It Works

Booking System: Creates JSON file tracking memory usage
- File location: TMPDIR/memory_bookings_cuda0.json
- Tracks total reserved memory per device
Queueing: When memory unavailable, processes wait in queue
- Fair allocation (first-come, first-served)
- Periodic checks for available memory
- Automatic cleanup on exit
Cross-Platform: Works on macOS (MPS), Linux (CUDA), Windows (CUDA/CPU)

Configuration for Multi-Process

When running multiple processes:

pipeline_setup:
  parallel_processing: true
  num_workers: 3              # Number of concurrent processes

# Internally, each process reserves memory:
# - Focus restoration: ~4-6 GB
# - Segmentation: ~6-8 GB
# - Classification: ~2-4 GB
# Total per process: ~12-18 GB peak

Important: Set num_workers based on available VRAM:

16 GB GPU: num_workers: 1-2
24 GB GPU: num_workers: 2-3
40 GB GPU: num_workers: 3-4

Memory Logging

Track memory usage during processing:

from HiTMicTools.resource_management.memlogger import MemoryLogger

logger = MemoryLogger(log_dir="./logs", prefix="analysis")

# Log memory at specific points
logger.info("Starting segmentation", show_memory=True, cuda=True)
# Output: [INFO] Starting segmentation | RAM: 12.3 GB | VRAM: 8.5 GB

Cleanup and Cache Management

All models inherit from BaseModel which provides cleanup:

# Manual cleanup (usually automatic)
model.cleanup()

# This calls:
# - torch.cuda.empty_cache()
# - del self.model
# - gc.collect()

The pipeline automatically calls cleanup between stages.

5. Model Performance and Optimization

Inference Speed Optimization

Use Half Precision (FP16)

inferer_args:
  half_precision: true    # 2x faster, ~50% less memory

Adjust Patch Size

inferer_args:
  patch_size: 256         # Smaller = slower but less memory
  # Options: 128, 256, 512, 1024

Reduce Overlap

inferer_args:
  overlap_ratio: 0.125    # Less overlap = faster but more artifacts
  # Range: 0.0 - 0.5

Batch Size for Classifiers

model_args:
  batch_size: 1024        # Larger = faster but more memory

Typical Performance Metrics

For a 2048x2048 image, single frame:

Task	GPU (RTX 4090)	CPU	VRAM
Focus Restoration (BF)	0.8s	15s	4 GB
Focus Restoration (FL)	0.8s	15s	4 GB
Segmentation	1.2s	25s	6 GB
Classification (500 cells)	0.3s	2s	2 GB
Total per frame	~3s	~60s	~12 GB peak

Multi-frame movie (100 frames):

GPU: ~5-8 minutes
CPU: ~1.5-2 hours

Model Versioning and Reproducibility

Track model versions:

# In your config, add comments
models:
  model_collection: "./models/model_collection_tracking_20250529.zip"
  # Version: 2025-05-29
  # Training date: 2025-05-15
  # Training dataset: ASCT_batch_042
  # Validation accuracy: 98.5%

Save model metadata:

# Model metadata JSON includes:
{
  "model_type": "NAFNet",
  "architecture": "unet_style",
  "input_channels": 1,
  "output_channels": 1,
  "training_date": "2025-05-15",
  "training_dataset": "ASCT_batch_042",
  "validation_metrics": {
    "psnr": 32.5,
    "ssim": 0.95
  }
}

6. Troubleshooting Models

Model Loading Errors

“Model file not found”

# Check paths
import os
print(os.path.exists("./models/model_collection.zip"))

# Use absolute paths
models:
  model_collection: "/full/path/to/model_collection.zip"

“Invalid model metadata”

# Metadata must match model architecture
# Check model_metadata.json:
{
  "model_type": "NAFNet",  # Must match actual model
  "input_channels": 1,      # Must be correct
  ...
}

“CUDA out of memory”

# Solutions:
# 1. Enable half precision
inferer_args:
  half_precision: true

# 2. Reduce patch size
inferer_args:
  patch_size: 128

# 3. Reduce batch size
model_args:
  batch_size: 256

# 4. Use fewer workers
pipeline_setup:
  num_workers: 1

Model Performance Issues

Focus restoration artifacts

# Increase overlap
inferer_args:
  overlap_ratio: 0.35     # Default 0.25

# Adjust scaling
scaler_args:
  pmin: 0.5               # Less aggressive clipping
  pmax: 99.9

Poor segmentation

# Check preprocessing
pipeline_setup:
  focus_correction: true   # Ensure enabled
  method: "basicpy_fl"     # Try different methods

# Verify correct channel
pipeline_setup:
  reference_channel: 0     # Should be brightfield

Misclassification

# Adjust minimum object size
model_args:
  min_size: 100            # Filter out smaller objects

# Check class definitions match training
classes:
  0: "single-cell"         # Must match model training
  1: "clump"
  ...

7. Model Storage Best Practices

File Organization

Recommended structure:

models/
├── collections/
│   ├── model_collection_tracking_20250529.zip
│   ├── model_collection_scsegm_20251106.zip
│   └── README.md                    # Document model versions
├── individual/
│   ├── bf_focus/
│   │   ├── v1.0/
│   │   │   ├── model.pth
│   │   │   └── config.json
│   │   └── v2.0/
│   │       ├── model.pth
│   │       └── config.json
│   └── ...
└── experimental/
    └── ...                          # Models under development

Version Control

DO:

Store model collections on shared filesystem or cloud storage
Document model versions in README
Tag configs with model version information
Keep checksums of model files for verification

DO NOT:

Commit large model files to git (use .gitignore)
Overwrite production models without versioning
Mix model versions in the same collection
Share models without documentation

Model Distribution

For sharing models:

# 1. Create bundle with CLI
hitmictools bundle -i models_info.yml -o model_collection_v1.0.zip
# Creates: model_collection_v1.0_20251218.zip (with auto-dating)

# 2. Calculate checksum
sha256sum model_collection_v1.0_20251218.zip > model_collection_v1.0_20251218.zip.sha256

# 3. Document
echo "Model Collection v1.0" > README.txt
echo "Creation date: 2025-12-18" >> README.txt
echo "Training date: 2025-05-29" >> README.txt
echo "Dataset: ASCT_training_set_042" >> README.txt
cat model_collection_v1.0_20251218.zip.sha256 >> README.txt

# 4. Share via cloud/network storage
# DO NOT email large files

Summary

Model management in HiTMicTools:

Use model collections for production (simplest approach)
Individual models for development and testing
ReserveResource for GPU memory management
MemoryLogger for monitoring resource usage
Half precision and patch size tuning for performance
Version control for reproducibility

For model training and development, see the experimental notebooks in experiments/.