Model Management and Resource Control

This guide explains how to manage deep learning models and control GPU/CPU resources in HiTMicTools.

Overview

HiTMicTools uses multiple neural network models for different analysis tasks:

  • Focus Restoration: NAFNet models for brightfield and fluorescence channels

  • Segmentation: MonaiUnet or RT-DETR for cell/object detection

  • Classification: ResNet-based classifiers for cell type and quality

  • PI Classification: Models for propidium iodide staining detection

  • Out-of-Focus Detection: Quality control models

2. Individual Model Specification (Advanced)

For development, testing, or custom pipelines, you can specify each model individually.

Complete Individual Model Configuration

# Do NOT include model_collection when using individual models

bf_focus:
  model_path: "/path/to/models/bf_focus/model.pth"
  model_metadata: "/path/to/models/bf_focus/config.json"
  inferer_args:
    scale_method: "range01"          # Scaling: "range01", "fixed_range", "none"
    patch_size: 256                  # Must be power of 2
    overlap_ratio: 0.25              # 0.0-0.5, higher = smoother but slower
    half_precision: true             # Use FP16 for faster inference
  scaler_args:
    pmin: 1.0                        # Percentile min for range01
    pmax: 99.8                       # Percentile max for range01

fl_focus:
  model_path: "/path/to/models/fl_focus/model.pth"
  model_metadata: "/path/to/models/fl_focus/config.json"
  inferer_args:
    scale_method: "fixed_range"
    patch_size: 256
    overlap_ratio: 0.25
    half_precision: true
  scaler_args:
    bit_depth: 12                    # For fixed_range scaling

segmentation:
  model_path: "/path/to/models/segmentation/model.pth"
  model_metadata: "/path/to/models/segmentation/config.json"
  inferer_args:
    scale_method: "none"             # Already normalized
    patch_size: 512
    overlap_ratio: 0.25
    half_precision: true

cell_classifier:
  model_path: "/path/to/models/cell_classifier/model.onnx"
  model_metadata: "/path/to/models/cell_classifier/config.json"
  model_args:
    batch_size: 512                  # Adjust based on GPU memory
    min_size: 128                    # Minimum object size to classify
  classes:
    0: "single-cell"
    1: "clump"
    2: "noise"
    3: "off-focus"
    4: "joint-cell"

pi_classification:
  pi_classifier_path: "/path/to/models/pi_classifier/model.joblib"
  # scikit-learn model, no additional config needed

# Optional: Out-of-focus detector
oof_detector:
  model_path: "/path/to/models/oof_detector/model.pth"
  model_metadata: "/path/to/models/oof_detector/config.json"

Model Loading Details

The pipeline loads models using load_model_fromdict() which supports:

Valid model keys:

  • bf_focus - Brightfield focus restoration

  • fl_focus - Fluorescence focus restoration

  • segmentation - Cell segmentation

  • cell_classifier - Cell type/quality classification

  • pi_classification - PI staining classification

  • oof_detector - Out-of-focus detection

  • sc_segmenter - Single-cell instance segmentation (RT-DETR)

Supported model formats:

  • PyTorch (.pth, .pt) - Most models

  • ONNX (.onnx) - Cross-platform inference

  • scikit-learn (.joblib) - Traditional ML models

3. Model Architectures

Focus Restoration Models

NAFNet (Nonlinear Activation Free Network)

  • Architecture: U-Net style with NAF blocks

  • Input: Single-channel grayscale (brightfield or fluorescence)

  • Output: Focus-restored image

  • Typical size: ~10-50 MB

  • Inference: ~0.5-2 seconds per frame (GPU)

MonaiUnet

  • Architecture: MONAI U-Net

  • Alternative to NAFNet

  • Similar performance, different training approach

Segmentation Models

MonaiUnet for Segmentation

  • Architecture: MONAI U-Net with instance segmentation head

  • Input: Single-channel brightfield

  • Output: Instance segmentation masks

  • Typical size: ~50-100 MB

RT-DETR (Real-Time Detection Transformer)

  • Architecture: Transformer-based object detection

  • Used in ASCT_scsegm pipeline

  • Input: Single-channel brightfield

  • Output: Bounding boxes + masks

  • Typical size: ~100-200 MB

  • Better for crowded/overlapping cells

Classification Models

FlexResNet

  • Architecture: Custom ResNet variant

  • Input: Cropped cell images (typically 128x128)

  • Output: Class probabilities

  • Classes: single-cell, clump, noise, off-focus, joint-cell

  • Typical size: ~20-50 MB

PI Classifier

  • Architecture: Random Forest or Logistic Regression

  • Input: Intensity features from FL channel

  • Output: PI positive/negative

  • Typical size: <1 MB

4. Resource Management

HiTMicTools includes sophisticated GPU/CPU memory management for multi-process environments.

The ReserveResource System

ReserveResource is a context manager that prevents GPU memory over-subscription:

from HiTMicTools.resource_management.reserveresource import ReserveResource
import torch

# Reserve 8 GB of GPU memory
with ReserveResource(torch.device("cuda:0"), required_gb=8.0, logger=logger):
    # Run your analysis here
    # Other processes will queue if insufficient memory
    run_pipeline()

How It Works

  1. Booking System: Creates JSON file tracking memory usage

    • File location: TMPDIR/memory_bookings_cuda0.json

    • Tracks total reserved memory per device

  2. Queueing: When memory unavailable, processes wait in queue

    • Fair allocation (first-come, first-served)

    • Periodic checks for available memory

    • Automatic cleanup on exit

  3. Cross-Platform: Works on macOS (MPS), Linux (CUDA), Windows (CUDA/CPU)

Configuration for Multi-Process

When running multiple processes:

pipeline_setup:
  parallel_processing: true
  num_workers: 3              # Number of concurrent processes

# Internally, each process reserves memory:
# - Focus restoration: ~4-6 GB
# - Segmentation: ~6-8 GB
# - Classification: ~2-4 GB
# Total per process: ~12-18 GB peak

Important: Set num_workers based on available VRAM:

  • 16 GB GPU: num_workers: 1-2

  • 24 GB GPU: num_workers: 2-3

  • 40 GB GPU: num_workers: 3-4

Memory Logging

Track memory usage during processing:

from HiTMicTools.resource_management.memlogger import MemoryLogger

logger = MemoryLogger(log_dir="./logs", prefix="analysis")

# Log memory at specific points
logger.info("Starting segmentation", show_memory=True, cuda=True)
# Output: [INFO] Starting segmentation | RAM: 12.3 GB | VRAM: 8.5 GB

Cleanup and Cache Management

All models inherit from BaseModel which provides cleanup:

# Manual cleanup (usually automatic)
model.cleanup()

# This calls:
# - torch.cuda.empty_cache()
# - del self.model
# - gc.collect()

The pipeline automatically calls cleanup between stages.

5. Model Performance and Optimization

Inference Speed Optimization

Use Half Precision (FP16)

inferer_args:
  half_precision: true    # 2x faster, ~50% less memory

Adjust Patch Size

inferer_args:
  patch_size: 256         # Smaller = slower but less memory
  # Options: 128, 256, 512, 1024

Reduce Overlap

inferer_args:
  overlap_ratio: 0.125    # Less overlap = faster but more artifacts
  # Range: 0.0 - 0.5

Batch Size for Classifiers

model_args:
  batch_size: 1024        # Larger = faster but more memory

Typical Performance Metrics

For a 2048x2048 image, single frame:

Task

GPU (RTX 4090)

CPU

VRAM

Focus Restoration (BF)

0.8s

15s

4 GB

Focus Restoration (FL)

0.8s

15s

4 GB

Segmentation

1.2s

25s

6 GB

Classification (500 cells)

0.3s

2s

2 GB

Total per frame

~3s

~60s

~12 GB peak

Multi-frame movie (100 frames):

  • GPU: ~5-8 minutes

  • CPU: ~1.5-2 hours

Model Versioning and Reproducibility

Track model versions:

# In your config, add comments
models:
  model_collection: "./models/model_collection_tracking_20250529.zip"
  # Version: 2025-05-29
  # Training date: 2025-05-15
  # Training dataset: ASCT_batch_042
  # Validation accuracy: 98.5%

Save model metadata:

# Model metadata JSON includes:
{
  "model_type": "NAFNet",
  "architecture": "unet_style",
  "input_channels": 1,
  "output_channels": 1,
  "training_date": "2025-05-15",
  "training_dataset": "ASCT_batch_042",
  "validation_metrics": {
    "psnr": 32.5,
    "ssim": 0.95
  }
}

6. Troubleshooting Models

Model Loading Errors

“Model file not found”

# Check paths
import os
print(os.path.exists("./models/model_collection.zip"))

# Use absolute paths
models:
  model_collection: "/full/path/to/model_collection.zip"

“Invalid model metadata”

# Metadata must match model architecture
# Check model_metadata.json:
{
  "model_type": "NAFNet",  # Must match actual model
  "input_channels": 1,      # Must be correct
  ...
}

“CUDA out of memory”

# Solutions:
# 1. Enable half precision
inferer_args:
  half_precision: true

# 2. Reduce patch size
inferer_args:
  patch_size: 128

# 3. Reduce batch size
model_args:
  batch_size: 256

# 4. Use fewer workers
pipeline_setup:
  num_workers: 1

Model Performance Issues

Focus restoration artifacts

# Increase overlap
inferer_args:
  overlap_ratio: 0.35     # Default 0.25

# Adjust scaling
scaler_args:
  pmin: 0.5               # Less aggressive clipping
  pmax: 99.9

Poor segmentation

# Check preprocessing
pipeline_setup:
  focus_correction: true   # Ensure enabled
  method: "basicpy_fl"     # Try different methods

# Verify correct channel
pipeline_setup:
  reference_channel: 0     # Should be brightfield

Misclassification

# Adjust minimum object size
model_args:
  min_size: 100            # Filter out smaller objects

# Check class definitions match training
classes:
  0: "single-cell"         # Must match model training
  1: "clump"
  ...

7. Model Storage Best Practices

File Organization

Recommended structure:

models/
├── collections/
│   ├── model_collection_tracking_20250529.zip
│   ├── model_collection_scsegm_20251106.zip
│   └── README.md                    # Document model versions
├── individual/
│   ├── bf_focus/
│   │   ├── v1.0/
│   │   │   ├── model.pth
│   │   │   └── config.json
│   │   └── v2.0/
│   │       ├── model.pth
│   │       └── config.json
│   └── ...
└── experimental/
    └── ...                          # Models under development

Version Control

DO:

  • Store model collections on shared filesystem or cloud storage

  • Document model versions in README

  • Tag configs with model version information

  • Keep checksums of model files for verification

DO NOT:

  • Commit large model files to git (use .gitignore)

  • Overwrite production models without versioning

  • Mix model versions in the same collection

  • Share models without documentation

Model Distribution

For sharing models:

# 1. Create bundle with CLI
hitmictools bundle -i models_info.yml -o model_collection_v1.0.zip
# Creates: model_collection_v1.0_20251218.zip (with auto-dating)

# 2. Calculate checksum
sha256sum model_collection_v1.0_20251218.zip > model_collection_v1.0_20251218.zip.sha256

# 3. Document
echo "Model Collection v1.0" > README.txt
echo "Creation date: 2025-12-18" >> README.txt
echo "Training date: 2025-05-29" >> README.txt
echo "Dataset: ASCT_training_set_042" >> README.txt
cat model_collection_v1.0_20251218.zip.sha256 >> README.txt

# 4. Share via cloud/network storage
# DO NOT email large files

Summary

Model management in HiTMicTools:

  • Use model collections for production (simplest approach)

  • Individual models for development and testing

  • ReserveResource for GPU memory management

  • MemoryLogger for monitoring resource usage

  • Half precision and patch size tuning for performance

  • Version control for reproducibility

For model training and development, see the experimental notebooks in experiments/.