WandB Offline Training Workflow¶
This document explains how to use Weights & Biases (WandB) in offline mode for training environments without internet connectivity.
Overview¶
The enhanced WandB integration supports three modes: - online: Default mode, requires internet and WandB API key - offline: Stores logs locally for later syncing - disabled: Completely disables WandB logging
Quick Start - Offline Training¶
1. Use Offline Configuration¶
# Train with offline configuration
python scripts/train.py --config configs/offline.yaml
# Or override mode in any config
python scripts/train.py --wandb-mode offline
2. Monitor Offline Runs¶
# List all offline runs
python scripts/sync_wandb.py --list
# Check specific run
ls -la ./wandb/offline-run-*
3. Sync to WandB Cloud¶
# Set your API key
export WANDB_API_KEY=your_api_key_here
# Sync all offline runs
python scripts/sync_wandb.py
# Sync specific run
python scripts/sync_wandb.py --run-id offline-run-20231120_123456-abc123
Configuration Options¶
YAML Configuration¶
logging:
use_wandb: true
wandb_mode: "offline" # online, offline, disabled
wandb_dir: "./wandb" # Directory for offline storage
wandb_project: "mistral-ner"
wandb_name: "offline-experiment"
wandb_tags: ["offline", "experiment"]
wandb_notes: "Offline training run"
wandb_resume: "allow" # allow, must, never, auto
wandb_run_id: null # For resuming specific runs
Command Line Override¶
# Basic offline training
python scripts/train.py --wandb-mode offline
# Offline with custom settings
python scripts/train.py \
--wandb-mode offline \
--wandb-name "my-offline-experiment" \
--wandb-tags "offline" "baseline" \
--wandb-notes "Testing offline functionality"
# Disable WandB completely
python scripts/train.py --no-wandb
Environment Variables¶
All WandB settings can be controlled via environment variables:
export WANDB_MODE=offline
export WANDB_DIR=./wandb
export WANDB_PROJECT=mistral-ner
export WANDB_NAME=offline-experiment
export WANDB_TAGS=offline,experiment
export WANDB_NOTES="Offline training run"
Offline Workflow¶
1. Training Phase (No Internet Required)¶
# Start offline training
python scripts/train.py --config configs/offline.yaml
# Training runs normally, logs stored locally in ./wandb/
# No internet connection required
2. Management Phase¶
# List offline runs with details
python scripts/sync_wandb.py --list
# Expected output:
# Found 2 offline runs:
# - offline-run-20231120_123456-abc123 (45.2 MB)
# - offline-run-20231120_134567-def456 (38.7 MB)
# Preview what would be synced
python scripts/sync_wandb.py --dry-run
3. Sync Phase (Internet Required)¶
# Set API key (get from https://wandb.ai/settings)
export WANDB_API_KEY=your_api_key_here
# Sync all runs
python scripts/sync_wandb.py
# Or sync specific run
python scripts/sync_wandb.py --run-id offline-run-20231120_123456-abc123
Resuming Offline Runs¶
Resume from Checkpoint¶
# Resume specific run
python scripts/train.py \
--wandb-mode offline \
--wandb-run-id abc123def456 \
--wandb-resume must \
--resume-from-checkpoint ./mistral-ner-finetuned/checkpoint-100
Continue Offline Series¶
# Start new run with continuation tags
python scripts/train.py \
--wandb-mode offline \
--wandb-name "experiment-part-2" \
--wandb-tags "offline" "continuation" \
--wandb-notes "Continuing from previous offline run"
Troubleshooting¶
No API Key Warning¶
If you see this warning during online mode:
Solution: The system automatically falls back to offline mode. Set your API key when ready to sync:
Sync Failures¶
# Check run integrity
ls -la ./wandb/offline-run-*/
# Try syncing with verbose output
wandb sync ./wandb/offline-run-20231120_123456-abc123 --verbose
# Clean corrupted run (last resort)
rm -rf ./wandb/offline-run-corrupted-id
Large Offline Runs¶
For runs with large artifacts:
# Check sizes before syncing
python scripts/sync_wandb.py --list
# Sync specific large runs one at a time
python scripts/sync_wandb.py --run-id large-run-id
Best Practices¶
1. Directory Management¶
# Keep offline runs organized
export WANDB_DIR=/data/wandb-offline
# Or use date-based directories
export WANDB_DIR=./wandb/$(date +%Y-%m-%d)
2. Tagging Strategy¶
wandb_tags:
- "offline" # Always tag offline runs
- "experiment-v1" # Experiment version
- "gpu-count-4" # Hardware info
- "batch-size-32" # Key hyperparameters
3. Regular Sync Schedule¶
# Create sync script for regular intervals
#!/bin/bash
# sync-daily.sh
export WANDB_API_KEY=your_key
python scripts/sync_wandb.py
echo "Sync completed at $(date)"
4. Storage Cleanup¶
# After successful sync, clean old runs
# (Make sure sync was successful first!)
find ./wandb -name "offline-run-*" -mtime +7 -exec rm -rf {} \;
Integration Examples¶
Slurm Job Script¶
#!/bin/bash
#SBATCH --job-name=mistral-ner-offline
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
# No internet on compute nodes - use offline mode
python scripts/train.py \
--config configs/offline.yaml \
--wandb-name "slurm-job-${SLURM_JOB_ID}"
# Sync after job completes (if login node has internet)
if [ $? -eq 0 ]; then
python scripts/sync_wandb.py
fi
Docker Container¶
# Dockerfile
FROM python:3.11
# Copy offline config
COPY configs/offline.yaml /app/configs/
COPY scripts/ /app/scripts/
# Run offline by default
CMD ["python", "scripts/train.py", "--config", "configs/offline.yaml"]
Jupyter Notebook¶
# notebook_training.ipynb
import os
os.environ['WANDB_MODE'] = 'offline'
from src.config import Config
config = Config.from_yaml('configs/offline.yaml')
config.setup_wandb()
# Training proceeds normally, logs stored locally
Advanced Features¶
Custom Offline Directory Structure¶
# Custom directory naming
import datetime
wandb_dir = f"./experiments/{datetime.date.today()}/wandb"
config.logging.wandb_dir = wandb_dir
Batch Sync Operations¶
# Sync runs matching pattern
for run in ./wandb/offline-run-*experiment*; do
python scripts/sync_wandb.py --run-id $(basename $run)
done
Monitoring Disk Usage¶
This workflow enables seamless ML experimentation in offline environments while maintaining full WandB integration capabilities.