Hyperparameter Optimization for Mistral NER

This document describes the hyperparameter optimization feature that combines Bayesian Optimization (via Optuna) with Hyperband (via Ray Tune) for efficient and intelligent hyperparameter search.

🎯 Overview

The hyperparameter optimization system provides multiple strategies for finding optimal configurations:

  • 🧠 Bayesian Optimization: Intelligent search using Optuna's TPE algorithm
  • ⚡ Hyperband: Efficient early stopping using Ray Tune's ASHA scheduler
  • 🔥 Combined Strategy: Best of both worlds - intelligent search + smart stopping
  • 🎲 Random Search: Baseline comparison strategy

🚀 Quick Start

1. Enable Hyperparameter Optimization

Add to your config YAML:

hyperopt:
  enabled: true
  strategy: "combined"  # Recommended: Bayesian + Hyperband
  num_trials: 50
  search_space:
    learning_rate:
      type: "loguniform"
      low: 1e-5
      high: 1e-3
    lora_r:
      type: "choice"
      choices: [8, 16, 32, 64]

2. Run Optimization

# Using the main training script
python scripts/train.py --config configs/hyperopt.yaml

# Or using the dedicated hyperopt script
python scripts/hyperopt.py --config configs/hyperopt.yaml

3. Get Results

The optimizer will:

1. 🔍 Search the hyperparameter space intelligently
2. ⚡ Stop poor-performing trials early
3. 📊 Report the best configuration
4. 🏆 Train a final model with optimal hyperparameters

📋 Configuration Guide

Strategy Options

hyperopt:
  strategy: "combined"    # Recommended
  # Options: "optuna", "asha", "combined", "random"

Strategy Comparison

Strategy   Search Method    Early Stopping   Best For
combined   Bayesian (TPE)   ASHA             Recommended - best performance
optuna     Bayesian (TPE)   None             Small search spaces
asha       Random           ASHA             Large search spaces
random     Random           None             Baseline comparison

Search Space Definition

search_space:
  # Continuous parameters
  learning_rate:
    type: "loguniform"
    low: 1e-5
    high: 1e-3

  warmup_ratio:
    type: "uniform"
    low: 0.0
    high: 0.1

  # Discrete parameters
  lora_r:
    type: "choice"
    choices: [8, 16, 32, 64]

  per_device_train_batch_size:
    type: "choice"
    choices: [4, 8, 16]

  # Integer parameters
  num_epochs:
    type: "int"
    low: 1
    high: 10

Parameter Types

  • uniform: Continuous uniform distribution
  • loguniform: Log-uniform distribution (good for learning rates)
  • choice: Discrete categorical choices
  • int: Integer range
  • logint: Log-scale integer range
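
For orientation, here is a minimal sketch of how these types translate into Ray Tune sampling primitives, roughly what create_ray_tune_search_space builds internally (the exact mapping is an assumption; check the source for details):

from ray import tune

# Sketch of the assumed type-to-sampler mapping; parameter names are examples.
search_space = {
    "warmup_ratio": tune.uniform(0.0, 0.1),        # uniform
    "learning_rate": tune.loguniform(1e-5, 1e-3),  # loguniform
    "lora_r": tune.choice([8, 16, 32, 64]),        # choice
    "num_epochs": tune.randint(1, 10),             # int (upper bound exclusive in Ray Tune)
    "hidden_size": tune.lograndint(16, 1024),      # logint (hypothetical parameter)
}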

Trial Settings

hyperopt:
  num_trials: 50           # Total trials to run
  max_concurrent: 4        # Parallel trials
  timeout: 3600           # Max time in seconds

  # Metric to optimize
  metric: "eval_f1"
  mode: "max"             # "max" or "min"

Advanced Configuration

hyperopt:
  # Optuna settings
  optuna_sampler: "TPE"           # TPE, CMA, Random, GPSampler
  optuna_pruner: "median"         # median, hyperband, none
  optuna_n_startup_trials: 10

  # ASHA settings
  asha_max_t: 100                 # Max training iterations
  asha_grace_period: 10           # Min iterations before pruning
  asha_reduction_factor: 3        # Aggressiveness of pruning

  # Storage and logging
  results_dir: "./hyperopt_results"
  study_name: "mistral_ner_optimization"
  log_to_file: true
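
The optuna_* settings map onto standard Optuna objects. For orientation, a minimal sketch of the equivalent direct-Optuna setup (an assumption, not the project's literal code):

import optuna

# Assumed equivalents of the optuna_* settings above.
sampler = optuna.samplers.TPESampler(n_startup_trials=10)  # optuna_sampler + startup trials
pruner = optuna.pruners.MedianPruner()                     # optuna_pruner: "median"
study = optuna.create_study(
    study_name="mistral_ner_optimization",
    direction="maximize",
    sampler=sampler,
    pruner=pruner,
)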

🖥️ Usage Examples

Command Line Interface

# Basic optimization
python scripts/hyperopt.py --config configs/hyperopt.yaml

# Override strategy
python scripts/hyperopt.py --strategy combined --num-trials 20

# Save best configuration
python scripts/hyperopt.py --save-best-config best_hyperparams.yaml

# Dry run (show config and exit)
python scripts/hyperopt.py --dry-run

# Debug mode
python scripts/hyperopt.py --debug

Programmatic Usage

from src.config import Config
from src.hyperopt import HyperparameterOptimizer, create_objective_function
from src.hyperopt.utils import create_ray_tune_search_space

# Load configuration
config = Config.from_yaml("configs/hyperopt.yaml")

# Create optimizer
with HyperparameterOptimizer(config.hyperopt) as optimizer:
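    # Assumes train_dataset, eval_dataset and data_collator were produced by
    # your usual data-preparation step (not shown here).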
    # Create objective function
    objective_func = create_objective_function(
        base_config=config,
        hyperopt_config=config.hyperopt,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=data_collator,
    )

    # Create search space
    search_space = create_ray_tune_search_space(config.hyperopt)

    # Run optimization
    results = optimizer.optimize(objective_func, search_space, config)

    # Get best result
    best_result = results.get_best_result()
    print(f"Best F1: {best_result.metrics['eval_f1']:.4f}")
    print(f"Best config: {best_result.config}")

📊 Results and Analysis

Best Configuration

After optimization completes, you'll see:

==========================================
HYPERPARAMETER OPTIMIZATION COMPLETED
==========================================
Total trials completed: 50
Best eval_f1: 0.8742
Best hyperparameters:
  learning_rate: 0.00034
  lora_r: 32
  per_device_train_batch_size: 8
  warmup_ratio: 0.045
  weight_decay: 0.0021
==========================================

Saving Results

# Save best configuration to file
python scripts/hyperopt.py --save-best-config best_config.yaml
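
Programmatically, saving the best configuration amounts to dumping the winning trial's parameters as YAML; a minimal sketch (best_result as in the programmatic example above):

import yaml

# Persist the best trial's hyperparameters for a later training run.
with open("best_config.yaml", "w") as f:
    yaml.safe_dump(best_result.config, f)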

Results Directory Structure

hyperopt_results/
├── study_name/
│   ├── trial_*/
│   │   ├── checkpoint.json
│   │   └── result.json
│   └── experiment_state.json
├── hyperopt.log
└── best_config.yaml

🔧 Distributed Optimization

Ray Cluster Setup

# Start Ray cluster
ray start --head --port=6379

# On worker machines
ray start --address="head_node_ip:6379"

Configuration for Distributed Runs

hyperopt:
  ray_address: "ray://head_node_ip:10001"
  max_concurrent: 16        # More parallel trials
  resources_per_trial:
    cpu: 2.0
    gpu: 1.0
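
If you connect programmatically instead of via config, the equivalent is a Ray client call (a sketch; presumably the optimizer issues this itself when ray_address is set):

import ray

# Connect to an existing cluster through the Ray client endpoint.
ray.init(address="ray://head_node_ip:10001")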

🎛️ Strategy Details

Combined Strategy (Recommended)

The combined strategy uses:

  • Optuna's TPE for intelligent hyperparameter suggestions
  • Ray Tune's ASHA for efficient early stopping

hyperopt:
  strategy: "combined"
  optuna_enabled: true
  asha_enabled: true

  # TPE will suggest promising hyperparameters
  optuna_sampler: "TPE"

  # ASHA will stop poor trials early
  asha_grace_period: 1
  asha_reduction_factor: 3

Benefits:

  • 🧠 Smart hyperparameter exploration
  • ⚡ 50-80% reduction in computation time
  • 🎯 Better final performance than random search
  • 📈 Efficient convergence
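
Conceptually, the combined strategy wires an Optuna searcher into a Ray Tune run scheduled by ASHA. A minimal sketch of that wiring with stock Ray Tune APIs (the project's HyperparameterOptimizer encapsulates this; trainable and search_space are placeholders):

from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.optuna import OptunaSearch

# TPE proposes configurations; ASHA prunes underperformers at each rung.
tuner = tune.Tuner(
    trainable,  # training function that reports eval_f1 once per epoch
    param_space=search_space,
    tune_config=tune.TuneConfig(
        metric="eval_f1",
        mode="max",
        search_alg=OptunaSearch(),  # TPE sampler by default
        scheduler=ASHAScheduler(grace_period=1, reduction_factor=3),
        num_samples=50,
    ),
)
results = tuner.fit()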

Optuna-Only Strategy

hyperopt:
  strategy: "optuna"
  optuna_sampler: "TPE"    # Tree-structured Parzen Estimator

Best for:

  • Small search spaces
  • When you want to run all trials to completion
  • CPU-only environments

ASHA-Only Strategy

hyperopt:
  strategy: "asha"
  asha_max_t: 100
  asha_grace_period: 10

Best for:

  • Large search spaces
  • When early stopping is most important
  • Quick exploration

🐛 Troubleshooting

Common Issues

Ray Connection Issues

# Check Ray status
ray status

# Reset Ray
ray stop
ray start --head

Memory Issues

hyperopt:
  max_concurrent: 2        # Reduce parallel trials
  resources_per_trial:
    gpu: 0.5              # Share GPU memory

Slow Convergence

hyperopt:
  optuna_n_startup_trials: 20    # More random trials first
  asha_grace_period: 5          # Allow more training before stopping

Debug Mode

# Enable debug logging
python scripts/hyperopt.py --debug

# Check specific trial logs
tail -f hyperopt_results/study_name/trial_*/logs.txt

📈 Performance Tips

1. Efficient Search Space Design

# Good: Log-uniform for learning rates
learning_rate:
  type: "loguniform"
  low: 1e-5
  high: 1e-3

# Good: Discrete choices for model architecture
lora_r:
  type: "choice"
  choices: [8, 16, 32, 64]

# Avoid: Too large search spaces
# batch_size:
#   type: "int"
#   low: 1
#   high: 1000  # Too large!

2. Optimal Trial Settings

# For development
num_trials: 20
max_concurrent: 4

# For production
num_trials: 100
max_concurrent: 8

3. Early Stopping Configuration

# Aggressive early stopping (faster, may miss good configs)
asha_grace_period: 1
asha_reduction_factor: 4

# Conservative early stopping (slower, more thorough)
asha_grace_period: 5
asha_reduction_factor: 2
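
To make these numbers concrete: ASHA places promotion rungs at grace_period × reduction_factor^k iterations, and only roughly the top 1/reduction_factor of trials survive each rung. A tiny sketch (asha_rungs is a hypothetical helper, not part of the project):

# Hypothetical helper: where ASHA's promotion rungs fall for given settings.
def asha_rungs(grace_period: int, reduction_factor: int, max_t: int) -> list[int]:
    rungs, t = [], grace_period
    while t <= max_t:
        rungs.append(t)
        t *= reduction_factor
    return rungs

print(asha_rungs(1, 4, 100))  # [1, 4, 16, 64]      -> aggressive: prunes early and often
print(asha_rungs(5, 2, 100))  # [5, 10, 20, 40, 80] -> conservative: later, gentler pruning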

🔗 Integration with Existing Workflow

With Regular Training

# Run hyperparameter optimization first
python scripts/hyperopt.py --save-best-config best.yaml

# Then train with best configuration
python scripts/train.py --config best.yaml

With WandB Integration

The optimizer automatically:

  • 🚫 Disables WandB for individual trials (reduces clutter)
  • ✅ Enables WandB for final training with best hyperparameters
  • 📊 Logs the optimization summary to the main project
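
In Hugging Face Trainer terms this comes down to toggling report_to per run; a sketch of the pattern (an assumption about the internals, not the literal code):

from transformers import TrainingArguments

# Trials: keep W&B quiet so each trial doesn't spawn its own run.
trial_args = TrainingArguments(output_dir="trial_out", report_to="none")

# Final training with the best hyperparameters: W&B back on.
final_args = TrainingArguments(output_dir="final_out", report_to="wandb")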

With Existing Configs

# Your existing config
model:
  model_name: "mistralai/Mistral-7B-v0.3"
  # ... other settings

training:
  num_train_epochs: 5
  # ... other settings

# Add hyperopt section
hyperopt:
  enabled: true
  strategy: "combined"
  # ... hyperopt settings

📊 Case Studies

Case Study 1: Optimizing for CoNLL-2003

Goal: Maximize F1 score on CoNLL-2003 English dataset

Setup:

hyperopt:
  enabled: true
  strategy: "combined"
  num_trials: 100
  metric: "eval_f1"
  mode: "max"

  search_space:
    learning_rate:
      type: "loguniform"
      low: 1e-5
      high: 5e-4

    lora_r:
      type: "choice"
      choices: [16, 32, 64, 128]

    lora_alpha:
      type: "choice"
      choices: [16, 32, 64, 128]

    lora_dropout:
      type: "uniform"
      low: 0.0
      high: 0.2

    warmup_ratio:
      type: "uniform"
      low: 0.0
      high: 0.1

    weight_decay:
      type: "loguniform"
      low: 1e-5
      high: 1e-2

Results:

Trial   Learning Rate   LoRA r   LoRA α   Dropout   F1 Score   Time (min)
1       2.3e-4          16       32       0.05      88.5%      45
15      1.8e-4          32       64       0.08      89.7%      52
34      3.1e-4          64       64       0.05      90.3%      68
67      2.7e-4          32       64       0.05      91.2%      51
89      2.9e-4          32       32       0.10      90.8%      49

Insights:

  • Optimal learning rate: 2.7e-4 (higher than the default 2e-4)
  • Best LoRA configuration: r=32, alpha=64 (a 2:1 ratio)
  • Low dropout (0.05) works better than higher values
  • Convergence was reached after ~70 trials

Case Study 2: Multi-Dataset Optimization

Goal: Find best hyperparameters for combined CoNLL + OntoNotes training

Challenge: Different datasets have different characteristics

Setup:

data:
  dataset_configs:
    - name: conll2003
      weight: 1.0
    - name: ontonotes
      weight: 1.0
  mixing_strategy: interleave

hyperopt:
  enabled: true
  strategy: "combined"
  num_trials: 80

  search_space:
    # Standard parameters
    learning_rate:
      type: "loguniform"
      low: 5e-6
      high: 2e-4

    # Dataset-specific optimization
    focal_gamma:
      type: "uniform"
      low: 1.0
      high: 4.0

    loss_type:
      type: "choice"
      choices: ["focal", "batch_balanced_focal", "class_balanced"]

    # Mixing parameters
    interleave_probs:
      type: "choice"
      choices: [[0.5, 0.5], [0.6, 0.4], [0.7, 0.3]]

Results:

Best configuration found:

  • Learning rate: 8.9e-5 (lower due to more data)
  • Loss: batch_balanced_focal with gamma=2.3
  • Interleave probabilities: [0.6, 0.4] (slight CoNLL bias)
  • Final F1: 89.8% (CoNLL), 87.2% (OntoNotes)

Case Study 3: PII Detection Optimization

Goal: Optimize for PII detection with extreme class imbalance

Setup:

data:
  dataset_configs:
    - name: gretel_pii
    - name: ai4privacy

hyperopt:
  enabled: true
  strategy: "combined"
  num_trials: 60
  metric: "eval_f1_macro"  # Macro F1 for imbalanced classes

  search_space:
    # Loss function parameters
    focal_gamma:
      type: "uniform"
      low: 2.0
      high: 5.0  # Higher for extreme imbalance

    batch_balance_beta:
      type: "loguniform"
      low: 0.99
      high: 0.9999

    # Training stability
    gradient_clip_val:
      type: "choice"
      choices: [0.5, 1.0, 2.0]

    learning_rate_scheduler:
      type: "choice"
      choices: ["linear", "cosine", "polynomial"]

Results Analysis:

Standard Focal Loss (γ = 2.0):

  • Macro F1: 72.3%
  • Rare class recall: 45%
  • Training time: 2.5h

Optimized Batch-Balanced (γ = 3.7, β = 0.995):

  • Macro F1: 81.7% ✅
  • Rare class recall: 68%
  • Training time: 2.8h

Case Study 4: Resource-Constrained Optimization

Goal: Find best configuration for 16GB GPU

Constraints: Limited memory, need efficient training

Setup:

hyperopt:
  enabled: true
  strategy: "asha"  # Focus on early stopping
  num_trials: 40

  search_space:
    # Memory-efficient parameters
    per_device_train_batch_size:
      type: "choice"
      choices: [2, 4, 6]

    gradient_accumulation_steps:
      type: "choice"
      choices: [4, 8, 16]

    lora_r:
      type: "choice"
      choices: [8, 16, 32]  # Lower values for memory

    gradient_checkpointing:
      type: "choice"
      choices: [true, false]

    mixed_precision:
      type: "choice"
      choices: ["fp16", "bf16", "no"]

Optimal Configuration Found:

# Best memory-efficient configuration
per_device_train_batch_size: 4
gradient_accumulation_steps: 8  # Effective batch size: 32
lora_r: 16
gradient_checkpointing: true
mixed_precision: "fp16"
# Result: 89.3% F1 with 14.2GB peak memory

Next Steps

After finding optimal hyperparameters, check our Configuration Reference to understand all available options and create your production configuration.