Hyperparameter Optimization for Mistral NER

This document describes the hyperparameter optimization feature that combines Bayesian Optimization (via Optuna) with Hyperband (via Ray Tune) for efficient and intelligent hyperparameter search.

🎯 Overview

The hyperparameter optimization system provides multiple strategies for finding optimal configurations:

  • 🧠 Bayesian Optimization: Intelligent search using Optuna's TPE algorithm
  • ⚡ Hyperband: Efficient early stopping using Ray Tune's ASHA scheduler
  • 🔥 Combined Strategy: Best of both worlds - intelligent search + smart stopping
  • 🎲 Random Search: Baseline comparison strategy

🚀 Quick Start

1. Enable Hyperparameter Optimization

Add to your config YAML:

hyperopt:
  enabled: true
  strategy: "combined"  # Recommended: Bayesian + Hyperband
  num_trials: 50
  search_space:
    learning_rate:
      type: "loguniform"
      low: 1e-5
      high: 1e-3
    lora_r:
      type: "choice"
      choices: [8, 16, 32, 64]

2. Run Optimization

# Using the main training script
python scripts/train.py --config configs/hyperopt.yaml

# Or using the dedicated hyperopt script
python scripts/hyperopt.py --config configs/hyperopt.yaml

3. Get Results

The optimizer will:

1. 🔍 Search the hyperparameter space intelligently
2. ⚡ Stop poor-performing trials early
3. 📊 Report the best configuration
4. 🏆 Train a final model with optimal hyperparameters

📋 Configuration Guide

Strategy Options

hyperopt:
  strategy: "combined"    # Recommended
  # Options: "optuna", "asha", "combined", "random"

Strategy Comparison

Strategy   Search Method    Early Stopping   Best For
combined   Bayesian (TPE)   ASHA             Recommended - best performance
optuna     Bayesian (TPE)   None             Small search spaces
asha       Random           ASHA             Large search spaces
random     Random           None             Baseline comparison

Search Space Definition

search_space:
  # Continuous parameters
  learning_rate:
    type: "loguniform"
    low: 1e-5
    high: 1e-3

  warmup_ratio:
    type: "uniform"
    low: 0.0
    high: 0.1

  # Discrete parameters
  lora_r:
    type: "choice"
    choices: [8, 16, 32, 64]

  per_device_train_batch_size:
    type: "choice"
    choices: [4, 8, 16]

  # Integer parameters
  num_epochs:
    type: "int"
    low: 1
    high: 10

Parameter Types

  • uniform: Continuous uniform distribution
  • loguniform: Log-uniform distribution (good for learning rates)
  • choice: Discrete categorical choices
  • int: Integer range
  • logint: Log-scale integer range
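
For orientation, here is a minimal sketch of how these types translate into Ray Tune sampling primitives, roughly what create_ray_tune_search_space builds internally (the exact mapping is an assumption; check the source for details):

from ray import tune

# Sketch of the assumed type-to-sampler mapping; parameter names are examples.
search_space = {
    "warmup_ratio": tune.uniform(0.0, 0.1),        # uniform
    "learning_rate": tune.loguniform(1e-5, 1e-3),  # loguniform
    "lora_r": tune.choice([8, 16, 32, 64]),        # choice
    "num_epochs": tune.randint(1, 10),             # int (upper bound exclusive in Ray Tune)
    "hidden_size": tune.lograndint(16, 1024),      # logint (hypothetical parameter)
}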

Trial Settings

hyperopt:
  num_trials: 50           # Total trials to run
  max_concurrent: 4        # Parallel trials
  timeout: 3600           # Max time in seconds

  # Metric to optimize
  metric: "eval_f1"
  mode: "max"             # "max" or "min"

Advanced Configuration

hyperopt:
  # Optuna settings
  optuna_sampler: "TPE"           # TPE, CMA, Random, GPSampler
  optuna_pruner: "median"         # median, hyperband, none
  optuna_n_startup_trials: 10

  # ASHA settings
  asha_max_t: 100                 # Max training iterations
  asha_grace_period: 10           # Min iterations before pruning
  asha_reduction_factor: 3        # Aggressiveness of pruning

  # Storage and logging
  results_dir: "./hyperopt_results"
  study_name: "mistral_ner_optimization"
  log_to_file: true
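
The optuna_* settings map onto standard Optuna objects. For orientation, a minimal sketch of the equivalent direct-Optuna setup (an assumption, not the project's literal code):

import optuna

# Assumed equivalents of the optuna_* settings above.
sampler = optuna.samplers.TPESampler(n_startup_trials=10)  # optuna_sampler + startup trials
pruner = optuna.pruners.MedianPruner()                     # optuna_pruner: "median"
study = optuna.create_study(
    study_name="mistral_ner_optimization",
    direction="maximize",
    sampler=sampler,
    pruner=pruner,
)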

🖥️ Usage Examples

Command Line Interface

# Basic optimization
python scripts/hyperopt.py --config configs/hyperopt.yaml

# Override strategy
python scripts/hyperopt.py --strategy combined --num-trials 20

# Save best configuration
python scripts/hyperopt.py --save-best-config best_hyperparams.yaml

# Dry run (show config and exit)
python scripts/hyperopt.py --dry-run

# Debug mode
python scripts/hyperopt.py --debug

Programmatic Usage

from src.config import Config
from src.hyperopt import HyperparameterOptimizer, create_objective_function
from src.hyperopt.utils import create_ray_tune_search_space

# Load configuration
config = Config.from_yaml("configs/hyperopt.yaml")

# Create optimizer
with HyperparameterOptimizer(config.hyperopt) as optimizer:
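    # Assumes train_dataset, eval_dataset and data_collator were produced by
    # your usual data-preparation step (not shown here).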
    # Create objective function
    objective_func = create_objective_function(
        base_config=config,
        hyperopt_config=config.hyperopt,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=data_collator,
    )

    # Create search space
    search_space = create_ray_tune_search_space(config.hyperopt)

    # Run optimization
    results = optimizer.optimize(objective_func, search_space, config)

    # Get best result
    best_result = results.get_best_result()
    print(f"Best F1: {best_result.metrics['eval_f1']:.4f}")
    print(f"Best config: {best_result.config}")

📊 Results and Analysis

Best Configuration

After optimization completes, you'll see:

==========================================
HYPERPARAMETER OPTIMIZATION COMPLETED
==========================================
Total trials completed: 50
Best eval_f1: 0.8742
Best hyperparameters:
  learning_rate: 0.00034
  lora_r: 32
  per_device_train_batch_size: 8
  warmup_ratio: 0.045
  weight_decay: 0.0021
==========================================

Saving Results

# Save best configuration to file
python scripts/hyperopt.py --save-best-config best_config.yaml
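
Programmatically, saving the best configuration amounts to dumping the winning trial's parameters as YAML; a minimal sketch (best_result as in the programmatic example above):

import yaml

# Persist the best trial's hyperparameters for a later training run.
with open("best_config.yaml", "w") as f:
    yaml.safe_dump(best_result.config, f)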

Results Directory Structure

hyperopt_results/
├── study_name/
│   ├── trial_*/
│   │   ├── checkpoint.json
│   │   └── result.json
│   └── experiment_state.json
├── hyperopt.log
└── best_config.yaml

🔧 Distributed Optimization

Ray Cluster Setup

# Start Ray cluster
ray start --head --port=6379

# On worker machines
ray start --address="head_node_ip:6379"

Configuration for Distributed Runs

hyperopt:
  ray_address: "ray://head_node_ip:10001"
  max_concurrent: 16        # More parallel trials
  resources_per_trial:
    cpu: 2.0
    gpu: 1.0
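
If you connect programmatically instead of via config, the equivalent is a Ray client call (a sketch; presumably the optimizer issues this itself when ray_address is set):

import ray

# Connect to an existing cluster through the Ray client endpoint.
ray.init(address="ray://head_node_ip:10001")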

🎛️ Strategy Details

Combined Strategy (Recommended)

The combined strategy uses:

  • Optuna's TPE for intelligent hyperparameter suggestions
  • Ray Tune's ASHA for efficient early stopping

hyperopt:
  strategy: "combined"
  optuna_enabled: true
  asha_enabled: true

  # TPE will suggest promising hyperparameters
  optuna_sampler: "TPE"

  # ASHA will stop poor trials early
  asha_grace_period: 1
  asha_reduction_factor: 3

Benefits:

  • 🧠 Smart hyperparameter exploration
  • ⚡ 50-80% reduction in computation time
  • 🎯 Better final performance than random search
  • 📈 Efficient convergence
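
Conceptually, the combined strategy wires an Optuna searcher into a Ray Tune run scheduled by ASHA. A minimal sketch of that wiring with stock Ray Tune APIs (the project's HyperparameterOptimizer encapsulates this; trainable and search_space are placeholders):

from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.optuna import OptunaSearch

# TPE proposes configurations; ASHA prunes underperformers at each rung.
tuner = tune.Tuner(
    trainable,  # training function that reports eval_f1 once per epoch
    param_space=search_space,
    tune_config=tune.TuneConfig(
        metric="eval_f1",
        mode="max",
        search_alg=OptunaSearch(),  # TPE sampler by default
        scheduler=ASHAScheduler(grace_period=1, reduction_factor=3),
        num_samples=50,
    ),
)
results = tuner.fit()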

Optuna-Only Strategy

hyperopt:
  strategy: "optuna"
  optuna_sampler: "TPE"    # Tree-structured Parzen Estimator

Best for:

  • Small search spaces
  • When you want to run all trials to completion
  • CPU-only environments

ASHA-Only Strategy

hyperopt:
  strategy: "asha"
  asha_max_t: 100
  asha_grace_period: 10

Best for:

  • Large search spaces
  • When early stopping is most important
  • Quick exploration

🐛 Troubleshooting

Common Issues

Ray Connection Issues

# Check Ray status
ray status

# Reset Ray
ray stop
ray start --head

Memory Issues

hyperopt:
  max_concurrent: 2        # Reduce parallel trials
  resources_per_trial:
    gpu: 0.5              # Share GPU memory

Slow Convergence

hyperopt:
  optuna_n_startup_trials: 20    # More random trials first
  asha_grace_period: 5          # Allow more training before stopping

Debug Mode

# Enable debug logging
python scripts/hyperopt.py --debug

# Check specific trial logs
tail -f hyperopt_results/study_name/trial_*/logs.txt

📈 Performance Tips

1. Efficient Search Space Design

# Good: Log-uniform for learning rates
learning_rate:
  type: "loguniform"
  low: 1e-5
  high: 1e-3

# Good: Discrete choices for model architecture
lora_r:
  type: "choice"
  choices: [8, 16, 32, 64]

# Avoid: Too large search spaces
# batch_size:
#   type: "int"
#   low: 1
#   high: 1000  # Too large!

2. Optimal Trial Settings

# For development
num_trials: 20
max_concurrent: 4

# For production
num_trials: 100
max_concurrent: 8

3. Early Stopping Configuration

# Aggressive early stopping (faster, may miss good configs)
asha_grace_period: 1
asha_reduction_factor: 4

# Conservative early stopping (slower, more thorough)
asha_grace_period: 5
asha_reduction_factor: 2
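
To make these numbers concrete: ASHA places promotion rungs at grace_period × reduction_factor^k iterations, and only roughly the top 1/reduction_factor of trials survive each rung. A tiny sketch (asha_rungs is a hypothetical helper, not part of the project):

# Hypothetical helper: where ASHA's promotion rungs fall for given settings.
def asha_rungs(grace_period: int, reduction_factor: int, max_t: int) -> list[int]:
    rungs, t = [], grace_period
    while t <= max_t:
        rungs.append(t)
        t *= reduction_factor
    return rungs

print(asha_rungs(1, 4, 100))  # [1, 4, 16, 64]      -> aggressive: prunes early and often
print(asha_rungs(5, 2, 100))  # [5, 10, 20, 40, 80] -> conservative: later, gentler pruning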

🔗 Integration with Existing Workflow

With Regular Training

# Run hyperparameter optimization first
python scripts/hyperopt.py --save-best-config best.yaml

# Then train with best configuration
python scripts/train.py --config best.yaml

With WandB Integration

The optimizer automatically:

  • 🚫 Disables WandB for individual trials (reduces clutter)
  • ✅ Enables WandB for final training with best hyperparameters
  • 📊 Logs the optimization summary to the main project
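
In Hugging Face Trainer terms this comes down to toggling report_to per run; a sketch of the pattern (an assumption about the internals, not the literal code):

from transformers import TrainingArguments

# Trials: keep W&B quiet so each trial doesn't spawn its own run.
trial_args = TrainingArguments(output_dir="trial_out", report_to="none")

# Final training with the best hyperparameters: W&B back on.
final_args = TrainingArguments(output_dir="final_out", report_to="wandb")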

With Existing Configs

# Your existing config
model:
  model_name: "mistralai/Mistral-7B-v0.3"
  # ... other settings

training:
  num_train_epochs: 5
  # ... other settings

# Add hyperopt section
hyperopt:
  enabled: true
  strategy: "combined"
  # ... hyperopt settings

📊 Case Studies

Case Study 1: Optimizing for CoNLL-2003

Goal: Maximize F1 score on CoNLL-2003 English dataset

Setup:

hyperopt:
  enabled: true
  strategy: "combined"
  num_trials: 100
  metric: "eval_f1"
  mode: "max"

  search_space:
    learning_rate:
      type: "loguniform"
      low: 1e-5
      high: 5e-4

    lora_r:
      type: "choice"
      choices: [16, 32, 64, 128]

    lora_alpha:
      type: "choice"
      choices: [16, 32, 64, 128]

    lora_dropout:
      type: "uniform"
      low: 0.0
      high: 0.2

    warmup_ratio:
      type: "uniform"
      low: 0.0
      high: 0.1

    weight_decay:
      type: "loguniform"
      low: 1e-5
      high: 1e-2

Results:

Trial   Learning Rate   LoRA r   LoRA α   Dropout   F1 Score   Time (min)
1       2.3e-4          16       32       0.05      88.5%      45
15      1.8e-4          32       64       0.08      89.7%      52
34      3.1e-4          64       64       0.05      90.3%      68
67      2.7e-4          32       64       0.05      91.2%      51
89      2.9e-4          32       32       0.10      90.8%      49

Insights:

  • Optimal learning rate: 2.7e-4 (higher than the default 2e-4)
  • Best LoRA configuration: r=32, alpha=64 (a 2:1 ratio)
  • Low dropout (0.05) works better than higher values
  • Convergence was reached after ~70 trials

Case Study 2: Multi-Dataset Optimization

Goal: Find best hyperparameters for combined CoNLL + OntoNotes training

Challenge: Different datasets have different characteristics

Setup:

data:
  dataset_configs:
    - name: conll2003
      weight: 1.0
    - name: ontonotes
      weight: 1.0
  mixing_strategy: interleave

hyperopt:
  enabled: true
  strategy: "combined"
  num_trials: 80

  search_space:
    # Standard parameters
    learning_rate:
      type: "loguniform"
      low: 5e-6
      high: 2e-4

    # Dataset-specific optimization
    focal_gamma:
      type: "uniform"
      low: 1.0
      high: 4.0

    loss_type:
      type: "choice"
      choices: ["focal", "batch_balanced_focal", "class_balanced"]

    # Mixing parameters
    interleave_probs:
      type: "choice"
      choices: [[0.5, 0.5], [0.6, 0.4], [0.7, 0.3]]

Results:

Best configuration found:

  • Learning rate: 8.9e-5 (lower due to more data)
  • Loss: batch_balanced_focal with gamma=2.3
  • Interleave probabilities: [0.6, 0.4] (slight CoNLL bias)
  • Final F1: 89.8% (CoNLL), 87.2% (OntoNotes)

Case Study 3: PII Detection Optimization

Goal: Optimize for PII detection with extreme class imbalance

Setup:

data:
  dataset_configs:
    - name: gretel_pii
    - name: ai4privacy

hyperopt:
  enabled: true
  strategy: "combined"
  num_trials: 60
  metric: "eval_f1_macro"  # Macro F1 for imbalanced classes

  search_space:
    # Loss function parameters
    focal_gamma:
      type: "uniform"
      low: 2.0
      high: 5.0  # Higher for extreme imbalance

    batch_balance_beta:
      type: "loguniform"
      low: 0.99
      high: 0.9999

    # Training stability
    gradient_clip_val:
      type: "choice"
      choices: [0.5, 1.0, 2.0]

    learning_rate_scheduler:
      type: "choice"
      choices: ["linear", "cosine", "polynomial"]

Results Analysis:

Standard Focal Loss (γ = 2.0):

  • Macro F1: 72.3%
  • Rare class recall: 45%
  • Training time: 2.5h

Optimized Batch-Balanced (γ = 3.7, β = 0.995):

  • Macro F1: 81.7% ✅
  • Rare class recall: 68%
  • Training time: 2.8h

Case Study 4: Resource-Constrained Optimization

Goal: Find best configuration for 16GB GPU

Constraints: Limited memory, need efficient training

Setup:

hyperopt:
  enabled: true
  strategy: "asha"  # Focus on early stopping
  num_trials: 40

  search_space:
    # Memory-efficient parameters
    per_device_train_batch_size:
      type: "choice"
      choices: [2, 4, 6]

    gradient_accumulation_steps:
      type: "choice"
      choices: [4, 8, 16]

    lora_r:
      type: "choice"
      choices: [8, 16, 32]  # Lower values for memory

    gradient_checkpointing:
      type: "choice"
      choices: [true, false]

    mixed_precision:
      type: "choice"
      choices: ["fp16", "bf16", "no"]

Optimal Configuration Found:

# Best memory-efficient configuration
per_device_train_batch_size: 4
gradient_accumulation_steps: 8  # Effective batch size: 32
lora_r: 16
gradient_checkpointing: true
mixed_precision: "fp16"
# Result: 89.3% F1 with 14.2GB peak memory

Next Steps

After finding optimal hyperparameters, check our Configuration Reference to understand all available options and create your production configuration.