# Hyperparameter Optimization for Mistral NER

This document describes the hyperparameter optimization feature, which combines Bayesian optimization (via Optuna) with Hyperband-style early stopping (via Ray Tune's ASHA scheduler) for efficient, intelligent hyperparameter search.
## 🎯 Overview

The hyperparameter optimization system provides multiple strategies for finding optimal configurations:

- 🧠 **Bayesian Optimization**: intelligent search using Optuna's TPE algorithm
- ⚡ **Hyperband**: efficient early stopping using Ray Tune's ASHA scheduler
- 🔥 **Combined Strategy**: best of both worlds, intelligent search plus smart stopping
- 🎲 **Random Search**: a baseline for comparison
## 🚀 Quick Start

### 1. Enable Hyperparameter Optimization

Add to your config YAML:

```yaml
hyperopt:
  enabled: true
  strategy: "combined"  # Recommended: Bayesian + Hyperband
  num_trials: 50
  search_space:
    learning_rate:
      type: "loguniform"
      low: 1e-5
      high: 1e-3
    lora_r:
      type: "choice"
      choices: [8, 16, 32, 64]
```
### 2. Run Optimization

```bash
# Using the main training script
python scripts/train.py --config configs/hyperopt.yaml

# Or using the dedicated hyperopt script
python scripts/hyperopt.py --config configs/hyperopt.yaml
```
### 3. Get Results

The optimizer will:

1. 🔍 Search the hyperparameter space intelligently
2. ⚡ Stop poor-performing trials early
3. 📊 Report the best configuration
4. 🏆 Train a final model with the optimal hyperparameters
## 📋 Configuration Guide

### Strategy Options

#### Strategy Comparison

| Strategy | Search Method | Early Stopping | Best For |
|---|---|---|---|
| `combined` | Bayesian (TPE) | ASHA | **Recommended**: best performance |
| `optuna` | Bayesian (TPE) | None | Small search spaces |
| `asha` | Random | ASHA | Large search spaces |
| `random` | Random | None | Baseline comparison |
### Search Space Definition

```yaml
search_space:
  # Continuous parameters
  learning_rate:
    type: "loguniform"
    low: 1e-5
    high: 1e-3
  warmup_ratio:
    type: "uniform"
    low: 0.0
    high: 0.1

  # Discrete parameters
  lora_r:
    type: "choice"
    choices: [8, 16, 32, 64]
  per_device_train_batch_size:
    type: "choice"
    choices: [4, 8, 16]

  # Integer parameters
  num_epochs:
    type: "int"
    low: 1
    high: 10
```
Parameter Types¶
uniform
: Continuous uniform distributionloguniform
: Log-uniform distribution (good for learning rates)choice
: Discrete categorical choicesint
: Integer rangelogint
: Log-scale integer range
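To make the mapping concrete, here is a minimal sketch of how these types could translate into Ray Tune's sampling primitives. The project's actual `create_ray_tune_search_space` may differ in details; `build_search_space` below is a hypothetical helper, not a project API:

```python
# Hypothetical sketch: mapping the YAML parameter types onto Ray Tune's
# built-in samplers. tune.randint/lograndint use an exclusive upper bound,
# hence the +1 to make the YAML's "high" inclusive.
from ray import tune

TYPE_TO_SAMPLER = {
    "uniform": lambda s: tune.uniform(s["low"], s["high"]),
    "loguniform": lambda s: tune.loguniform(s["low"], s["high"]),
    "choice": lambda s: tune.choice(s["choices"]),
    "int": lambda s: tune.randint(s["low"], s["high"] + 1),
    "logint": lambda s: tune.lograndint(s["low"], s["high"] + 1),
}

def build_search_space(spec: dict) -> dict:
    """Turn a YAML-style search_space mapping into Ray Tune samplers."""
    return {name: TYPE_TO_SAMPLER[s["type"]](s) for name, s in spec.items()}

search_space = build_search_space({
    "learning_rate": {"type": "loguniform", "low": 1e-5, "high": 1e-3},
    "num_epochs": {"type": "int", "low": 1, "high": 10},
})
```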
### Trial Settings

```yaml
hyperopt:
  num_trials: 50     # Total trials to run
  max_concurrent: 4  # Parallel trials
  timeout: 3600      # Max time in seconds

  # Metric to optimize
  metric: "eval_f1"
  mode: "max"        # "max" or "min"
```
### Advanced Configuration

```yaml
hyperopt:
  # Optuna settings
  optuna_sampler: "TPE"     # TPE, CMA, Random, GPSampler
  optuna_pruner: "median"   # median, hyperband, none
  optuna_n_startup_trials: 10

  # ASHA settings
  asha_max_t: 100           # Max training iterations
  asha_grace_period: 10     # Min iterations before pruning
  asha_reduction_factor: 3  # Aggressiveness of pruning

  # Storage and logging
  results_dir: "./hyperopt_results"
  study_name: "mistral_ner_optimization"
  log_to_file: true
```
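For reference, here is a rough illustration (assumed, not the project's exact wiring) of how the Optuna settings above translate into a plain Optuna study:

```python
# Assumed translation of the Optuna-related settings into Optuna objects;
# the project wires these through Ray Tune rather than calling them directly.
import optuna

study = optuna.create_study(
    study_name="mistral_ner_optimization",
    direction="maximize",  # metric: "eval_f1", mode: "max"
    sampler=optuna.samplers.TPESampler(n_startup_trials=10),  # optuna_sampler / optuna_n_startup_trials
    pruner=optuna.pruners.MedianPruner(),                     # optuna_pruner: "median"
)
```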
## 🖥️ Usage Examples

### Command Line Interface

```bash
# Basic optimization
python scripts/hyperopt.py --config configs/hyperopt.yaml

# Override strategy
python scripts/hyperopt.py --strategy combined --num-trials 20

# Save best configuration
python scripts/hyperopt.py --save-best-config best_hyperparams.yaml

# Dry run (show config and exit)
python scripts/hyperopt.py --dry-run

# Debug mode
python scripts/hyperopt.py --debug
```
### Programmatic Usage

```python
from src.config import Config
from src.hyperopt import HyperparameterOptimizer, create_objective_function
from src.hyperopt.utils import create_ray_tune_search_space

# Load configuration
config = Config.from_yaml("configs/hyperopt.yaml")

# Create optimizer
with HyperparameterOptimizer(config.hyperopt) as optimizer:
    # Create objective function
    objective_func = create_objective_function(
        base_config=config,
        hyperopt_config=config.hyperopt,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=data_collator,
    )

    # Create search space
    search_space = create_ray_tune_search_space(config.hyperopt)

    # Run optimization
    results = optimizer.optimize(objective_func, search_space, config)

    # Get best result
    best_result = results.get_best_result()
    print(f"Best F1: {best_result.metrics['eval_f1']:.4f}")
    print(f"Best config: {best_result.config}")
```
## 📊 Results and Analysis

### Best Configuration

After optimization completes, you'll see:

```
==========================================
HYPERPARAMETER OPTIMIZATION COMPLETED
==========================================
Total trials completed: 50
Best eval_f1: 0.8742

Best hyperparameters:
  learning_rate: 0.00034
  lora_r: 32
  per_device_train_batch_size: 8
  warmup_ratio: 0.045
  weight_decay: 0.0021
==========================================
```
### Saving Results
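The winning hyperparameters can be written straight to a reusable config with the `--save-best-config` flag shown in the CLI examples above:

```bash
# Write the best hyperparameters to a reusable YAML config
python scripts/hyperopt.py --config configs/hyperopt.yaml \
    --save-best-config best_config.yaml
```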
#### Results Directory Structure

```
hyperopt_results/
├── study_name/
│   ├── trial_*/
│   │   ├── checkpoint.json
│   │   └── result.json
│   └── experiment_state.json
├── hyperopt.log
└── best_config.yaml
```
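To inspect finished trials programmatically, you can walk this directory. Note that Ray Tune's `result.json` is typically newline-delimited JSON (one record per reported step), so the sketch below reads each trial's last record; this is an assumption about the on-disk layout above, not a project API:

```python
import json
import pathlib

# Print the final eval_f1 reported by each trial (assumes the layout above
# and newline-delimited JSON records in result.json).
for result_file in pathlib.Path("hyperopt_results").glob("*/trial_*/result.json"):
    last_record = result_file.read_text().strip().splitlines()[-1]
    metrics = json.loads(last_record)
    print(result_file.parent.name, metrics.get("eval_f1"))
```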
## 🔧 Distributed Optimization

### Ray Cluster Setup

```bash
# Start the Ray head node
ray start --head --port=6379

# On the other machines, join the cluster
ray start --address="head_node_ip:6379"
```
### Configuration for Distributed Runs

```yaml
hyperopt:
  ray_address: "ray://head_node_ip:10001"
  max_concurrent: 16  # More parallel trials
  resources_per_trial:
    cpu: 2.0
    gpu: 1.0
```
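Under the hood, pointing `ray_address` at a cluster amounts to connecting through the Ray Client endpoint (port 10001 by default) instead of starting a local Ray instance; a minimal sketch:

```python
import ray

# Connect to an existing cluster via the Ray Client endpoint instead of
# starting a local Ray instance. Replace head_node_ip with your head node.
ray.init(address="ray://head_node_ip:10001")
```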
## 🎛️ Strategy Details

### Combined Strategy (Recommended)

The combined strategy uses:

- Optuna's TPE for intelligent hyperparameter suggestions
- Ray Tune's ASHA for efficient early stopping

```yaml
hyperopt:
  strategy: "combined"
  optuna_enabled: true
  asha_enabled: true

  # TPE will suggest promising hyperparameters
  optuna_sampler: "TPE"

  # ASHA will stop poor trials early
  asha_grace_period: 1
  asha_reduction_factor: 3
```

Benefits:

- 🧠 Smart hyperparameter exploration
- ⚡ 50-80% reduction in computation time
- 🎯 Better final performance than random search
- 📈 Efficient convergence
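For intuition, this is roughly what the combined strategy looks like when expressed directly against public Ray Tune APIs. This is a minimal sketch: `trainable` and `search_space` stand in for your objective function and search space, and the project's `HyperparameterOptimizer` wraps the equivalent wiring for you:

```python
from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.optuna import OptunaSearch

tuner = tune.Tuner(
    trainable,  # your objective function
    tune_config=tune.TuneConfig(
        metric="eval_f1",
        mode="max",
        search_alg=OptunaSearch(),  # Bayesian (TPE) suggestions
        scheduler=ASHAScheduler(    # early stopping of weak trials
            grace_period=1,
            reduction_factor=3,
        ),
        num_samples=50,
    ),
    param_space=search_space,
)
results = tuner.fit()
```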
### Optuna-Only Strategy

Best for:

- Small search spaces
- When you want to run all trials to completion
- CPU-only environments

### ASHA-Only Strategy

Best for:

- Large search spaces
- When early stopping is most important
- Quick exploration
## 🐛 Troubleshooting

### Common Issues

#### Ray Connection Issues
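If trials fail to launch or the optimizer cannot reach the cluster, a few generic checks (standard Ray CLI commands, not project-specific):

```bash
# Check whether a Ray cluster is reachable and healthy
ray status

# Stop any stale local Ray processes and start fresh
ray stop
ray start --head --port=6379
```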
#### Memory Issues

```yaml
hyperopt:
  max_concurrent: 2  # Reduce parallel trials
  resources_per_trial:
    gpu: 0.5  # Two trials share each GPU
```
#### Slow Convergence

```yaml
hyperopt:
  optuna_n_startup_trials: 20  # More random trials first
  asha_grace_period: 5         # Allow more training before stopping
```
### Debug Mode

```bash
# Enable debug logging
python scripts/hyperopt.py --debug

# Check specific trial logs
tail -f hyperopt_results/study_name/trial_*/logs.txt
```
## 📈 Performance Tips

### 1. Efficient Search Space Design

```yaml
# Good: log-uniform for learning rates
learning_rate:
  type: "loguniform"
  low: 1e-5
  high: 1e-3

# Good: discrete choices for model architecture
lora_r:
  type: "choice"
  choices: [8, 16, 32, 64]

# Avoid: overly large ranges
# batch_size:
#   type: "int"
#   low: 1
#   high: 1000  # Too large!
```
### 2. Optimal Trial Settings

```yaml
# For development
num_trials: 20
max_concurrent: 4

# For production
num_trials: 100
max_concurrent: 8
```
### 3. Early Stopping Configuration

```yaml
# Aggressive early stopping (faster, may miss good configs)
asha_grace_period: 1
asha_reduction_factor: 4

# Conservative early stopping (slower, more thorough)
asha_grace_period: 5
asha_reduction_factor: 2
```
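To see what these two knobs imply: ASHA evaluates trials at rungs `grace_period`, `grace_period × reduction_factor`, `grace_period × reduction_factor²`, and so on, promoting roughly the top `1/reduction_factor` of trials at each rung. A quick back-of-envelope calculation:

```python
# Back-of-envelope: the rung schedule implied by grace_period and
# reduction_factor, capped at asha_max_t (100 in the advanced config above).
def asha_rungs(grace_period: int, reduction_factor: int, max_t: int = 100) -> list[int]:
    rung, rungs = grace_period, []
    while rung <= max_t:
        rungs.append(rung)
        rung *= reduction_factor
    return rungs

print(asha_rungs(1, 4))  # aggressive:   [1, 4, 16, 64]
print(asha_rungs(5, 2))  # conservative: [5, 10, 20, 40, 80]
```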
## 🔗 Integration with Existing Workflow

### With Regular Training

```bash
# Run hyperparameter optimization first
python scripts/hyperopt.py --save-best-config best.yaml

# Then train with the best configuration
python scripts/train.py --config best.yaml
```
### With WandB Integration

The optimizer automatically:

- 🚫 Disables WandB for individual trials (reduces clutter)
- ✅ Enables WandB for the final training run with the best hyperparameters
- 📊 Logs the optimization summary to the main project
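If you ever need to reproduce the per-trial silencing yourself, the standard WandB mechanism is the `WANDB_MODE` environment variable (a generic WandB feature, not project-specific):

```python
import os

# Disable WandB logging for the current process (e.g., inside a trial).
os.environ["WANDB_MODE"] = "disabled"
```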
### With Existing Configs

```yaml
# Your existing config
model:
  model_name: "mistralai/Mistral-7B-v0.3"
  # ... other settings

training:
  num_train_epochs: 5
  # ... other settings

# Add a hyperopt section
hyperopt:
  enabled: true
  strategy: "combined"
  # ... hyperopt settings
```
## 📊 Case Studies

### Case Study 1: Optimizing for CoNLL-2003

**Goal**: Maximize F1 score on the CoNLL-2003 English dataset

**Setup**:

```yaml
hyperopt:
  enabled: true
  strategy: "combined"
  num_trials: 100
  metric: "eval_f1"
  mode: "max"

  search_space:
    learning_rate:
      type: "loguniform"
      low: 1e-5
      high: 5e-4
    lora_r:
      type: "choice"
      choices: [16, 32, 64, 128]
    lora_alpha:
      type: "choice"
      choices: [16, 32, 64, 128]
    lora_dropout:
      type: "uniform"
      low: 0.0
      high: 0.2
    warmup_ratio:
      type: "uniform"
      low: 0.0
      high: 0.1
    weight_decay:
      type: "loguniform"
      low: 1e-5
      high: 1e-2
```
**Results**:

| Trial | Learning Rate | LoRA r | LoRA α | Dropout | F1 Score | Time (min) |
|---|---|---|---|---|---|---|
| 1 | 2.3e-4 | 16 | 32 | 0.05 | 88.5% | 45 |
| 15 | 1.8e-4 | 32 | 64 | 0.08 | 89.7% | 52 |
| 34 | 3.1e-4 | 64 | 64 | 0.05 | 90.3% | 68 |
| 67 | 2.7e-4 | 32 | 64 | 0.05 | 91.2% | 51 |
| 89 | 2.9e-4 | 32 | 32 | 0.10 | 90.8% | 49 |
**Insights**:

- Optimal learning rate: 2.7e-4 (higher than the default 2e-4)
- Best LoRA configuration: r=32, alpha=64 (a 2:1 alpha-to-r ratio)
- Low dropout (0.05) works better than higher values
- Convergence achieved after ~70 trials
### Case Study 2: Multi-Dataset Optimization

**Goal**: Find the best hyperparameters for combined CoNLL + OntoNotes training

**Challenge**: The two datasets have different characteristics

**Setup**:

```yaml
data:
  dataset_configs:
    - name: conll2003
      weight: 1.0
    - name: ontonotes
      weight: 1.0
  mixing_strategy: interleave

hyperopt:
  enabled: true
  strategy: "combined"
  num_trials: 80

  search_space:
    # Standard parameters
    learning_rate:
      type: "loguniform"
      low: 5e-6
      high: 2e-4

    # Dataset-specific optimization
    focal_gamma:
      type: "uniform"
      low: 1.0
      high: 4.0
    loss_type:
      type: "choice"
      choices: ["focal", "batch_balanced_focal", "class_balanced"]

    # Mixing parameters
    interleave_probs:
      type: "choice"
      choices: [[0.5, 0.5], [0.6, 0.4], [0.7, 0.3]]
```
**Results**: The best configuration found:

- Learning rate: 8.9e-5 (lower because more data is available)
- Loss: batch_balanced_focal with gamma=2.3
- Interleave probabilities: [0.6, 0.4] (a slight CoNLL bias)
- Final F1: 89.8% (CoNLL), 87.2% (OntoNotes)
### Case Study 3: PII Detection Optimization

**Goal**: Optimize for PII detection with extreme class imbalance

**Setup**:

```yaml
data:
  dataset_configs:
    - name: gretel_pii
    - name: ai4privacy

hyperopt:
  enabled: true
  strategy: "combined"
  num_trials: 60
  metric: "eval_f1_macro"  # Macro F1 for imbalanced classes

  search_space:
    # Loss function parameters
    focal_gamma:
      type: "uniform"
      low: 2.0
      high: 5.0  # Higher for extreme imbalance
    batch_balance_beta:
      type: "loguniform"
      low: 0.99
      high: 0.9999

    # Training stability
    gradient_clip_val:
      type: "choice"
      choices: [0.5, 1.0, 2.0]
    learning_rate_scheduler:
      type: "choice"
      choices: ["linear", "cosine", "polynomial"]
```
**Results Analysis**:

Standard focal loss (γ = 2.0):

- Macro F1: 72.3%
- Rare-class recall: 45%
- Training time: 2.5 h

Optimized batch-balanced focal loss (γ = 3.7, β = 0.995):

- Macro F1: 81.7% ✅
- Rare-class recall: 68%
- Training time: 2.8 h
### Case Study 4: Resource-Constrained Optimization

**Goal**: Find the best configuration for a 16 GB GPU

**Constraints**: Limited memory; training must stay efficient

**Setup**:

```yaml
hyperopt:
  enabled: true
  strategy: "asha"  # Focus on early stopping
  num_trials: 40

  search_space:
    # Memory-efficient parameters
    per_device_train_batch_size:
      type: "choice"
      choices: [2, 4, 6]
    gradient_accumulation_steps:
      type: "choice"
      choices: [4, 8, 16]
    lora_r:
      type: "choice"
      choices: [8, 16, 32]  # Lower values for memory
    gradient_checkpointing:
      type: "choice"
      choices: [true, false]
    mixed_precision:
      type: "choice"
      choices: ["fp16", "bf16", "no"]
```
**Optimal Configuration Found**:

```yaml
# Best memory-efficient configuration
per_device_train_batch_size: 4
gradient_accumulation_steps: 8  # Effective batch size: 4 × 8 = 32
lora_r: 16
gradient_checkpointing: true
mixed_precision: "fp16"

# Result: 89.3% F1 with 14.2 GB peak memory
```
**Next Steps**

After finding optimal hyperparameters, check our Configuration Reference to understand all available options and create your production configuration.