Troubleshooting Guide

This guide covers common issues and their solutions when using Mistral NER.

Installation Issues

CUDA Not Available

Symptom: torch.cuda.is_available() returns False

Solutions:

  1. Verify CUDA installation:

    nvidia-smi
    nvcc --version

  2. Install correct PyTorch version:
    # For CUDA 12.x
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
    
    # For CUDA 11.x
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
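
After reinstalling, a quick check that PyTorch actually sees the GPU (see also Environment Debugging at the end of this guide):

    import torch

    print(torch.__version__)                  # should report a +cu12x / +cu11x build, not +cpu
    print(torch.cuda.is_available())          # expect True
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))  # the GPU PyTorch will use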
    

BitsAndBytes Installation Failed

Symptom: Error when installing or importing bitsandbytes

Solutions:

  1. Install from source:

    pip install bitsandbytes --no-binary bitsandbytes

  2. For Windows users:
    pip install bitsandbytes-windows
    

Training Issues

Out of Memory (OOM) Errors

Symptom: CUDA out of memory error during training

Solutions:

  1. Enable quantization:

    model:
      load_in_4bit: true  # Uses ~6GB instead of 24GB
    

  2. Reduce batch size:

    training:
      per_device_train_batch_size: 2
      gradient_accumulation_steps: 16
    

  3. Enable gradient checkpointing:

    training:
      gradient_checkpointing: true
    

  4. Clear cache periodically:

    training:
      clear_cache_steps: 50
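
The option above periodically releases cached CUDA memory. If you need the same behaviour outside this project's config, here is a minimal sketch using a Hugging Face TrainerCallback (the callback class and the 50-step interval are illustrative, not the project's internal implementation):

    import torch
    from transformers import TrainerCallback

    class ClearCacheCallback(TrainerCallback):
        """Release cached CUDA memory every `interval` optimizer steps."""

        def __init__(self, interval: int = 50):
            self.interval = interval

        def on_step_end(self, args, state, control, **kwargs):
            if state.global_step % self.interval == 0:
                torch.cuda.empty_cache()

    # trainer.add_callback(ClearCacheCallback(interval=50))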
    

Training Extremely Slow

Symptom: Each epoch takes hours

Solutions:

  1. Enable mixed precision:

    training:
      fp16: true  # or bf16 for A100 GPUs
    

  2. Check data loading:

    data:
      preprocessing_num_workers: 4
    

  3. Disable unnecessary logging:

    training:
      logging_steps: 100  # Increase from default 10
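
If you drive training with transformers directly rather than through the YAML config, the same three options map onto TrainingArguments (values here are illustrative):

    from transformers import TrainingArguments

    args = TrainingArguments(
        output_dir="./mistral-ner-finetuned",
        fp16=True,                  # or bf16=True on Ampere (e.g. A100) and newer GPUs
        dataloader_num_workers=4,   # parallel data loading
        logging_steps=100,          # log less often to cut overhead
    )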
    

Model Not Learning (Loss Not Decreasing)

Symptom: Loss plateaus or increases

Solutions:

  1. Adjust learning rate:

    training:
      learning_rate: 1e-4  # Try different values
      warmup_ratio: 0.1
    

  2. Change loss function (a minimal focal-loss sketch follows this list):

    training:
      loss_type: "focal"
      focal_gamma: 3.0  # Increase for imbalanced data
    

  3. Check data quality:

    # Visualize label distribution
    from src.visualization import plot_label_distribution
    plot_label_distribution(train_dataset)
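
For reference, a minimal sketch of the focal-loss idea behind loss_type: "focal": the (1 - p)^gamma factor down-weights tokens the model already classifies easily, so rare entity classes drive the gradient. The project's own implementation may differ in detail.

    import torch
    import torch.nn.functional as F

    def focal_loss(logits, labels, gamma=3.0, ignore_index=-100):
        """Cross-entropy scaled by (1 - p_correct) ** gamma, averaged over real tokens."""
        ce = F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            labels.view(-1),
            reduction="none",
            ignore_index=ignore_index,
        )
        pt = torch.exp(-ce)                     # probability of the correct class
        mask = labels.view(-1) != ignore_index  # skip padding / special tokens
        return ((1 - pt) ** gamma * ce)[mask].mean()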
    

Quantization Issues

Quantization Not Working

Symptom: Model loads in full precision despite quantization settings

Solutions:

  1. Check bitsandbytes installation:

    import bitsandbytes as bnb
    print(f"BitsAndBytes version: {bnb.__version__}")
    

  2. Verify configuration:

    model:
      load_in_8bit: true   # Only one should be true
      load_in_4bit: false
    

  3. Check GPU compatibility:

    - Requires a GPU with compute capability >= 3.5
    - Run nvidia-smi to check the GPU model
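
If you load the model yourself rather than through the project's config, quantization in transformers is requested via a BitsAndBytesConfig. A minimal 4-bit sketch (model name and dtypes are illustrative):

    import torch
    from transformers import AutoModelForTokenClassification, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                     # use load_in_8bit=True instead for 8-bit
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )

    model = AutoModelForTokenClassification.from_pretrained(
        "mistralai/Mistral-7B-v0.1",           # illustrative model name
        quantization_config=bnb_config,
        device_map="auto",
    )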

8-bit vs 4-bit Quantization

When to use each:

  - 8-bit: Better accuracy, ~10GB VRAM
  - 4-bit: More memory efficient, ~6GB VRAM

Inference Issues

Predictions All "O" (No Entities)

Symptom: Model only predicts non-entity labels

Solutions:

  1. Check the confidence threshold if confidence filtering is enabled
  2. Verify model loaded correctly (an adapter-loading sketch follows this list):

    # Check if LoRA weights are loaded
    print(model.peft_config)
    

  3. Use different checkpoint:

    # Try best checkpoint instead of final
    python scripts/inference.py --model-path ./mistral-ner-finetuned/checkpoint-best
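
If peft_config is missing, the LoRA adapter was probably never attached to the base model. A minimal sketch of loading the adapter explicitly with peft (base model name and label count are illustrative; the checkpoint path follows the example above):

    from peft import PeftModel
    from transformers import AutoModelForTokenClassification

    base = AutoModelForTokenClassification.from_pretrained(
        "mistralai/Mistral-7B-v0.1",  # illustrative base model
        num_labels=9,                 # must match the label set used for fine-tuning
    )
    model = PeftModel.from_pretrained(base, "./mistral-ner-finetuned/checkpoint-best")
    print(model.peft_config)          # should now list the loaded adapter(s)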
    

Tokenization Misalignment

Symptom: Entity boundaries don't match original text

Solutions:

  1. Enable proper tokenizer settings:

    tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        add_prefix_space=True  # Important for proper alignment
    )
    

  2. Check max_length setting:

    data:
      max_length: 256  # Increase if truncating
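
Misalignment usually comes from sub-word tokenization: one word becomes several tokens and the labels drift. A minimal sketch of the standard word_ids() alignment pattern (model name and labels are illustrative; requires a fast tokenizer):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", add_prefix_space=True)

    words  = ["John", "lives", "in", "Paris"]
    labels = [1, 0, 0, 2]                      # word-level labels, e.g. B-PER / O / O / B-LOC

    enc = tokenizer(words, is_split_into_words=True, truncation=True, max_length=256)

    aligned, previous = [], None
    for word_id in enc.word_ids():
        if word_id is None:                    # special tokens
            aligned.append(-100)
        elif word_id != previous:              # first sub-token keeps the word's label
            aligned.append(labels[word_id])
        else:                                  # remaining sub-tokens are ignored by the loss
            aligned.append(-100)
        previous = word_id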
    

WandB Issues

WandB Offline Mode Not Working

Symptom: Runs not syncing after coming online

Solutions:

  1. Sync manually:

    wandb sync ./wandb_logs/offline-run-*
    

  2. Check environment:

    export WANDB_MODE=offline
    export WANDB_DIR=./wandb_logs
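
The same settings can also be made in code rather than through environment variables (a sketch; the project name is illustrative and the training script normally handles this for you):

    import wandb

    run = wandb.init(
        project="mistral-ner",   # illustrative project name
        mode="offline",          # equivalent to WANDB_MODE=offline
        dir="./wandb_logs",      # equivalent to WANDB_DIR
    )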
    

See WandB Offline Mode for a detailed guide.

Multi-Dataset Issues

Label Mismatch Errors

Symptom: Error about incompatible label sets

Solutions:

  1. Check label mappings:

    for dataset_config in config.dataset_configs:
        loader = registry.get_loader(dataset_config.name)
        print(f"{dataset_config.name}: {loader.get_labels()}")
    

  2. Use unified label schema:

    data:
      use_unified_labels: true
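
Conceptually, a unified schema is just a mapping from each dataset's tag set onto one shared set before the datasets are mixed. A toy sketch (dataset names, tags, and mapping are illustrative, not the project's actual schema):

    # Map dataset-specific tags onto one shared tag set (illustrative only)
    UNIFIED_MAP = {
        "conll2003": {"B-MISC": "O", "I-MISC": "O"},  # drop tags the shared set lacks
        "wikiann": {},                                # already compatible in this toy example
    }

    def unify(dataset_name: str, tag: str) -> str:
        """Return the shared-schema tag, keeping the original when no remap is defined."""
        return UNIFIED_MAP[dataset_name].get(tag, tag)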
    

Performance Issues

Low F1 Score

Common causes and solutions:

  1. Class imbalance: Use focal loss
  2. Insufficient training: Increase epochs
  3. Poor hyperparameters: Use hyperopt
  4. Dataset quality: Check annotation consistency

Inconsistent Results

Solutions:

  1. Set seeds:

    training:
      seed: 42
      data_seed: 42
    

  2. Disable non-deterministic ops:

    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
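
If you are scripting outside the config, transformers also provides a one-call helper that seeds Python's random, NumPy, and torch (including CUDA):

    from transformers import set_seed

    set_seed(42)  # seed python, numpy, and torch RNGs in one call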
    

Debug Mode

Enable comprehensive debugging:

training:
  debug: true

logging:
  log_level: "debug"

This will show:

  - Detailed model loading info
  - Tokenization examples
  - Batch composition
  - Memory usage

Getting Help

If these solutions don't resolve your issue:

  1. Check logs: Look for error messages in ./logs/
  2. Run validation: python scripts/validate.py
  3. Create minimal example: Isolate the problem
  4. Report issue: Include config, error message, and environment info

Environment Debugging

Collect system information:

# save as debug_env.py
import torch
import transformers
import platform

print(f"Python: {platform.python_version()}")
print(f"PyTorch: {torch.__version__}")
print(f"Transformers: {transformers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")

Prevention is Better

Most issues can be prevented by:

  - Starting with default configurations
  - Testing on small data subsets first
  - Monitoring resource usage
  - Keeping dependencies updated