# Troubleshooting Guide

This guide covers common issues and their solutions when using Mistral NER.
## Installation Issues

### CUDA Not Available

**Symptom:** `torch.cuda.is_available()` returns `False`.

**Solutions:**

1. Verify your CUDA installation (e.g. run `nvidia-smi`).
2. Install the PyTorch build that matches your CUDA version.
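The two steps above can be sketched as shell commands; the `cu121` wheel tag below is an assumption, so pick the index that matches the CUDA version your driver reports:

```shell
# 1. Confirm the driver and the highest CUDA version it supports
nvidia-smi

# 2. Reinstall PyTorch against a matching CUDA build
#    (cu121 is an example tag; see pytorch.org for the right one)
pip install --force-reinstall torch --index-url https://download.pytorch.org/whl/cu121
```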
### BitsAndBytes Installation Failed

**Symptom:** Errors when installing or importing `bitsandbytes`.

**Solutions:**

1. Install (or reinstall) from source.
2. On Windows, where official support is limited, consider WSL or a community build.
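A hedged sketch of the source-install path (exact build steps vary by `bitsandbytes` version; check the repository README for current instructions):

```shell
# Install from source
git clone https://github.com/bitsandbytes-foundation/bitsandbytes.git
cd bitsandbytes && pip install .

# On Windows: prefer WSL, or a prebuilt community wheel matching your CUDA version
```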
## Training Issues

### Out of Memory (OOM) Errors

**Symptom:** `CUDA out of memory` errors during training.

**Solutions:**

1. Enable quantization.
2. Reduce the batch size (and raise gradient accumulation to compensate).
3. Enable gradient checkpointing.
4. Clear the CUDA cache periodically.
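The batch-size lever is the easiest to reason about: halve the per-device batch and double gradient accumulation, and the effective batch the optimizer sees is unchanged while peak activation memory drops. A minimal sketch (the function name is illustrative, not project API):

```python
def effective_batch_size(per_device: int, accum_steps: int, n_gpus: int = 1) -> int:
    """Effective batch = per-device batch x accumulation steps x GPUs."""
    return per_device * accum_steps * n_gpus

# 16x1, 8x2, and 4x4 all optimize with the same effective batch of 16,
# but peak activation memory scales with the per-device batch only.
print(effective_batch_size(4, 4))  # -> 16

# For "clear cache periodically": torch.cuda.empty_cache() between
# evaluation passes releases cached blocks back to the driver.
```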
### Training Extremely Slow

**Symptom:** Each epoch takes hours.

**Solutions:**

1. Enable mixed precision.
2. Check data loading (worker count, preprocessing bottlenecks).
3. Disable unnecessary logging.
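The usual levers, sketched as a settings dict; the key names here are illustrative, so map them onto the project's actual config schema:

```python
# Hypothetical speed-related settings (key names are assumptions)
speed_settings = {
    "fp16": True,                   # mixed precision: large speedup on tensor-core GPUs
    "dataloader_num_workers": 4,    # load batches in parallel, off the main process
    "dataloader_pin_memory": True,  # faster host-to-GPU copies
    "logging_steps": 100,           # per-step logging adds measurable overhead
}
```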
### Model Not Learning (Loss Not Decreasing)

**Symptom:** Loss plateaus or increases.

**Solutions:**

1. Adjust the learning rate.
2. Change the loss function (e.g. focal loss for imbalanced labels).
3. Check data quality.
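On the loss-function point: focal loss down-weights easy examples so the abundant "O" tokens stop dominating the gradient. A per-token sketch of the math, not the project's own implementation:

```python
import math

def focal_loss(p_true: float, gamma: float = 2.0) -> float:
    """Focal loss for one token, given the predicted prob of the true label.

    gamma = 0 recovers plain cross-entropy; larger gamma focuses
    training on hard (low-probability) examples.
    """
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

# An easy token (p=0.95) contributes far less than a hard one (p=0.3)
print(focal_loss(0.95), focal_loss(0.3))
```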
## Quantization Issues

### Quantization Not Working

**Symptom:** The model loads in full precision despite quantization settings.

**Solutions:**

1. Check the `bitsandbytes` installation.
2. Verify the configuration.
3. Check GPU compatibility: quantization requires a GPU with compute capability >= 3.5; run `nvidia-smi` to identify your GPU model.
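A small diagnostic covering those checks; it assumes nothing about the project itself and degrades gracefully when the libraries are absent:

```python
def quantization_ready() -> tuple[bool, str]:
    """Report whether 8-/4-bit loading has a chance of working here."""
    try:
        import bitsandbytes  # noqa: F401
    except Exception as exc:  # ImportError, or CUDA setup errors at import time
        return False, f"bitsandbytes not importable: {exc}"
    try:
        import torch
    except ImportError as exc:
        return False, f"torch not importable: {exc}"
    if not torch.cuda.is_available():
        return False, "no CUDA device visible"
    major, minor = torch.cuda.get_device_capability(0)
    if major < 3 or (major == 3 and minor < 5):
        return False, "GPU compute capability below 3.5"
    return True, "ok"

print(quantization_ready())
```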
### 8-bit vs 4-bit Quantization

**When to use each:**

- **8-bit:** better accuracy, ~10 GB VRAM
- **4-bit:** more memory efficient, ~6 GB VRAM
## Inference Issues

### Predictions All "O" (No Entities)

**Symptom:** The model only predicts the non-entity label.

**Solutions:**

1. Check the threshold if you use confidence filtering.
2. Verify the model loaded correctly (weights and label map).
3. Try a different checkpoint.
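For the "verify model loaded correctly" step, one cheap check is whether the checkpoint's label map survived loading, since a collapsed `id2label` guarantees all-"O" output. A sketch (the dict shape follows the common Hugging Face `id2label` convention, which is an assumption here):

```python
def label_map_degenerate(id2label: dict[int, str]) -> bool:
    """True if the label map cannot express any entity."""
    labels = set(id2label.values())
    return len(labels) < 2 or labels == {"O"}

print(label_map_degenerate({0: "O", 1: "B-PER", 2: "I-PER"}))  # -> False (healthy)
print(label_map_degenerate({0: "LABEL_0", 1: "LABEL_0"}))      # -> True (collapsed)
```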
### Tokenization Misalignment

**Symptom:** Entity boundaries don't match the original text.

**Solutions:**

1. Enable the proper tokenizer settings (a fast tokenizer, so word/offset mappings are available).
2. Check the `max_length` setting: long sentences may be truncated mid-entity.
## WandB Issues

### WandB Offline Mode Not Working

**Symptom:** Runs are not syncing after coming back online.

**Solutions:**

1. Sync manually.
2. Check the environment (`WANDB_MODE`).

See WandB Offline Mode for a detailed guide.
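`wandb sync` is the manual path; a sketch of the usual commands (the run-directory name below is illustrative):

```shell
# Sync a specific offline run directory
wandb sync wandb/offline-run-20240101_120000-abc123

# Or sync everything that hasn't been uploaded yet
wandb sync --sync-all

# Make sure offline mode isn't still forced by the environment
echo "WANDB_MODE=$WANDB_MODE"
```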
## Multi-Dataset Issues

### Label Mismatch Errors

**Symptom:** Errors about incompatible label sets when mixing datasets.

**Solutions:**

1. Check the label mappings of each dataset.
2. Map everything onto a unified label schema.
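A unified schema is just a per-dataset mapping onto one shared tag set. A sketch with illustrative tag names (your datasets' actual tags, and the project's real schema, will differ):

```python
# Shared target schema (CoNLL-style tags, assumed here for illustration)
UNIFIED = {"O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"}

# Per-dataset mapping into the shared schema
DATASET_A = {"PERSON": "B-PER", "ORG": "B-ORG", "GPE": "B-LOC"}

def remap(tags: list[str], mapping: dict[str, str]) -> list[str]:
    """Translate one dataset's tags; unknown tags fall back to 'O'."""
    return [mapping.get(t, t if t in UNIFIED else "O") for t in tags]

print(remap(["PERSON", "O", "GPE", "DATE"], DATASET_A))
# -> ['B-PER', 'O', 'B-LOC', 'O']
```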
## Performance Issues

### Low F1 Score

Common causes and solutions:

- **Class imbalance:** use focal loss
- **Insufficient training:** increase epochs
- **Poor hyperparameters:** use hyperopt
- **Dataset quality:** check annotation consistency
### Inconsistent Results

**Solutions:**

1. Set random seeds everywhere (Python, NumPy, PyTorch).
2. Disable non-deterministic ops.
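A seeding helper covering the usual RNG sources; the numpy/torch parts are guarded so the sketch stands alone. For the non-deterministic-ops point, `torch.use_deterministic_algorithms(True)` is the switch, but note it can slow training and raises on ops without deterministic kernels:

```python
import os
import random

def set_seed(seed: int = 42) -> None:
    """Seed every RNG source that is available in this environment."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:  # optional: only if numpy is installed
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:  # optional: only if torch is installed
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass

set_seed(42)
```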
## Debug Mode

Enable comprehensive debugging:
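If the project exposes its own debug flag, prefer that; failing that, cranking standard Python logging to DEBUG surfaces much of the same information. A generic stand-in, not the project's own switch:

```python
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
    force=True,  # override any handler configured earlier
)
# Library loggers are per-name; raising them explicitly is harmless
# even if the library isn't imported yet.
logging.getLogger("transformers").setLevel(logging.DEBUG)
```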
This will show:

- Detailed model loading info
- Tokenization examples
- Batch composition
- Memory usage
## Getting Help

If these solutions don't resolve your issue:

1. **Check logs:** look for error messages in `./logs/`
2. **Run validation:** `python scripts/validate.py`
3. **Create a minimal example:** isolate the problem
4. **Report the issue:** include your config, the error message, and environment info
## Environment Debugging

Collect system information:

```python
# save as debug_env.py
import platform

import torch
import transformers

print(f"Python: {platform.python_version()}")
print(f"PyTorch: {torch.__version__}")
print(f"Transformers: {transformers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```
## Prevention Is Better

Most issues can be prevented by:

- Starting with default configurations
- Testing on small data subsets first
- Monitoring resource usage
- Keeping dependencies updated