Datasets for NER Training¶
Overview¶
Mistral NER supports 9 diverse datasets out of the box, ranging from traditional NER benchmarks to specialized PII detection datasets. The system provides a unified interface for loading, preprocessing, and mixing multiple datasets, enabling you to train models on combined data sources for improved generalization.
Dataset Catalog¶
Traditional NER Datasets¶
The traditional benchmarks referenced throughout this page include CoNLL-2003, OntoNotes, WNUT-17, Few-NERD, and WikiNER; see the configuration examples below for their config names and options.
PII Detection Datasets¶
The PII-focused datasets referenced on this page are gretel_pii, ai4privacy, and mendeley_pii.
Multi-Dataset Training¶
One of Mistral NER's most powerful features is the ability to train on multiple datasets simultaneously, leveraging diverse data sources for better generalization.
Mixing Strategies¶
1. Concatenation (Default)¶
Simply combines all datasets into one large training set.
Use when: Datasets have similar characteristics and you want maximum data.
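A minimal config (concatenation is the default, so the explicit mixing_strategy line is optional):
data:
  dataset_configs:
    - name: conll2003
    - name: ontonotes
  mixing_strategy: concatenate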
2. Interleaving¶
Alternates between datasets during training, ensuring balanced exposure.
data:
  dataset_configs:
    - name: conll2003
    - name: gretel_pii
  mixing_strategy: interleave
  interleave_probs: [0.7, 0.3]  # 70% CoNLL, 30% Gretel
Use when: Datasets have different sizes and you want balanced training.
3. Weighted Sampling¶
Samples from datasets based on specified weights.
data:
  dataset_configs:
    - name: conll2003
      weight: 2.0
    - name: wnut17
      weight: 1.0
    - name: ai4privacy
      weight: 1.5
  mixing_strategy: weighted
  sampling_temperature: 1.0  # Controls randomness
Use when: You want fine control over dataset contribution.
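The temperature reshapes the weights before sampling. A minimal sketch of the common convention (the exact formula inside DatasetMixer may differ), where each dataset's sampling probability is proportional to weight ** (1 / temperature):
import numpy as np

# Hypothetical illustration: p_i ∝ w_i ** (1 / T).
# T = 1.0 uses the weights as-is; T < 1 sharpens toward high-weight
# datasets (more deterministic); T > 1 flattens toward uniform mixing.
weights = np.array([2.0, 1.0, 1.5])  # conll2003, wnut17, ai4privacy

def sampling_probs(w: np.ndarray, temperature: float) -> np.ndarray:
    scaled = w ** (1.0 / temperature)
    return scaled / scaled.sum()

print(sampling_probs(weights, 1.0))  # ≈ [0.44, 0.22, 0.33], weights as-is
print(sampling_probs(weights, 0.5))  # ≈ [0.55, 0.14, 0.31], sharper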
Label Mapping System¶
Mistral NER provides a flexible label mapping system to unify different label schemas across datasets. This is essential when combining datasets with different naming conventions or when you want to merge certain entity types.
Configuration Methods¶
1. Profile-Based Mapping (Recommended)¶
Use predefined mapping profiles for common use cases:
data:
  multi_dataset:
    enabled: true
    dataset_names: ["conll2003", "ontonotes", "gretel_pii", "ai4privacy"]
    label_mapping_profile: "bank_pii"  # Predefined profile
Available profiles:
- bank_pii: Maps all location types to ADDR for banking/financial PII detection
- general: Preserves most entity distinctions (default behavior)
2. External Mapping Files¶
Define mappings in separate YAML files for better organization:
data:
  multi_dataset:
    label_mappings:
      conll2003: "conll2003_bank_pii.yaml"
      ontonotes: "ontonotes_custom.yaml"
      gretel_pii: "gretel_pii_custom.yaml"
Example mapping file (configs/mappings/conll2003_bank_pii.yaml):
# Map CoNLL-2003 labels to unified schema
O: O
B-PER: B-PER
I-PER: I-PER
B-ORG: B-ORG
I-ORG: I-ORG
B-LOC: B-ADDR # Map locations to addresses
I-LOC: I-ADDR
B-MISC: B-MISC
I-MISC: I-MISC
3. Inline Mapping¶
Define mappings directly in the configuration:
data:
  multi_dataset:
    label_mappings:
      conll2003:
        B-LOC: B-ADDR
        I-LOC: I-ADDR
      ontonotes:
        B-GPE: B-ADDR
        I-GPE: I-ADDR
        B-FAC: B-ADDR
        I-FAC: I-ADDR
Label Unification Example¶
When combining datasets, label conflicts are resolved using your chosen mapping:
graph LR
    subgraph "Dataset 1 (CoNLL)"
        A1[PER] --> U1[PER]
        A2[ORG] --> U2[ORG]
        A3[LOC] --> U3[ADDR]
    end
    subgraph "Dataset 2 (OntoNotes)"
        B1[PERSON] --> U1
        B2[ORG] --> U2
        B3[GPE] --> U3
        B4[FAC] --> U3
    end
    subgraph "Unified Schema"
        U1[PER - Person]
        U2[ORG - Organization]
        U3[ADDR - All Locations]
    end
Creating Custom Mapping Profiles¶
Add new profiles to src/datasets/mapping_profiles.py:
class MappingProfiles:
    # Custom profile for medical domain
    MEDICAL = {
        "conll2003": {
            "B-PER": "B-PATIENT",
            "I-PER": "I-PATIENT",
            "B-ORG": "B-PROVIDER",
            "I-ORG": "I-PROVIDER",
            # ... more mappings
        },
        "ontonotes": {
            # ... dataset-specific mappings
        },
    }
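The new profile can then be selected by name in your config (a sketch; this assumes the registry exposes MEDICAL under the lowercase key "medical"):
data:
  multi_dataset:
    enabled: true
    label_mapping_profile: "medical"  # hypothetical key for the MEDICAL profile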
Configuration Examples¶
Example 1: Combined Traditional NER¶
data:
  dataset_configs:
    - name: conll2003
      split: train
    - name: ontonotes
      split: train
      subset: english_v4
    - name: wnut17
      split: train
  mixing_strategy: interleave
  max_length: 256
  label_all_tokens: false
Example 2: PII Detection Focus¶
data:
  dataset_configs:
    - name: gretel_pii
      weight: 2.0
    - name: ai4privacy
      weight: 2.0
    - name: mendeley_pii
      weight: 1.0
    - name: conll2003  # For general NER capability
      weight: 0.5
  mixing_strategy: weighted
  sampling_temperature: 0.5  # More deterministic sampling
Example 3: Comprehensive Model¶
data:
  dataset_configs:
    - name: conll2003
    - name: ontonotes
    - name: fewnerd
      label_type: coarse
    - name: gretel_pii
    - name: ai4privacy
  mixing_strategy: concatenate
  # Unified label schema will be created automatically
Adding Custom Datasets¶
Step 1: Create Dataset Loader¶
Create a new loader in src/datasets/loaders/:
from typing import List

from datasets import Dataset, load_dataset

from ..base import BaseNERDataset


class MyCustomDataset(BaseNERDataset):
    """Custom dataset loader."""

    LABEL_MAPPING = {
        "person": "PERSON",
        "place": "LOCATION",
        "company": "ORGANIZATION",
        # Add your mappings
    }

    def load(self) -> Dataset:
        """Load the dataset."""
        # Option 1: Load from HuggingFace
        dataset = load_dataset("username/dataset-name", split=self.split)
        # Option 2: Load from local files
        # dataset = self._load_from_files("path/to/data")
        return dataset

    def get_labels(self) -> List[str]:
        """Return the list of labels."""
        return ["O", "B-PERSON", "I-PERSON", "B-LOCATION", ...]
Step 2: Register Dataset¶
Add to the registry in src/datasets/registry.py:
from .loaders.my_custom import MyCustomDataset

# In DatasetRegistry._register_default_loaders()
self.register("my_custom", MyCustomDataset)
Step 3: Use in Configuration¶
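Reference the registered name in dataset_configs like any built-in dataset:
data:
  dataset_configs:
    - name: my_custom
      split: train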
Dataset Statistics and Analysis¶
Understanding Your Data¶
Use the built-in analysis tools:
from src.datasets import DatasetRegistry, DatasetMixer

# Load datasets
registry = DatasetRegistry()
datasets = [
    registry.get_loader("conll2003").load(),
    registry.get_loader("gretel_pii").load(),
]

# Analyze
mixer = DatasetMixer(strategy="concatenate")
mixed_data = mixer.mix(datasets)

# Get statistics
print(f"Total examples: {len(mixed_data)}")
print(f"Label distribution: {mixer.get_label_distribution()}")
print(f"Average sequence length: {mixer.get_avg_sequence_length()}")
Visualization¶
The system provides visualization tools for dataset analysis:
from src.visualization import plot_label_distribution, plot_dataset_comparison

# Visualize label distribution
plot_label_distribution(train_dataset, save_path="label_dist.png")

# Compare datasets
plot_dataset_comparison(
    datasets={"CoNLL": conll_data, "OntoNotes": onto_data},
    save_path="dataset_comparison.png",
)
Best Practices¶
1. Start Simple¶
Begin with a single dataset to establish baseline performance:
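data:
  dataset_configs:
    - name: conll2003  # single well-understood benchmark as the baseline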
2. Add Gradually¶
Add datasets one at a time, monitoring performance:
# Iteration 1: Baseline
- name: conll2003

# Iteration 2: Add similar dataset
- name: conll2003
- name: ontonotes

# Iteration 3: Add specialized data
- name: conll2003
- name: ontonotes
- name: gretel_pii
3. Balance Dataset Sizes¶
Use weights or interleaving to prevent large datasets from dominating:
data:
  dataset_configs:
    - name: large_dataset
      weight: 1.0
    - name: small_dataset
      weight: 5.0  # Upweight smaller dataset
4. Monitor Label Distribution¶
Ensure rare labels are adequately represented:
# Check label distribution after mixing
label_counts = mixer.get_label_distribution()
total = sum(label_counts.values())
for label, count in sorted(label_counts.items()):
    print(f"{label}: {count} ({count/total*100:.1f}%)")
Performance Considerations¶
Memory Usage¶
Different datasets have different memory footprints:
| Dataset    | Approx. Memory | Tokens/Example |
|------------|----------------|----------------|
| CoNLL-2003 | 500 MB         | 15-20          |
| OntoNotes  | 1.5 GB         | 20-30          |
| Few-NERD   | 2.0 GB         | 25-35          |
| WikiNER    | 3.0 GB+        | 20-25          |
Loading Time¶
Use caching to speed up repeated loads:
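A minimal sketch, assuming the loaders go through the HuggingFace datasets library (which caches downloads and preprocessing automatically; pointing cache_dir at fast local storage reuses that work across runs):
from datasets import load_dataset

# The first call downloads and preprocesses; later calls hit the cache.
dataset = load_dataset("conll2003", split="train", cache_dir="/fast-disk/hf_cache")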
Batch Composition¶
With multiple datasets, consider batch composition:
training:
  per_device_train_batch_size: 8
  gradient_accumulation_steps: 4
  # Effective batch size: 32
  # Ensures good representation from each dataset
Troubleshooting¶
Issue: Label mismatch errors¶
Solution: Check label mappings are consistent:
# Debug label mappings
for dataset_config in config.dataset_configs:
    loader = registry.get_loader(dataset_config.name)
    print(f"{dataset_config.name}: {loader.get_labels()}")
Issue: Memory overflow with multiple datasets¶
Solution: Use streaming or reduce dataset sizes:
data:
  dataset_configs:
    - name: large_dataset
      max_examples: 10000  # Limit examples
      streaming: true  # Use streaming mode
Issue: Imbalanced training with mixed datasets¶
Solution: Adjust mixing strategy:
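For example, switch from concatenation to interleaving with explicit probabilities:
data:
  dataset_configs:
    - name: large_dataset
    - name: small_dataset
  mixing_strategy: interleave
  interleave_probs: [0.5, 0.5]  # equal exposure regardless of dataset size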
Next Steps¶
Optimize Your Training
Now that you understand datasets, explore Hyperparameter Tuning to find the best configuration for your data combination.