
Datasets for NER Training


Overview

Mistral NER supports 9 diverse datasets out of the box, ranging from traditional NER benchmarks to specialized PII detection datasets. The system provides a unified interface for loading, preprocessing, and mixing multiple datasets, enabling you to train models on combined data sources for improved generalization.

Dataset Catalog

Traditional NER Datasets

#### CoNLL-2003

The gold standard for NER evaluation, featuring news articles in multiple languages.

- **Size**: 22,137 sentences
- **Languages**: en, de, es, nl
- **Entity Types** (4): PER (Person), ORG (Organization), LOC (Location), MISC (Miscellaneous)

**Configuration**:

```yaml
data:
  dataset_configs:
    - name: conll2003
      split: train
      language: en  # Options: en, de, es, nl
```

**Example**:

```
[PER John Smith] works at [ORG Microsoft] in [LOC Seattle].
```
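Under the hood, annotations like the example above are stored as one BIO tag per token: multi-token entities get `B-` on their first token and `I-` on the rest. A minimal sketch of the scheme (the `extract_entities` helper is illustrative, not part of the library):

```python
# BIO representation of the example sentence above.
tokens = ["John", "Smith", "works", "at", "Microsoft", "in", "Seattle", "."]
tags   = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "O"]

def extract_entities(tokens, tags):
    """Collect (entity_type, text) spans from a BIO-tagged sequence."""
    entities, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # start of a new entity
            if current:
                entities.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)      # continuation of the current entity
        else:                             # "O" or an inconsistent I- tag
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(etype, " ".join(words)) for etype, words in entities]

print(extract_entities(tokens, tags))
# [('PER', 'John Smith'), ('ORG', 'Microsoft'), ('LOC', 'Seattle')]
```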
#### OntoNotes 5.0

A large-scale multilingual dataset with fine-grained entity types.

- **Size**: 59,924 sentences
- **Domains**: multiple
- **Entity Types** (18): PERSON, ORG, GPE, DATE, CARDINAL, MONEY, PERCENT, TIME, LOC, FAC, NORP, EVENT, LAW, PRODUCT, QUANTITY, WORK_OF_ART, LANGUAGE, ORDINAL

**Configuration**:

```yaml
data:
  dataset_configs:
    - name: ontonotes
      split: train
      subset: english_v4  # or english_v12, chinese_v4, arabic_v4
```
#### WNUT-17 (Emerging Entities)

Focuses on unusual, previously unseen entities in social media.

- **Size**: 5,690 sentences
- **Domain**: social media
- **Entity Types** (6): person, location, corporation, product, creative-work, group

**Configuration**:

```yaml
data:
  dataset_configs:
    - name: wnut17
      split: train
```
#### Few-NERD

Fine-grained entity recognition with 66 entity types organized hierarchically.

- **Size**: 188,238 sentences
- **Entity Types**: 66 fine-grained, 8 coarse
- **Coarse Types**: Location, Person, Organization, Building, Art, Product, Event, Other

**Configuration**:

```yaml
data:
  dataset_configs:
    - name: fewnerd
      split: train
      label_type: coarse  # or fine
      sampling_strategy: inter  # or intra
```
#### WikiNER

Automatically annotated entities from Wikipedia articles.

- **Size**: large scale
- **Coverage**: multilingual
- **Entity Types** (3): PER, ORG, LOC

**Configuration**:

```yaml
data:
  dataset_configs:
    - name: wikiner
      split: train
      language: en  # Many languages available
```

PII Detection Datasets

#### Gretel AI Synthetic PII (Finance)

High-quality synthetic data for financial PII detection.

- **Size**: 105,411 records
- **Domain**: finance
- **PII Types** (29): credit_card, ssn, email, phone, address, account_number, routing_number, swift_code, and more

**Configuration**:

```yaml
data:
  dataset_configs:
    - name: gretel_pii
      split: train
```
#### AI4Privacy PII Masking

Comprehensive PII detection with 54 distinct classes.

- **Size**: ~65K examples
- **PII Classes** (54): NAME, EMAIL, PHONE, ADDRESS, SSN, CREDIT_CARD, DATE_OF_BIRTH, MEDICAL_RECORD, and many more

**Configuration**:

```yaml
data:
  dataset_configs:
    - name: ai4privacy
      split: train
```
#### Mendeley Synthetic PII

Large-scale synthetic PII dataset with realistic patterns.

- **Size**: 200K examples
- **Data Type**: synthetic
- **Coverage**: multi-domain

**Configuration**:

```yaml
data:
  dataset_configs:
    - name: mendeley_pii
      split: train
```
#### BigCode PII (Software)

Specialized for detecting PII in source code and technical documentation.

- **Domain**: software
- **Focus**: code-specific PII
- **Access**: gated (requires a HuggingFace authentication token)

**Configuration**:

```yaml
data:
  dataset_configs:
    - name: bigcode_pii
      split: train
      auth_token: ${HF_TOKEN}  # Set in environment
```

Multi-Dataset Training

One of Mistral NER's most powerful features is the ability to train on multiple datasets simultaneously, leveraging diverse data sources for better generalization.

Mixing Strategies

1. Concatenation (Default)

Simply combines all datasets into one large training set.

```yaml
data:
  dataset_configs:
    - name: conll2003
    - name: ontonotes
  mixing_strategy: concatenate
```

Use when: Datasets have similar characteristics and you want maximum data.

2. Interleaving

Alternates between datasets during training, ensuring balanced exposure.

```yaml
data:
  dataset_configs:
    - name: conll2003
    - name: gretel_pii
  mixing_strategy: interleave
  interleave_probs: [0.7, 0.3]  # 70% CoNLL, 30% Gretel
```

Use when: Datasets have different sizes and you want balanced training.
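The effect of `interleave_probs` can be modeled in a few lines: each draw first picks a source dataset according to the probabilities, then takes that dataset's next example. This is an illustrative sketch of the sampling behavior, not the project's actual implementation:

```python
import random

def interleave(datasets, probs, num_samples, seed=0):
    """Draw examples by picking a source dataset per `probs`,
    then taking that dataset's next example (cycling when exhausted)."""
    rng = random.Random(seed)
    positions = [0] * len(datasets)
    out = []
    for _ in range(num_samples):
        i = rng.choices(range(len(datasets)), weights=probs)[0]
        ds = datasets[i]
        out.append(ds[positions[i] % len(ds)])
        positions[i] += 1
    return out

# Two toy "datasets" standing in for CoNLL and Gretel examples
conll = [f"conll-{n}" for n in range(100)]
gretel = [f"gretel-{n}" for n in range(100)]
mixed = interleave([conll, gretel], probs=[0.7, 0.3], num_samples=1000)

# Roughly 70% of the mixed stream comes from the first dataset
frac = sum(x.startswith("conll") for x in mixed) / len(mixed)
```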

3. Weighted Sampling

Samples from datasets based on specified weights.

```yaml
data:
  dataset_configs:
    - name: conll2003
      weight: 2.0
    - name: wnut17
      weight: 1.0
    - name: ai4privacy
      weight: 1.5
  mixing_strategy: weighted
  sampling_temperature: 1.0  # Controls randomness
```

Use when: You want fine control over dataset contribution.
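A common convention for temperature-scaled sampling is `p_i ∝ w_i**(1/T)`: at `T = 1` datasets are sampled in proportion to their weights, while lower temperatures sharpen the distribution toward the heaviest dataset. The sketch below assumes that convention; check the source for the exact formula used:

```python
def sampling_probs(weights, temperature=1.0):
    """Temperature-scaled sampling probabilities: p_i ∝ w_i**(1/T).
    T → 0 concentrates mass on the largest weight; large T flattens."""
    scaled = [w ** (1.0 / temperature) for w in weights]
    total = sum(scaled)
    return [s / total for s in scaled]

# Weights from the config above: conll2003=2.0, wnut17=1.0, ai4privacy=1.5
print(sampling_probs([2.0, 1.0, 1.5], temperature=1.0))  # proportional
print(sampling_probs([2.0, 1.0, 1.5], temperature=0.5))  # sharper
```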

Label Mapping System

Mistral NER provides a flexible label mapping system to unify different label schemas across datasets. This is essential when combining datasets with different naming conventions or when you want to merge certain entity types.

Configuration Methods

1. Predefined Profiles

Use predefined mapping profiles for common use cases:

```yaml
data:
  multi_dataset:
    enabled: true
    dataset_names: ["conll2003", "ontonotes", "gretel_pii", "ai4privacy"]
    label_mapping_profile: "bank_pii"  # Predefined profile
```

Available profiles:

- bank_pii: Maps all location types to ADDR for banking/financial PII detection
- general: Preserves most entity distinctions (default behavior)

2. External Mapping Files

Define mappings in separate YAML files for better organization:

```yaml
data:
  multi_dataset:
    label_mappings:
      conll2003: "conll2003_bank_pii.yaml"
      ontonotes: "ontonotes_custom.yaml"
      gretel_pii: "gretel_pii_custom.yaml"
```

Example mapping file (configs/mappings/conll2003_bank_pii.yaml):

```yaml
# Map CoNLL-2003 labels to unified schema
O: O
B-PER: B-PER
I-PER: I-PER
B-ORG: B-ORG
I-ORG: I-ORG
B-LOC: B-ADDR  # Map locations to addresses
I-LOC: I-ADDR
B-MISC: B-MISC
I-MISC: I-MISC
```

3. Inline Mapping

Define mappings directly in the configuration:

```yaml
data:
  multi_dataset:
    label_mappings:
      conll2003:
        B-LOC: B-ADDR
        I-LOC: I-ADDR
      ontonotes:
        B-GPE: B-ADDR
        I-GPE: I-ADDR
        B-FAC: B-ADDR
        I-FAC: I-ADDR
```

Label Unification Example

When combining datasets, label conflicts are resolved using your chosen mapping:

```mermaid
graph LR
    subgraph "Dataset 1 (CoNLL)"
        A1[PER] --> U1[PER]
        A2[ORG] --> U2[ORG]
        A3[LOC] --> U3[ADDR]
    end

    subgraph "Dataset 2 (OntoNotes)"
        B1[PERSON] --> U1
        B2[ORG] --> U2
        B3[GPE] --> U3
        B4[FAC] --> U3
    end

    subgraph "Unified Schema"
        U1[PER - Person]
        U2[ORG - Organization]
        U3[ADDR - All Locations]
    end
```
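In code, applying such a mapping is a per-tag lookup performed before the schemas are merged — a minimal sketch using the inline CoNLL mapping shown earlier (the `remap_labels` helper is illustrative, not the library's API):

```python
def remap_labels(tags, mapping):
    """Replace each BIO tag via the mapping; unmapped tags pass through."""
    return [mapping.get(tag, tag) for tag in tags]

# Inline mapping for conll2003 from the configuration above
conll_mapping = {"B-LOC": "B-ADDR", "I-LOC": "I-ADDR"}

tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
print(remap_labels(tags, conll_mapping))
# ['B-PER', 'I-PER', 'O', 'B-ADDR', 'I-ADDR']
```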

Creating Custom Mapping Profiles

Add new profiles to src/datasets/mapping_profiles.py:

```python
class MappingProfiles:
    # Custom profile for medical domain
    MEDICAL = {
        "conll2003": {
            "B-PER": "B-PATIENT",
            "I-PER": "I-PATIENT",
            "B-ORG": "B-PROVIDER",
            "I-ORG": "I-PROVIDER",
            # ... more mappings
        },
        "ontonotes": {
            # ... dataset-specific mappings
        }
    }
```

Configuration Examples

Example 1: Combined Traditional NER

```yaml
data:
  dataset_configs:
    - name: conll2003
      split: train
    - name: ontonotes
      split: train
      subset: english_v4
    - name: wnut17
      split: train
  mixing_strategy: interleave
  max_length: 256
  label_all_tokens: false
```

Example 2: PII Detection Focus

```yaml
data:
  dataset_configs:
    - name: gretel_pii
      weight: 2.0
    - name: ai4privacy
      weight: 2.0
    - name: mendeley_pii
      weight: 1.0
    - name: conll2003  # For general NER capability
      weight: 0.5
  mixing_strategy: weighted
  sampling_temperature: 0.5  # More deterministic sampling
```

Example 3: Comprehensive Model

```yaml
data:
  dataset_configs:
    - name: conll2003
    - name: ontonotes
    - name: fewnerd
      label_type: coarse
    - name: gretel_pii
    - name: ai4privacy
  mixing_strategy: concatenate
  # Unified label schema will be created automatically
```
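The automatic schema unification mentioned in the comment can be sketched as taking the union of each dataset's (already remapped) label sets, with `O` pinned first and the rest sorted so label-to-id assignment stays deterministic. Illustrative only; the actual merge logic lives in the library:

```python
def unify_label_schemas(label_lists):
    """Union of several label lists, 'O' first, remainder sorted."""
    unified = set().union(*[set(labels) for labels in label_lists])
    unified.discard("O")
    return ["O"] + sorted(unified)

conll = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
wnut = ["O", "B-person", "I-person", "B-location", "I-location"]
print(unify_label_schemas([conll, wnut]))
```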

Adding Custom Datasets

Step 1: Create Dataset Loader

Create a new loader in src/datasets/loaders/:

```python
from typing import List

from datasets import Dataset, load_dataset

from ..base import BaseNERDataset


class MyCustomDataset(BaseNERDataset):
    """Custom dataset loader."""

    LABEL_MAPPING = {
        "person": "PERSON",
        "place": "LOCATION",
        "company": "ORGANIZATION",
        # Add your mappings
    }

    def load(self) -> Dataset:
        """Load the dataset."""
        # Option 1: Load from HuggingFace
        dataset = load_dataset("username/dataset-name", split=self.split)

        # Option 2: Load from local files
        # dataset = self._load_from_files("path/to/data")

        return dataset

    def get_labels(self) -> List[str]:
        """Return the list of labels."""
        return ["O", "B-PERSON", "I-PERSON", "B-LOCATION", ...]
```

Step 2: Register Dataset

Add to the registry in src/datasets/registry.py:

```python
from .loaders.my_custom import MyCustomDataset

# In DatasetRegistry._register_default_loaders()
self.register("my_custom", MyCustomDataset)
```

Step 3: Use in Configuration

```yaml
data:
  dataset_configs:
    - name: my_custom
      split: train
      # Custom parameters if needed
```

Dataset Statistics and Analysis

Understanding Your Data

Use the built-in analysis tools:

```python
from src.datasets import DatasetRegistry, DatasetMixer

# Load datasets
registry = DatasetRegistry()
datasets = [
    registry.get_loader("conll2003").load(),
    registry.get_loader("gretel_pii").load()
]

# Analyze
mixer = DatasetMixer(strategy="concatenate")
mixed_data = mixer.mix(datasets)

# Get statistics
print(f"Total examples: {len(mixed_data)}")
print(f"Label distribution: {mixer.get_label_distribution()}")
print(f"Average sequence length: {mixer.get_avg_sequence_length()}")
```

Visualization

The system provides visualization tools for dataset analysis:

```python
from src.visualization import plot_label_distribution, plot_dataset_comparison

# Visualize label distribution
plot_label_distribution(train_dataset, save_path="label_dist.png")

# Compare datasets
plot_dataset_comparison(
    datasets={"CoNLL": conll_data, "OntoNotes": onto_data},
    save_path="dataset_comparison.png"
)
```

Best Practices

1. Start Simple

Begin with a single dataset to establish baseline performance:

```yaml
data:
  dataset_configs:
    - name: conll2003
```

2. Add Gradually

Add datasets one at a time, monitoring performance:

```yaml
# Iteration 1: Baseline
- name: conll2003

# Iteration 2: Add a similar dataset
- name: conll2003
- name: ontonotes

# Iteration 3: Add specialized data
- name: conll2003
- name: ontonotes
- name: gretel_pii
```

3. Balance Dataset Sizes

Use weights or interleaving to prevent large datasets from dominating:

```yaml
data:
  dataset_configs:
    - name: large_dataset
      weight: 1.0
    - name: small_dataset
      weight: 5.0  # Upweight smaller dataset
```
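One simple recipe for choosing such weights is to make them inversely proportional to dataset size, so each dataset contributes roughly equally per epoch. This helper is a sketch, not a built-in:

```python
def inverse_size_weights(sizes):
    """Weights proportional to 1/size, normalized so the largest
    dataset gets weight 1.0."""
    raw = [1.0 / s for s in sizes]
    smallest = min(raw)  # belongs to the largest dataset
    return [r / smallest for r in raw]

# e.g. large_dataset: 100k examples, small_dataset: 20k examples
print(inverse_size_weights([100_000, 20_000]))
```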

4. Monitor Label Distribution

Ensure rare labels are adequately represented:

```python
# Check label distribution after mixing
label_counts = mixer.get_label_distribution()
total = sum(label_counts.values())
for label, count in sorted(label_counts.items()):
    print(f"{label}: {count} ({count/total*100:.1f}%)")
```

Performance Considerations

Memory Usage

Different datasets have different memory footprints:

| Dataset    | Approx. Memory | Tokens/Example |
|------------|----------------|----------------|
| CoNLL-2003 | 500 MB         | 15-20          |
| OntoNotes  | 1.5 GB         | 20-30          |
| Few-NERD   | 2.0 GB         | 25-35          |
| WikiNER    | 3.0 GB+        | 20-25          |

Loading Time

Use caching to speed up repeated loads:

```yaml
data:
  cache_dir: ~/.cache/mistral_ner
  preprocessing_num_workers: 4
```

Batch Composition

With multiple datasets, consider batch composition:

```yaml
training:
  per_device_train_batch_size: 8
  gradient_accumulation_steps: 4
  # Effective batch size: 32
  # Ensures good representation from each dataset
```
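The arithmetic behind the effective-batch-size comment is just the product of per-device batch size, accumulation steps, and device count:

```python
def effective_batch_size(per_device: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Effective batch size = per-device batch × accumulation steps × devices."""
    return per_device * grad_accum_steps * num_devices

# The config above: 8 per device × 4 accumulation steps = 32
print(effective_batch_size(8, 4))
```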

Troubleshooting

Issue: Label mismatch errors

Solution: Check label mappings are consistent:

```python
# Debug label mappings
for dataset_config in config.dataset_configs:
    loader = registry.get_loader(dataset_config.name)
    print(f"{dataset_config.name}: {loader.get_labels()}")
```

Issue: Memory overflow with multiple datasets

Solution: Use streaming or reduce dataset sizes:

```yaml
data:
  dataset_configs:
    - name: large_dataset
      max_examples: 10000  # Limit examples
  streaming: true  # Use streaming mode
```

Issue: Imbalanced training with mixed datasets

Solution: Adjust mixing strategy:

```yaml
data:
  mixing_strategy: interleave
  interleave_probs: null  # Auto-calculate based on dataset sizes
```
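With `interleave_probs: null`, a natural fallback is size-proportional probabilities; that interpretation is an assumption here, so verify against the source. The calculation itself is simple:

```python
def size_proportional_probs(sizes):
    """Interleaving probabilities proportional to dataset sizes."""
    total = sum(sizes)
    return [s / total for s in sizes]

# e.g. two datasets with 8,000 and 2,000 examples
print(size_proportional_probs([8000, 2000]))
# [0.8, 0.2]
```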

Next Steps

Optimize Your Training

Now that you understand datasets, explore Hyperparameter Tuning to find the best configuration for your data combination.