
Datasets for NER Training


Overview

Mistral NER supports 9 diverse datasets out of the box, ranging from traditional NER benchmarks to specialized PII detection datasets. The system provides a unified interface for loading, preprocessing, and mixing multiple datasets, enabling you to train models on combined data sources for improved generalization.

Dataset Catalog

Traditional NER Datasets

#### CoNLL-2003

The gold standard for NER evaluation, featuring news articles in multiple languages.

- **Size**: 22,137 sentences
- **Languages**: en, de, es, nl
- **Entity Types** (4): PER (Person), ORG (Organization), LOC (Location), MISC (Miscellaneous)

**Configuration**:

```yaml
data:
  dataset_configs:
    - name: conll2003
      split: train
      language: en  # Options: en, de, es, nl
```

**Example**:

```
[PER John Smith] works at [ORG Microsoft] in [LOC Seattle].
```
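Under the hood, annotations like the example above are stored as one BIO tag per token: multi-token entities get `B-` on their first token and `I-` on the rest. A minimal sketch of the scheme (the `extract_entities` helper is illustrative, not part of the library):

```python
# BIO representation of the example sentence above.
tokens = ["John", "Smith", "works", "at", "Microsoft", "in", "Seattle", "."]
tags   = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "O"]

def extract_entities(tokens, tags):
    """Collect (entity_type, text) spans from a BIO-tagged sequence."""
    entities, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # start of a new entity
            if current:
                entities.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)      # continuation of the current entity
        else:                             # "O" or an inconsistent I- tag
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(etype, " ".join(words)) for etype, words in entities]

print(extract_entities(tokens, tags))
# [('PER', 'John Smith'), ('ORG', 'Microsoft'), ('LOC', 'Seattle')]
```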
#### OntoNotes 5.0

A large-scale multilingual dataset with fine-grained entity types.

- **Size**: 59,924 sentences
- **Domains**: multiple
- **Entity Types** (18): PERSON, ORG, GPE, DATE, CARDINAL, MONEY, PERCENT, TIME, LOC, FAC, NORP, EVENT, LAW, PRODUCT, QUANTITY, WORK_OF_ART, LANGUAGE, ORDINAL

**Configuration**:

```yaml
data:
  dataset_configs:
    - name: ontonotes
      split: train
      subset: english_v4  # or english_v12, chinese_v4, arabic_v4
```
#### WNUT-17 (Emerging Entities)

Focuses on unusual, previously unseen entities in social media.

- **Size**: 5,690 sentences
- **Domain**: social media
- **Entity Types** (6): person, location, corporation, product, creative-work, group

**Configuration**:

```yaml
data:
  dataset_configs:
    - name: wnut17
      split: train
```
#### Few-NERD

Fine-grained entity recognition with 66 entity types organized hierarchically.

- **Size**: 188,238 sentences
- **Entity Types**: 66 fine-grained, 8 coarse
- **Coarse Types**: Location, Person, Organization, Building, Art, Product, Event, Other

**Configuration**:

```yaml
data:
  dataset_configs:
    - name: fewnerd
      split: train
      label_type: coarse  # or fine
      sampling_strategy: inter  # or intra
```
#### WikiNER

Automatically annotated entities from Wikipedia articles.

- **Size**: large scale
- **Coverage**: multilingual
- **Entity Types** (3): PER, ORG, LOC

**Configuration**:

```yaml
data:
  dataset_configs:
    - name: wikiner
      split: train
      language: en  # Many languages available
```

PII Detection Datasets

#### Gretel AI Synthetic PII (Finance)

High-quality synthetic data for financial PII detection.

- **Size**: 105,411 records
- **Domain**: finance
- **PII Types** (29): credit_card, ssn, email, phone, address, account_number, routing_number, swift_code, and more

**Configuration**:

```yaml
data:
  dataset_configs:
    - name: gretel_pii
      split: train
```
#### AI4Privacy PII Masking

Comprehensive PII detection with 54 distinct classes.

- **Size**: ~65K examples
- **PII Classes** (54): NAME, EMAIL, PHONE, ADDRESS, SSN, CREDIT_CARD, DATE_OF_BIRTH, MEDICAL_RECORD, and many more

**Configuration**:

```yaml
data:
  dataset_configs:
    - name: ai4privacy
      split: train
```
#### Mendeley Synthetic PII

Large-scale synthetic PII dataset with realistic patterns.

- **Size**: 200K examples
- **Data Type**: synthetic
- **Coverage**: multi-domain

**Configuration**:

```yaml
data:
  dataset_configs:
    - name: mendeley_pii
      split: train
```
#### BigCode PII (Software)

Specialized for detecting PII in source code and technical documentation.

- **Domain**: software
- **Focus**: code-specific PII
- **Access**: gated (requires a HuggingFace authentication token)

**Configuration**:

```yaml
data:
  dataset_configs:
    - name: bigcode_pii
      split: train
      auth_token: ${HF_TOKEN}  # Set in environment
```

Multi-Dataset Training

One of Mistral NER's most powerful features is the ability to train on multiple datasets simultaneously, leveraging diverse data sources for better generalization.

Mixing Strategies

1. Concatenation (Default)

Simply combines all datasets into one large training set.

```yaml
data:
  dataset_configs:
    - name: conll2003
    - name: ontonotes
  mixing_strategy: concatenate
```

Use when: Datasets have similar characteristics and you want maximum data.

2. Interleaving

Alternates between datasets during training, ensuring balanced exposure.

```yaml
data:
  dataset_configs:
    - name: conll2003
    - name: gretel_pii
  mixing_strategy: interleave
  interleave_probs: [0.7, 0.3]  # 70% CoNLL, 30% Gretel
```

Use when: Datasets have different sizes and you want balanced training.
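The effect of `interleave_probs` can be modeled in a few lines: each draw first picks a source dataset according to the probabilities, then takes that dataset's next example. This is an illustrative sketch of the sampling behavior, not the project's actual implementation:

```python
import random

def interleave(datasets, probs, num_samples, seed=0):
    """Draw examples by picking a source dataset per `probs`,
    then taking that dataset's next example (cycling when exhausted)."""
    rng = random.Random(seed)
    positions = [0] * len(datasets)
    out = []
    for _ in range(num_samples):
        i = rng.choices(range(len(datasets)), weights=probs)[0]
        ds = datasets[i]
        out.append(ds[positions[i] % len(ds)])
        positions[i] += 1
    return out

# Two toy "datasets" standing in for CoNLL and Gretel examples
conll = [f"conll-{n}" for n in range(100)]
gretel = [f"gretel-{n}" for n in range(100)]
mixed = interleave([conll, gretel], probs=[0.7, 0.3], num_samples=1000)

# Roughly 70% of the mixed stream comes from the first dataset
frac = sum(x.startswith("conll") for x in mixed) / len(mixed)
```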

3. Weighted Sampling

Samples from datasets based on specified weights.

```yaml
data:
  dataset_configs:
    - name: conll2003
      weight: 2.0
    - name: wnut17
      weight: 1.0
    - name: ai4privacy
      weight: 1.5
  mixing_strategy: weighted
  sampling_temperature: 1.0  # Controls randomness
```

Use when: You want fine control over dataset contribution.
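A common convention for temperature-scaled sampling is `p_i ∝ w_i**(1/T)`: at `T = 1` datasets are sampled in proportion to their weights, while lower temperatures sharpen the distribution toward the heaviest dataset. The sketch below assumes that convention; check the source for the exact formula used:

```python
def sampling_probs(weights, temperature=1.0):
    """Temperature-scaled sampling probabilities: p_i ∝ w_i**(1/T).
    T → 0 concentrates mass on the largest weight; large T flattens."""
    scaled = [w ** (1.0 / temperature) for w in weights]
    total = sum(scaled)
    return [s / total for s in scaled]

# Weights from the config above: conll2003=2.0, wnut17=1.0, ai4privacy=1.5
print(sampling_probs([2.0, 1.0, 1.5], temperature=1.0))  # proportional
print(sampling_probs([2.0, 1.0, 1.5], temperature=0.5))  # sharper
```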

Label Mapping System

Mistral NER provides a flexible label mapping system to unify different label schemas across datasets. This is essential when combining datasets with different naming conventions or when you want to merge certain entity types.

Configuration Methods

1. Predefined Profiles

Use predefined mapping profiles for common use cases:

```yaml
data:
  multi_dataset:
    enabled: true
    dataset_names: ["conll2003", "ontonotes", "gretel_pii", "ai4privacy"]
    label_mapping_profile: "bank_pii"  # Predefined profile
```

Available profiles:

- bank_pii: Maps all location types to ADDR for banking/financial PII detection
- general: Preserves most entity distinctions (default behavior)

2. External Mapping Files

Define mappings in separate YAML files for better organization:

```yaml
data:
  multi_dataset:
    label_mappings:
      conll2003: "conll2003_bank_pii.yaml"
      ontonotes: "ontonotes_custom.yaml"
      gretel_pii: "gretel_pii_custom.yaml"
```

Example mapping file (configs/mappings/conll2003_bank_pii.yaml):

```yaml
# Map CoNLL-2003 labels to unified schema
O: O
B-PER: B-PER
I-PER: I-PER
B-ORG: B-ORG
I-ORG: I-ORG
B-LOC: B-ADDR  # Map locations to addresses
I-LOC: I-ADDR
B-MISC: B-MISC
I-MISC: I-MISC
```

3. Inline Mapping

Define mappings directly in the configuration:

```yaml
data:
  multi_dataset:
    label_mappings:
      conll2003:
        B-LOC: B-ADDR
        I-LOC: I-ADDR
      ontonotes:
        B-GPE: B-ADDR
        I-GPE: I-ADDR
        B-FAC: B-ADDR
        I-FAC: I-ADDR
```

Label Unification Example

When combining datasets, label conflicts are resolved using your chosen mapping:

```mermaid
graph LR
    subgraph "Dataset 1 (CoNLL)"
        A1[PER] --> U1[PER]
        A2[ORG] --> U2[ORG]
        A3[LOC] --> U3[ADDR]
    end

    subgraph "Dataset 2 (OntoNotes)"
        B1[PERSON] --> U1
        B2[ORG] --> U2
        B3[GPE] --> U3
        B4[FAC] --> U3
    end

    subgraph "Unified Schema"
        U1[PER - Person]
        U2[ORG - Organization]
        U3[ADDR - All Locations]
    end
```
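In code, applying such a mapping is a per-tag lookup performed before the schemas are merged — a minimal sketch using the inline CoNLL mapping shown earlier (the `remap_labels` helper is illustrative, not the library's API):

```python
def remap_labels(tags, mapping):
    """Replace each BIO tag via the mapping; unmapped tags pass through."""
    return [mapping.get(tag, tag) for tag in tags]

# Inline mapping for conll2003 from the configuration above
conll_mapping = {"B-LOC": "B-ADDR", "I-LOC": "I-ADDR"}

tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
print(remap_labels(tags, conll_mapping))
# ['B-PER', 'I-PER', 'O', 'B-ADDR', 'I-ADDR']
```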

Creating Custom Mapping Profiles

Add new profiles to src/datasets/mapping_profiles.py:

```python
class MappingProfiles:
    # Custom profile for medical domain
    MEDICAL = {
        "conll2003": {
            "B-PER": "B-PATIENT",
            "I-PER": "I-PATIENT",
            "B-ORG": "B-PROVIDER",
            "I-ORG": "I-PROVIDER",
            # ... more mappings
        },
        "ontonotes": {
            # ... dataset-specific mappings
        }
    }
```

Configuration Examples

Example 1: Combined Traditional NER

```yaml
data:
  dataset_configs:
    - name: conll2003
      split: train
    - name: ontonotes
      split: train
      subset: english_v4
    - name: wnut17
      split: train
  mixing_strategy: interleave
  max_length: 256
  label_all_tokens: false
```

Example 2: PII Detection Focus

```yaml
data:
  dataset_configs:
    - name: gretel_pii
      weight: 2.0
    - name: ai4privacy
      weight: 2.0
    - name: mendeley_pii
      weight: 1.0
    - name: conll2003  # For general NER capability
      weight: 0.5
  mixing_strategy: weighted
  sampling_temperature: 0.5  # More deterministic sampling
```

Example 3: Comprehensive Model

```yaml
data:
  dataset_configs:
    - name: conll2003
    - name: ontonotes
    - name: fewnerd
      label_type: coarse
    - name: gretel_pii
    - name: ai4privacy
  mixing_strategy: concatenate
  # Unified label schema will be created automatically
```
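The automatic schema unification mentioned in the comment can be sketched as taking the union of each dataset's (already remapped) label sets, with `O` pinned first and the rest sorted so label-to-id assignment stays deterministic. Illustrative only; the actual merge logic lives in the library:

```python
def unify_label_schemas(label_lists):
    """Union of several label lists, 'O' first, remainder sorted."""
    unified = set().union(*[set(labels) for labels in label_lists])
    unified.discard("O")
    return ["O"] + sorted(unified)

conll = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
wnut = ["O", "B-person", "I-person", "B-location", "I-location"]
print(unify_label_schemas([conll, wnut]))
```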

Adding Custom Datasets

Step 1: Create Dataset Loader

Create a new loader in src/datasets/loaders/:

```python
from typing import List

from datasets import Dataset, load_dataset

from ..base import BaseNERDataset


class MyCustomDataset(BaseNERDataset):
    """Custom dataset loader."""

    LABEL_MAPPING = {
        "person": "PERSON",
        "place": "LOCATION",
        "company": "ORGANIZATION",
        # Add your mappings
    }

    def load(self) -> Dataset:
        """Load the dataset."""
        # Option 1: Load from HuggingFace
        dataset = load_dataset("username/dataset-name", split=self.split)

        # Option 2: Load from local files
        # dataset = self._load_from_files("path/to/data")

        return dataset

    def get_labels(self) -> List[str]:
        """Return the list of labels."""
        return ["O", "B-PERSON", "I-PERSON", "B-LOCATION", ...]
```

Step 2: Register Dataset

Add to the registry in src/datasets/registry.py:

```python
from .loaders.my_custom import MyCustomDataset

# In DatasetRegistry._register_default_loaders()
self.register("my_custom", MyCustomDataset)
```

Step 3: Use in Configuration

```yaml
data:
  dataset_configs:
    - name: my_custom
      split: train
      # Custom parameters if needed
```

Dataset Statistics and Analysis

Understanding Your Data

Use the built-in analysis tools:

```python
from src.datasets import DatasetRegistry, DatasetMixer

# Load datasets
registry = DatasetRegistry()
datasets = [
    registry.get_loader("conll2003").load(),
    registry.get_loader("gretel_pii").load()
]

# Analyze
mixer = DatasetMixer(strategy="concatenate")
mixed_data = mixer.mix(datasets)

# Get statistics
print(f"Total examples: {len(mixed_data)}")
print(f"Label distribution: {mixer.get_label_distribution()}")
print(f"Average sequence length: {mixer.get_avg_sequence_length()}")
```

Visualization

The system provides visualization tools for dataset analysis:

```python
from src.visualization import plot_label_distribution, plot_dataset_comparison

# Visualize label distribution
plot_label_distribution(train_dataset, save_path="label_dist.png")

# Compare datasets
plot_dataset_comparison(
    datasets={"CoNLL": conll_data, "OntoNotes": onto_data},
    save_path="dataset_comparison.png"
)
```

Best Practices

1. Start Simple

Begin with a single dataset to establish baseline performance:

```yaml
data:
  dataset_configs:
    - name: conll2003
```

2. Add Gradually

Add datasets one at a time, monitoring performance:

```yaml
# Iteration 1: Baseline
- name: conll2003

# Iteration 2: Add a similar dataset
- name: conll2003
- name: ontonotes

# Iteration 3: Add specialized data
- name: conll2003
- name: ontonotes
- name: gretel_pii
```

3. Balance Dataset Sizes

Use weights or interleaving to prevent large datasets from dominating:

```yaml
data:
  dataset_configs:
    - name: large_dataset
      weight: 1.0
    - name: small_dataset
      weight: 5.0  # Upweight smaller dataset
```
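One simple recipe for choosing such weights is to make them inversely proportional to dataset size, so each dataset contributes roughly equally per epoch. This helper is a sketch, not a built-in:

```python
def inverse_size_weights(sizes):
    """Weights proportional to 1/size, normalized so the largest
    dataset gets weight 1.0."""
    raw = [1.0 / s for s in sizes]
    smallest = min(raw)  # belongs to the largest dataset
    return [r / smallest for r in raw]

# e.g. large_dataset: 100k examples, small_dataset: 20k examples
print(inverse_size_weights([100_000, 20_000]))
```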

4. Monitor Label Distribution

Ensure rare labels are adequately represented:

```python
# Check label distribution after mixing
label_counts = mixer.get_label_distribution()
total = sum(label_counts.values())
for label, count in sorted(label_counts.items()):
    print(f"{label}: {count} ({count/total*100:.1f}%)")
```

Performance Considerations

Memory Usage

Different datasets have different memory footprints:

| Dataset    | Approx. Memory | Tokens/Example |
|------------|----------------|----------------|
| CoNLL-2003 | 500 MB         | 15-20          |
| OntoNotes  | 1.5 GB         | 20-30          |
| Few-NERD   | 2.0 GB         | 25-35          |
| WikiNER    | 3.0 GB+        | 20-25          |

Loading Time

Use caching to speed up repeated loads:

```yaml
data:
  cache_dir: ~/.cache/mistral_ner
  preprocessing_num_workers: 4
```

Batch Composition

With multiple datasets, consider batch composition:

```yaml
training:
  per_device_train_batch_size: 8
  gradient_accumulation_steps: 4
  # Effective batch size: 32
  # Ensures good representation from each dataset
```
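The arithmetic behind the effective-batch-size comment is just the product of per-device batch size, accumulation steps, and device count:

```python
def effective_batch_size(per_device: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Effective batch size = per-device batch × accumulation steps × devices."""
    return per_device * grad_accum_steps * num_devices

# The config above: 8 per device × 4 accumulation steps = 32
print(effective_batch_size(8, 4))
```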

Troubleshooting

Issue: Label mismatch errors

Solution: Check label mappings are consistent:

```python
# Debug label mappings
for dataset_config in config.dataset_configs:
    loader = registry.get_loader(dataset_config.name)
    print(f"{dataset_config.name}: {loader.get_labels()}")
```

Issue: Memory overflow with multiple datasets

Solution: Use streaming or reduce dataset sizes:

```yaml
data:
  dataset_configs:
    - name: large_dataset
      max_examples: 10000  # Limit examples
  streaming: true  # Use streaming mode
```

Issue: Imbalanced training with mixed datasets

Solution: Adjust mixing strategy:

```yaml
data:
  mixing_strategy: interleave
  interleave_probs: null  # Auto-calculate based on dataset sizes
```
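With `interleave_probs: null`, a natural fallback is size-proportional probabilities; that interpretation is an assumption here, so verify against the source. The calculation itself is simple:

```python
def size_proportional_probs(sizes):
    """Interleaving probabilities proportional to dataset sizes."""
    total = sum(sizes)
    return [s / total for s in sizes]

# e.g. two datasets with 8,000 and 2,000 examples
print(size_proportional_probs([8000, 2000]))
# [0.8, 0.2]
```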

Next Steps

Optimize Your Training

Now that you understand datasets, explore Hyperparameter Tuning to find the best configuration for your data combination.