Multilang Preprocessing
The multilang preprocessor detects the language of each word with
lingua-language-detector and returns OverrideSpan objects that mark
where the language switches. Its output plugs into the span-based phonemization API.
API
- kokorog2p.multilang.preprocess_multilang(text: str, default_language: str = 'en-us', allowed_languages: list[str] | None = None, confidence_threshold: float = 0.7, phrase_overrides: dict[str, str] | None = None, min_token_length: int = 3) -> list[OverrideSpan]
Detect the language of each word and return OverrideSpan objects for language switching.
- Args:
text: Input text to annotate.
default_language: Base language for unmarked words.
allowed_languages: Language codes to detect (must include default_language).
confidence_threshold: Minimum confidence (0.0-1.0) to accept detection.
phrase_overrides: Optional dict mapping exact phrases to language codes.
min_token_length: Minimum token length for detection (default: 3).
- Returns:
List of OverrideSpan objects with language overrides for detected words.
- Raises:
ImportError: If lingua-language-detector is not installed.
ValueError: If allowed_languages is missing or default_language is not in allowed_languages.
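The way the filtering parameters described above might interact can be sketched with a toy detector. This is not the library's implementation: ToySpan, toy_detector, and toy_preprocess are hypothetical names, the lookup table stands in for lingua, and the real OverrideSpan fields and the exact comparison rules may differ.

```python
import re
from dataclasses import dataclass


@dataclass
class ToySpan:
    # Hypothetical stand-in for OverrideSpan; the real class's fields may differ.
    start: int
    end: int
    language: str


def toy_detector(word):
    # Toy lookup standing in for lingua; returns (language, confidence).
    table = {"schöne": ("de", 0.9), "bonjour": ("fr", 0.6)}
    return table.get(word.lower(), ("en-us", 1.0))


def toy_preprocess(text, default_language="en-us", allowed_languages=None,
                   confidence_threshold=0.7, min_token_length=3,
                   detector=toy_detector):
    # Mirror the documented ValueError conditions.
    if not allowed_languages:
        raise ValueError("allowed_languages is required")
    if default_language not in allowed_languages:
        raise ValueError("default_language must be in allowed_languages")

    spans = []
    for match in re.finditer(r"\w+", text):
        word = match.group()
        if len(word) < min_token_length:
            continue  # assumed: short tokens are skipped, not classified
        language, confidence = detector(word)
        if language == default_language or language not in allowed_languages:
            continue  # no override needed, or language not permitted
        if confidence >= confidence_threshold:
            spans.append(ToySpan(match.start(), match.end(), language))
    return spans


print(toy_preprocess("Schöne World", allowed_languages=["en-us", "de"]))
print(toy_preprocess("Bonjour World", allowed_languages=["en-us", "fr"]))
print(toy_preprocess("Bonjour World", allowed_languages=["en-us", "fr"],
                     confidence_threshold=0.5))
```

With the made-up confidences in the table, "Bonjour" (0.6) is rejected at the default threshold of 0.7 and accepted at 0.5, while "Schöne" (0.9) is accepted in both cases. Real lingua confidences will differ.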
Examples
Basic Usage
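from kokorog2p import phonemize
from kokorog2p.multilang import preprocess_multilang

text = "Schöne World"

# Detect per-word languages; allowed_languages must include default_language.
overrides = preprocess_multilang(
    text,
    default_language="en-us",
    allowed_languages=["en-us", "de"],
)

# Pass the detected spans to the span-based phonemization API.
result = phonemize(text, language="en-us", overrides=overrides)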
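Because the function is documented to raise ImportError when lingua-language-detector is missing, callers that treat language detection as optional may want a fallback. The wrapper below is a minimal sketch: detect_overrides is a hypothetical helper, not part of kokorog2p, and it assumes that an empty override list is an acceptable substitute when detection is unavailable.

```python
def detect_overrides(text, default_language="en-us", allowed_languages=None):
    """Return detected language overrides, or an empty list if detection is unavailable.

    Hypothetical helper; not part of kokorog2p.
    """
    try:
        # Import lazily so a missing package surfaces here as ImportError.
        from kokorog2p.multilang import preprocess_multilang

        return preprocess_multilang(
            text,
            default_language=default_language,
            allowed_languages=allowed_languages,
        )
    except ImportError:
        # Assumption: falling back to no overrides is acceptable for the caller.
        return []


overrides = detect_overrides("Schöne World", allowed_languages=["en-us", "de"])
print(len(overrides))
```

ValueError is deliberately not caught, since it signals a configuration mistake rather than a missing dependency.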
Confidence Tuning
from kokorog2p.multilang import preprocess_multilang

# Lowering confidence_threshold from the default 0.7 accepts less certain detections.
overrides = preprocess_multilang(
    "Bonjour World",
    default_language="en-us",
    allowed_languages=["en-us", "fr"],
    confidence_threshold=0.5,
)