Advanced Usage ============== This guide covers advanced features and usage patterns for kokorog2p. Custom G2P Configuration ------------------------ Memory-Efficient Loading ~~~~~~~~~~~~~~~~~~~~~~~~ Control dictionary loading to optimize memory and initialization time: .. code-block:: python from kokorog2p import get_g2p # Default: Gold + Silver dictionaries (~365k entries, ~57 MB) # Provides maximum vocabulary coverage g2p = get_g2p("en-us") # Memory-optimized: Gold dictionary only (~179k entries, ~35 MB) # Saves ~22-31 MB memory and ~400-470 ms initialization time g2p_fast = get_g2p("en-us", load_silver=False) # Ultra-fast initialization: No dictionaries (~7 MB, espeak fallback only) # Saves ~50+ MB memory, fastest initialization g2p_minimal = get_g2p("en-us", load_silver=False, load_gold=False) # Check dictionary size print(f"Gold entries: {len(g2p.lexicon.golds):,}") print(f"Silver entries: {len(g2p.lexicon.silvers):,}") **Dictionary loading configurations:** * ``load_gold=True, load_silver=True``: Maximum coverage (default, ~365k entries) * ``load_gold=True, load_silver=False``: Common words only (~179k entries, -22-31 MB) * ``load_gold=False, load_silver=True``: Extended vocabulary only (unusual, ~187k entries) * ``load_gold=False, load_silver=False``: Ultra-fast (espeak only, -50+ MB) **When to disable dictionaries:** * **Disable silver** (``load_silver=False``): * Resource-constrained environments (limited memory) * Real-time applications (faster initialization) * You only need common vocabulary * Production deployments where performance is critical * **Disable both** (``load_gold=False, load_silver=False``): * Ultra-fast initialization is critical * You're fine with espeak-only fallback * Minimal memory footprint required * Testing or prototyping **Default (both enabled) provides:** * Maximum vocabulary coverage (~365k total entries) * Best phoneme quality from curated dictionaries * Backward compatibility with existing code Disabling Features ~~~~~~~~~~~~~~~~~~ You can disable specific features for better performance or control: .. code-block:: python from kokorog2p.en import EnglishG2P # Disable espeak fallback g2p = EnglishG2P( language="en-us", use_espeak_fallback=False, # Unknown words will have no phonemes use_spacy=True, spacy_model="en_core_web_md", # default ) # Disable spaCy (faster but no POS tagging) g2p = EnglishG2P( language="en-us", use_espeak_fallback=True, use_spacy=False # Faster tokenization ) # Minimal configuration (fastest) g2p = EnglishG2P( language="en-us", use_espeak_fallback=False, use_spacy=False, load_silver=False, load_gold=False # No dictionaries, ultra-fast ) spaCy Model Selection (English) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ English G2P lets you choose which spaCy model to use for POS tagging. This affects homograph and heteronym disambiguation quality (for example, ``lives`` noun vs verb). .. code-block:: python from kokorog2p.en import EnglishG2P # Default (recommended balance) g2p_md = EnglishG2P(use_spacy=True, spacy_model="en_core_web_md") # Smaller model (lower memory / faster downloads) g2p_sm = EnglishG2P(use_spacy=True, spacy_model="en_core_web_sm") # Largest model (highest spaCy English accuracy, highest memory) g2p_lg = EnglishG2P(use_spacy=True, spacy_model="en_core_web_lg") The same option is also available through ``get_g2p()``: .. code-block:: python from kokorog2p import get_g2p g2p = get_g2p("en-us", use_spacy=True, spacy_model="en_core_web_md") Stress Control ~~~~~~~~~~~~~~ Control stress marker output: .. code-block:: python from kokorog2p.de import GermanG2P # Strip stress markers from output g2p = GermanG2P( language="de-de", strip_stress=True # Remove ˈ and ˌ markers ) Token Inspection ---------------- Tokens contain detailed information: .. code-block:: python from kokorog2p import get_g2p g2p = get_g2p("en-us", use_spacy=True) tokens = g2p("I can't believe it!") for token in tokens: # Basic attributes print(f"Text: {token.text}") print(f"Phonemes: {token.phonemes}") print(f"POS tag: {token.tag}") print(f"Whitespace: '{token.whitespace}'") # Additional metadata rating = token.get("rating") # 5=dictionary, 2=espeak, 0=unknown print(f"Rating: {rating}") # Check token type is_punct = not any(c.isalnum() for c in token.text) print(f"Is punctuation: {is_punct}") Rating System ~~~~~~~~~~~~~ Tokens have a rating indicating the source of phonemes: * **5**: User-provided (via OverrideSpan) or gold dictionary (highest quality) * **4**: Punctuation * **3**: Silver dictionary or rule-based conversion * **2**: From espeak-ng fallback * **1**: From goruut backend * **0**: Unknown/failed .. code-block:: python from kokorog2p import get_g2p g2p = get_g2p("en-us") tokens = g2p("Hello xyznotaword!") for token in tokens: rating = token.get("rating", 0) if rating == 5: print(f"{token.text}: High quality (gold dictionary)") elif rating == 3: print(f"{token.text}: Silver dictionary") elif rating == 2: print(f"{token.text}: Fallback (espeak)") elif rating == 0: print(f"{token.text}: Unknown") Dictionary Lookup ----------------- Direct dictionary access: .. code-block:: python from kokorog2p.en import EnglishG2P # Load with or without silver dataset g2p_gold = EnglishG2P(language="en-us", load_silver=False) g2p_full = EnglishG2P(language="en-us", load_silver=True) # Simple lookup phonemes = g2p_gold.lexicon.lookup("hello") print(phonemes) # həlˈO # Check if word is in dictionary if g2p_gold.lexicon.is_known("hello"): print("Word is in gold dictionary") # Get dictionary sizes print(f"Gold: {len(g2p_gold.lexicon.golds):,} entries") print(f"Silver: {len(g2p_full.lexicon.silvers):,} entries") # POS-aware lookup phonemes_verb = g2p_gold.lexicon.lookup("read", tag="VB") # ɹˈid (present) phonemes_past = g2p_gold.lexicon.lookup("read", tag="VBD") # ɹˈɛd (past) German Lexicon ~~~~~~~~~~~~~~ .. code-block:: python from kokorog2p.de import GermanLexicon lexicon = GermanLexicon(strip_stress=False) phonemes = lexicon.lookup("Haus") print(phonemes) # haʊ̯s print(f"Dictionary has {len(lexicon):,} entries") # 738,427 Phoneme Utilities ----------------- Validation ~~~~~~~~~~ Validate phonemes against Kokoro vocabulary: .. code-block:: python from kokorog2p import validate_phonemes, get_vocab # Check if phonemes are valid valid = validate_phonemes("hˈɛlO") print(valid) # True invalid = validate_phonemes("xyz123") print(invalid) # False # Get the full vocabulary vocab = get_vocab("us") print(f"US vocabulary: {len(vocab)} phonemes") Conversion ~~~~~~~~~~ Convert between different phoneme formats: .. code-block:: python from kokorog2p import from_espeak, to_espeak # Convert espeak IPA to Kokoro espeak_ipa = "həlˈəʊ" kokoro_phonemes = from_espeak(espeak_ipa, variant="us") print(kokoro_phonemes) # hˈɛlO # Convert Kokoro to espeak IPA kokoro = "hˈɛlO" espeak = to_espeak(kokoro, variant="us") print(espeak) Vocabulary Encoding ------------------- Convert phonemes to IDs for model input: .. code-block:: python from kokorog2p import phonemes_to_ids, ids_to_phonemes # Encode phonemes phonemes = "hˈɛlO wˈɜɹld" ids = phonemes_to_ids(phonemes) print(ids) # [12, 45, 23, ...] # Decode back decoded = ids_to_phonemes(ids) print(decoded) # hˈɛlO wˈɜɹld # Get Kokoro vocabulary from kokorog2p import get_kokoro_vocab vocab = get_kokoro_vocab() print(f"Kokoro has {len(vocab)} tokens") Quote Handling -------------- kokorog2p provides sophisticated quote handling with support for nested quotes and automatic conversion to curly quotes. Nested Quote Detection ~~~~~~~~~~~~~~~~~~~~~~ The tokenizer supports two modes for handling quotes: .. code-block:: python from kokorog2p import get_g2p # Default: Bracket-matching mode (supports nesting) g2p = get_g2p("en-us") tokens = g2p('He said "She used `backticks` here"') # Check quote depths for token in tokens: depth = token.quote_depth print(f"{token.text}: depth={depth}") # Output shows nesting: "=1, `=2, `=2, "=1 **Bracket-Matching Mode** (default): * Supports nested quotes when using **different** quote characters * Maintains a stack to track nesting depth * Supported quote characters: ``"`` (double quote), `````` (backtick), ``'`` (single quote) * Depth increases with each level of nesting (1 = outermost, 2 = nested once, etc.) **Important**: Nesting only works with different quote types: * ✅ **Supported**: ``"outer `inner` text"`` → depths ``[1, 2, 2, 1]`` (different quotes) * ❌ **NOT supported**: ``"level1 "level2""`` → depths ``[1, 1, 1, 1]`` (same quotes alternate) Examples: .. code-block:: python from kokorog2p.pipeline.tokenizer import RegexTokenizer # Create tokenizer with bracket matching (default) tokenizer = RegexTokenizer(use_bracket_matching=True) # Simple pair tokens = tokenizer.tokenize('"hello"', '"hello"') # Quote depths: [1, 1] # Nested quotes (different types) tokens = tokenizer.tokenize('"outer `inner` text"', '"outer `inner` text"') # Quote depths: [1, 2, 2, 1] # Multiple separate pairs tokens = tokenizer.tokenize('"first" and "second"', '"first" and "second"') # Quote depths: [1, 1, 1, 1] # Triple nesting (different types) tokens = tokenizer.tokenize('"a `b \'c\' d` e"', '"a `b \'c\' d` e"') # Quote depths: [1, 2, 3, 3, 2, 1] **Simple Alternation Mode**: For simpler use cases without nesting support: .. code-block:: python from kokorog2p.pipeline.tokenizer import RegexTokenizer # Disable bracket matching for simple alternation tokenizer = RegexTokenizer(use_bracket_matching=False) # First quote opens (depth 1), second closes (depth 0) tokens = tokenizer.tokenize('"hello" world', '"hello" world') # Quote depths: [1, 0, 0] Curly Quote Conversion ~~~~~~~~~~~~~~~~~~~~~~ The tokenizer automatically converts straight quotes to curly quotes based on nesting depth: .. code-block:: python from kokorog2p import get_g2p g2p = get_g2p("en-us") # Straight quotes converted to curly quotes tokens = g2p('She said "hello"') # First quote becomes left curly ("), last becomes right curly (") quote_chars = [t.text for t in tokens if t.text in ('"', '"')] print(quote_chars) # ['"', '"'] **Conversion Rules**: * Opening quotes (depth increases) → left curly quote ``"`` (U+201C) * Closing quotes (depth decreases) → right curly quote ``"`` (U+201D) * Backticks follow the same pattern as double quotes * Single quotes use standard apostrophe ``'`` (U+0027) Quote Depth in Custom Processing ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Access quote depth for custom processing: .. code-block:: python from kokorog2p import get_g2p g2p = get_g2p("en-us") tokens = g2p('He said "She whispered `quietly`"') # Analyze quote nesting for token in tokens: if token.quote_depth > 0: indent = " " * (token.quote_depth - 1) print(f"{indent}[{token.quote_depth}] {token.text}") Output shows nesting structure: .. code-block:: text [1] " [1] She [1] whispered [2] ` [2] quietly [2] ` [1] " Punctuation Handling -------------------- Automatic Normalization ~~~~~~~~~~~~~~~~~~~~~~~ kokorog2p automatically normalizes punctuation variants to ensure consistency with Kokoro TTS vocabulary: .. code-block:: python from kokorog2p import get_g2p g2p = get_g2p("en-us") # Ellipsis variants → single ellipsis character (…) tokens = g2p("Wait... really?") # ... → … tokens = g2p("Wait. . . really?") # . . . → … tokens = g2p("Wait.. really?") # .. → … tokens = g2p("Wait…really?") # … preserved # Dash variants → em dash (—) tokens = g2p("Wait - what?") # spaced hyphen → em dash tokens = g2p("Wait -- what?") # double hyphen → em dash tokens = g2p("Wait – what?") # en dash → em dash tokens = g2p("Wait — what?") # em dash preserved tokens = g2p("Wait ― what?") # horizontal bar → em dash tokens = g2p("Wait ‒ what?") # figure dash → em dash tokens = g2p("Wait − what?") # minus sign → em dash # Compound words preserve hyphens (no normalization) tokens = g2p("well-known") # hyphen removed, words joined tokens = g2p("state-of-the-art") # hyphens removed, words joined **Normalization Rules:** * **Ellipsis**: All variants (``...``, ``. . .``, ``..``, ``....``) → ``…`` (U+2026) * **Em dash**: All dash types when spaced (``-``, ``--``, ``–``, ``—``, ``―``, ``‒``, ``−``) → ``—`` (U+2014) * **Hyphens in compound words**: Preserved during tokenization, then removed in phoneme output * **Apostrophes**: All variants (``'``, ``'``, ``'``, ````, ``´``, etc.) → ``'`` (U+0027) Manual Normalization ~~~~~~~~~~~~~~~~~~~~ Control punctuation normalization manually: .. code-block:: python from kokorog2p import normalize_punctuation, filter_punctuation # Normalize to Kokoro punctuation text = "Hello... world!!!" normalized = normalize_punctuation(text) print(normalized) # Hello. world! # Filter out non-Kokoro punctuation phonemes = "hˈɛlO… wˈɜɹld‼" filtered = filter_punctuation(phonemes) print(filtered) # hˈɛlO. wˈɜɹld! # Check if punctuation is valid from kokorog2p import is_kokoro_punctuation print(is_kokoro_punctuation("!")) # True print(is_kokoro_punctuation("…")) # True (normalized automatically) print(is_kokoro_punctuation("‼")) # False Word Mismatch Detection ----------------------- Detect mismatches between input text and phoneme output: .. code-block:: python from kokorog2p import detect_mismatches text = "Hello world!" phonemes = "hɛlO wɜɹld !" mismatches = detect_mismatches(text, phonemes) for mismatch in mismatches: print(f"Position {mismatch.position}:") print(f" Input word: {mismatch.input_word}") print(f" Output word: {mismatch.output_word}") print(f" Type: {mismatch.type}") Number Expansion ---------------- Customize number handling: English ~~~~~~~ .. code-block:: python from kokorog2p.en.numbers import EnglishNumberConverter converter = EnglishNumberConverter() # Cardinals print(converter.convert_cardinal("42")) # → forty-two # Ordinals print(converter.convert_ordinal("42")) # → forty-second # Years print(converter.convert_year("1984")) # → nineteen eighty-four # Currency print(converter.convert_currency("12.50", "$")) # → twelve dollars and fifty cents # Decimals print(converter.convert_decimal("3.14")) # → three point one four German ~~~~~~ .. code-block:: python from kokorog2p.de.numbers import GermanNumberConverter converter = GermanNumberConverter() # Cardinals print(converter.convert_cardinal("42")) # → zweiundvierzig # Ordinals print(converter.convert_ordinal("42")) # → zweiundvierzigste # Years print(converter.convert_year("1984")) # → neunzehnhundertvierundachtzig # Currency print(converter.convert_currency("12,50", "€")) # → zwölf Euro fünfzig Custom Backend Selection ------------------------- Choose specific backends: .. code-block:: python from kokorog2p import get_g2p # Use espeak backend g2p_espeak = get_g2p("en-us", backend="espeak") # Use goruut backend (if installed) g2p_goruut = get_g2p("en-us", backend="goruut") Direct Backend Access ~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python from kokorog2p.backends.espeak import EspeakBackend # Create espeak backend backend = EspeakBackend(language="en-us") # Phonemize a word phonemes = backend.phonemize("hello") print(phonemes) Caching and Performance ----------------------- Managing Cache ~~~~~~~~~~~~~~ .. code-block:: python from kokorog2p import get_g2p, clear_cache # G2P instances are cached by language and settings g2p1 = get_g2p("en-us", use_spacy=True) g2p2 = get_g2p("en-us", use_spacy=True) assert g2p1 is g2p2 # Same instance # Different settings = different cache entry g2p3 = get_g2p("en-us", use_spacy=False) assert g2p1 is not g2p3 # Different instance # load_silver and load_gold also affect caching g2p4 = get_g2p("en-us", load_silver=False) assert g2p1 is not g2p4 # Different instance (different silver setting) g2p5 = get_g2p("en-us", load_gold=False) assert g2p1 is not g2p5 # Different instance (different gold setting) # Clear cache when needed clear_cache() Batch Processing ~~~~~~~~~~~~~~~~ For best performance when processing many texts: .. code-block:: python from kokorog2p import get_g2p # Create instance once g2p = get_g2p("en-us") texts = ["Hello", "World", "This", "Is", "Fast"] # Process many texts with same instance all_tokens = [] for text in texts: tokens = g2p(text) all_tokens.append(tokens) Custom Phoneme Filtering ------------------------- Filter phonemes for specific use cases: .. code-block:: python from kokorog2p import get_g2p, validate_for_kokoro, filter_for_kokoro g2p = get_g2p("en-us") tokens = g2p("Hello world!") phoneme_str = " ".join(t.phonemes for t in tokens if t.phonemes) # Validate for Kokoro is_valid = validate_for_kokoro(phoneme_str) # Filter to keep only valid Kokoro phonemes filtered = filter_for_kokoro(phoneme_str) print(filtered) Multilang Preprocessing ------------------------ Use ``preprocess_multilang`` to get language override spans for mixed-language text. This integrates with the span-based phonemization API. .. code-block:: python from kokorog2p import phonemize from kokorog2p.multilang import preprocess_multilang text = "Hello, mein Freund! Bonjour!" overrides = preprocess_multilang( text, default_language="de", allowed_languages=["de", "en-us", "fr"], confidence_threshold=0.6, ) result = phonemize(text, language="de", overrides=overrides) Confidence Tuning ~~~~~~~~~~~~~~~~~ Adjust detection sensitivity based on your use case: .. code-block:: python from kokorog2p.multilang import preprocess_multilang text = "Das Meeting ist wichtig" conservative = preprocess_multilang( text, default_language="de", allowed_languages=["de", "en-us"], confidence_threshold=0.9, ) aggressive = preprocess_multilang( text, default_language="de", allowed_languages=["de", "en-us"], confidence_threshold=0.5, ) Integration with Span API ~~~~~~~~~~~~~~~~~~~~~~~~~~ Combine language detection with other span overrides: .. code-block:: python from kokorog2p import phonemize, OverrideSpan from kokorog2p.multilang import preprocess_multilang text = "Das Meeting ist wichtig" # Get language overrides lang_overrides = preprocess_multilang( text, default_language="de", allowed_languages=["de", "en-us"], ) # Add custom phoneme override all_overrides = lang_overrides + [ OverrideSpan(4, 11, {"ph": "ˈmiːtɪŋ"}) # Custom pronunciation for "Meeting" ] result = phonemize(text, language="de", overrides=all_overrides) Error Handling -------------- kokorog2p provides robust error handling to help you debug issues, especially in CI/CD environments. Strict Mode (Default) ~~~~~~~~~~~~~~~~~~~~~ By default, kokorog2p uses **strict mode** (``strict=True``), which raises clear exceptions when backend initialization or phonemization fails: .. code-block:: python from kokorog2p import get_g2p # Strict mode is the default g2p = get_g2p("en-us", backend="espeak", strict=True) try: result = g2p.phonemize("test") except RuntimeError as e: # Get detailed error message about what went wrong print(f"Error: {e}") # Example: "Espeak backend validation failed. Please ensure espeak-ng # is properly installed and voice 'en-us' is available." **Benefits of strict mode:** * Catches configuration issues immediately * Provides actionable error messages * Prevents silent failures in CI/CD pipelines * Recommended for production use Lenient Mode (Backward Compatible) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For backward compatibility with older versions (< 0.4.0) that silently failed, you can use **lenient mode** (``strict=False``): .. code-block:: python from kokorog2p import get_g2p # Lenient mode logs errors but doesn't raise exceptions g2p = get_g2p("en-us", backend="espeak", strict=False) result = g2p.phonemize("test") # If espeak fails: # - Error is logged to Python's logging system # - Returns empty string "" instead of raising exception # - Allows your application to continue running **When to use lenient mode:** * Migrating from older versions (< 0.4.0) * Non-critical applications where empty results are acceptable * When you have your own error handling logic Common Error Scenarios ~~~~~~~~~~~~~~~~~~~~~~ **espeak-ng not installed:** .. code-block:: python # Strict mode (default) g2p = get_g2p("en-us", backend="espeak") # RuntimeError: Espeak backend validation failed. Please ensure espeak-ng # is properly installed... # Solution: Install espeak-ng # Ubuntu/Debian: sudo apt-get install espeak-ng # macOS: brew install espeak # Windows: Download from https://github.com/espeak-ng/espeak-ng/releases **Invalid voice:** .. code-block:: python from kokorog2p.espeak_g2p import EspeakOnlyG2P g2p = EspeakOnlyG2P(language="xx-invalid") # RuntimeError: Espeak backend validation failed...voice 'xx-invalid' is unavailable **CI/CD Best Practices:** .. code-block:: python import logging # Configure logging to see error details logging.basicConfig(level=logging.INFO) # Use strict mode in CI to catch issues early (this is the default) g2p = get_g2p("en-us", backend="espeak", strict=True) # Your CI will fail with clear error messages if there are issues **Handling missing dependencies:** .. code-block:: python from kokorog2p import get_g2p try: # This might fail if Chinese dependencies not installed g2p = get_g2p("zh") tokens = g2p("你好") except ImportError as e: print(f"Missing dependency: {e}") print("Install with: pip install kokorog2p[zh]") try: # This might fail if spaCy model not downloaded g2p = get_g2p("en-us", use_spacy=True) except OSError as e: print("spaCy model not found") print("Download with: python -m spacy download en_core_web_md") Configuring with Different Backends ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The ``strict`` parameter works with all backends: .. code-block:: python from kokorog2p import get_g2p # Espeak backend with strict mode g2p_espeak = get_g2p("en-us", backend="espeak", strict=True) # Goruut backend with strict mode g2p_goruut = get_g2p("en-us", backend="goruut", strict=True) # Dictionary-based with fallback (strict controls fallback/init errors) g2p_dict = get_g2p( "en-us", backend="kokorog2p", use_espeak_fallback=True, strict=True # Affects fallback initialization and errors ) Next Steps ---------- * See :doc:`api/core` for detailed API reference * Check :doc:`languages` for language-specific features * Read :doc:`phonemes` to understand the phoneme inventory