Advanced Usage

This guide covers advanced features and usage patterns for kokorog2p.

Custom G2P Configuration

Memory-Efficient Loading

Control dictionary loading to optimize memory and initialization time:

from kokorog2p import get_g2p

# Default: Gold + Silver dictionaries (~365k entries, ~57 MB)
# Provides maximum vocabulary coverage
g2p = get_g2p("en-us")

# Memory-optimized: Gold dictionary only (~179k entries, ~35 MB)
# Saves ~22-31 MB memory and ~400-470 ms initialization time
g2p_fast = get_g2p("en-us", load_silver=False)

# Ultra-fast initialization: No dictionaries (~7 MB, espeak fallback only)
# Saves ~50+ MB memory, fastest initialization
g2p_minimal = get_g2p("en-us", load_silver=False, load_gold=False)

# Check dictionary size
print(f"Gold entries: {len(g2p.lexicon.golds):,}")
print(f"Silver entries: {len(g2p.lexicon.silvers):,}")

Dictionary loading configurations:

  • load_gold=True, load_silver=True: Maximum coverage (default, ~365k entries)

  • load_gold=True, load_silver=False: Common words only (~179k entries, -22-31 MB)

  • load_gold=False, load_silver=True: Extended vocabulary only (unusual, ~187k entries)

  • load_gold=False, load_silver=False: Ultra-fast (espeak only, -50+ MB)

When to disable dictionaries:

  • Disable silver (load_silver=False):

      • Resource-constrained environments (limited memory)

      • Real-time applications (faster initialization)

      • You only need common vocabulary

      • Production deployments where performance is critical

  • Disable both (load_gold=False, load_silver=False):

      • Ultra-fast initialization is critical

      • You’re fine with espeak-only fallback

      • Minimal memory footprint required

      • Testing or prototyping

Default (both enabled) provides:

  • Maximum vocabulary coverage (~365k total entries)

  • Best phoneme quality from curated dictionaries

  • Backward compatibility with existing code
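The trade-offs above can be folded into a small helper that picks loader flags from a memory budget. This is only a sketch: choose_dictionary_flags is a hypothetical helper, not part of kokorog2p, and its thresholds simply reuse the approximate footprints quoted in this section (~57 MB, ~35 MB, ~7 MB).

```python
def choose_dictionary_flags(memory_budget_mb: float) -> dict:
    """Pick load_gold/load_silver flags for a rough memory budget.

    Hypothetical helper: the thresholds are the approximate footprints
    quoted in this guide, not values measured on your machine.
    """
    if memory_budget_mb >= 57:
        # Enough room for gold + silver (maximum coverage).
        return {"load_gold": True, "load_silver": True}
    if memory_budget_mb >= 35:
        # Gold only: common vocabulary, smaller footprint.
        return {"load_gold": True, "load_silver": False}
    # No dictionaries: relies on the espeak fallback only.
    return {"load_gold": False, "load_silver": False}


flags = choose_dictionary_flags(40)
print(flags)
```

The resulting dictionary could then be passed on as keyword arguments, for example get_g2p("en-us", **flags).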

Disabling Features

You can disable specific features for better performance or control:

from kokorog2p.en import EnglishG2P

# Disable espeak fallback
g2p = EnglishG2P(
    language="en-us",
    use_espeak_fallback=False,  # Unknown words will have no phonemes
    use_spacy=True,
    spacy_model="en_core_web_md",  # default
)

# Disable spaCy (faster but no POS tagging)
g2p = EnglishG2P(
    language="en-us",
    use_espeak_fallback=True,
    use_spacy=False  # Faster tokenization
)

# Minimal configuration (fastest)
g2p = EnglishG2P(
    language="en-us",
    use_espeak_fallback=False,
    use_spacy=False,
    load_silver=False,
    load_gold=False  # No dictionaries, ultra-fast
)

spaCy Model Selection (English)

English G2P lets you choose which spaCy model to use for POS tagging. The choice affects how well homographs and heteronyms are disambiguated (for example, "lives" as a noun vs. as a verb).

from kokorog2p.en import EnglishG2P

# Default (recommended balance)
g2p_md = EnglishG2P(use_spacy=True, spacy_model="en_core_web_md")

# Smaller model (lower memory / faster downloads)
g2p_sm = EnglishG2P(use_spacy=True, spacy_model="en_core_web_sm")

# Largest model (highest spaCy English accuracy, highest memory)
g2p_lg = EnglishG2P(use_spacy=True, spacy_model="en_core_web_lg")

The same option is also available through get_g2p():

from kokorog2p import get_g2p

g2p = get_g2p("en-us", use_spacy=True, spacy_model="en_core_web_md")

Stress Control

Control stress marker output:

from kokorog2p.de import GermanG2P

# Strip stress markers from output
g2p = GermanG2P(
    language="de-de",
    strip_stress=True  # Remove ˈ and ˌ markers
)

Token Inspection

Tokens contain detailed information:

from kokorog2p import get_g2p

g2p = get_g2p("en-us", use_spacy=True)
tokens = g2p("I can't believe it!")

for token in tokens:
    # Basic attributes
    print(f"Text: {token.text}")
    print(f"Phonemes: {token.phonemes}")
    print(f"POS tag: {token.tag}")
    print(f"Whitespace: '{token.whitespace}'")

    # Additional metadata
    rating = token.get("rating")  # 5=dictionary, 2=espeak, 0=unknown
    print(f"Rating: {rating}")

    # Check token type
    is_punct = not any(c.isalnum() for c in token.text)
    print(f"Is punctuation: {is_punct}")

Rating System

Tokens have a rating indicating the source of phonemes:

  • 5: User-provided (via OverrideSpan) or gold dictionary (highest quality)

  • 4: Punctuation

  • 3: Silver dictionary or rule-based conversion

  • 2: From espeak-ng fallback

  • 1: From goruut backend

  • 0: Unknown/failed

from kokorog2p import get_g2p

g2p = get_g2p("en-us")
tokens = g2p("Hello xyznotaword!")

for token in tokens:
    rating = token.get("rating", 0)
    if rating == 5:
        print(f"{token.text}: High quality (gold dictionary)")
    elif rating == 3:
        print(f"{token.text}: Silver dictionary")
    elif rating == 2:
        print(f"{token.text}: Fallback (espeak)")
    elif rating == 0:
        print(f"{token.text}: Unknown")
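To get an overview of where the phonemes of a whole utterance came from, the ratings can be tallied. The sketch below uses plain dicts as stand-in tokens, since they offer the same .get("rating", default) call used above; summarize_ratings and RATING_LABELS are hypothetical names, not part of kokorog2p.

```python
from collections import Counter

# Labels follow the rating list in this section.
RATING_LABELS = {
    5: "gold dictionary or user override",
    4: "punctuation",
    3: "silver dictionary or rule-based",
    2: "espeak-ng fallback",
    1: "goruut backend",
    0: "unknown/failed",
}


def summarize_ratings(tokens) -> Counter:
    """Count tokens per phoneme source, treating a missing rating as 0."""
    counts = Counter()
    for token in tokens:
        rating = token.get("rating", 0)
        counts[RATING_LABELS.get(rating, "unknown/failed")] += 1
    return counts


# Stand-in tokens for illustration; real tokens come from g2p(text).
fake_tokens = [{"rating": 5}, {"rating": 2}, {"rating": 4}, {}]
summary = summarize_ratings(fake_tokens)
print(summary)
```

A high share of "espeak-ng fallback" or "unknown/failed" entries may suggest that the input contains vocabulary outside the loaded dictionaries.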

Dictionary Lookup

Direct dictionary access:

from kokorog2p.en import EnglishG2P

# Load with or without silver dataset
g2p_gold = EnglishG2P(language="en-us", load_silver=False)
g2p_full = EnglishG2P(language="en-us", load_silver=True)

# Simple lookup
phonemes = g2p_gold.lexicon.lookup("hello")
print(phonemes)  # həlˈO

# Check if word is in dictionary
if g2p_gold.lexicon.is_known("hello"):
    print("Word is in gold dictionary")

# Get dictionary sizes
print(f"Gold: {len(g2p_gold.lexicon.golds):,} entries")
print(f"Silver: {len(g2p_full.lexicon.silvers):,} entries")

# POS-aware lookup
phonemes_verb = g2p_gold.lexicon.lookup("read", tag="VB")   # ɹˈid (present)
phonemes_past = g2p_gold.lexicon.lookup("read", tag="VBD")  # ɹˈɛd (past)

German Lexicon

from kokorog2p.de import GermanLexicon

lexicon = GermanLexicon(strip_stress=False)

phonemes = lexicon.lookup("Haus")
print(phonemes)  # haʊ̯s

print(f"Dictionary has {len(lexicon):,} entries")  # 738,427

Phoneme Utilities

Validation

Validate phonemes against Kokoro vocabulary:

from kokorog2p import validate_phonemes, get_vocab

# Check if phonemes are valid
valid = validate_phonemes("hˈɛlO")
print(valid)  # True

invalid = validate_phonemes("xyz123")
print(invalid)  # False

# Get the full vocabulary
vocab = get_vocab("us")
print(f"US vocabulary: {len(vocab)} phonemes")

Conversion

Convert between different phoneme formats:

from kokorog2p import from_espeak, to_espeak

# Convert espeak IPA to Kokoro
espeak_ipa = "həlˈəʊ"
kokoro_phonemes = from_espeak(espeak_ipa, variant="us")
print(kokoro_phonemes)  # Kokoro-format phoneme string

# Convert Kokoro to espeak IPA
kokoro = "hˈɛlO"
espeak = to_espeak(kokoro, variant="us")
print(espeak)

Vocabulary Encoding

Convert phonemes to IDs for model input:

from kokorog2p import phonemes_to_ids, ids_to_phonemes

# Encode phonemes
phonemes = "hˈɛlO wˈɜɹld"
ids = phonemes_to_ids(phonemes)
print(ids)  # [12, 45, 23, ...]

# Decode back
decoded = ids_to_phonemes(ids)
print(decoded)  # hˈɛlO wˈɜɹld

# Get Kokoro vocabulary
from kokorog2p import get_kokoro_vocab
vocab = get_kokoro_vocab()
print(f"Kokoro has {len(vocab)} tokens")
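Conceptually, encoding is a symbol-to-ID lookup and decoding is the inverse. The following sketch shows that round trip with a toy vocabulary; the IDs are invented for illustration and do not match the real Kokoro vocabulary, and silently dropping unknown symbols is an assumption of this sketch rather than documented library behavior.

```python
# Toy vocabulary for illustration only; the real Kokoro IDs differ.
toy_vocab = {symbol: idx for idx, symbol in enumerate(" hˈɛlOwɜɹd", start=1)}
toy_inverse = {idx: symbol for symbol, idx in toy_vocab.items()}


def encode(phonemes: str, vocab: dict) -> list:
    """Map each symbol to its ID, skipping symbols missing from vocab."""
    return [vocab[ch] for ch in phonemes if ch in vocab]


def decode(ids: list, inverse: dict) -> str:
    """Map IDs back to symbols and join them into a phoneme string."""
    return "".join(inverse[i] for i in ids)


toy_ids = encode("hˈɛlO wˈɜɹld", toy_vocab)
round_trip = decode(toy_ids, toy_inverse)
print(toy_ids)
print(round_trip)
```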

Quote Handling

kokorog2p provides sophisticated quote handling with support for nested quotes and automatic conversion to curly quotes.

Nested Quote Detection

The tokenizer supports two modes for handling quotes:

from kokorog2p import get_g2p

# Default: Bracket-matching mode (supports nesting)
g2p = get_g2p("en-us")
tokens = g2p('He said "She used `backticks` here"')

# Check quote depths
for token in tokens:
    depth = token.quote_depth
    print(f"{token.text}: depth={depth}")
# Output shows nesting: "=1, `=2, `=2, "=1

Bracket-Matching Mode (default):

  • Supports nested quotes when using different quote characters

  • Maintains a stack to track nesting depth

  • Supported quote characters: " (double quote), ` (backtick), ' (single quote)

  • Depth increases with each level of nesting (1 = outermost, 2 = nested once, etc.)

Important: Nesting only works with different quote types:

  • Supported: "outer `inner` text" → depths [1, 2, 2, 1] (different quotes)

  • NOT supported: "level1 "level2"" → depths [1, 1, 1, 1] (same quotes alternate)

Examples:

from kokorog2p.pipeline.tokenizer import RegexTokenizer

# Create tokenizer with bracket matching (default)
tokenizer = RegexTokenizer(use_bracket_matching=True)

# Simple pair
tokens = tokenizer.tokenize('"hello"', '"hello"')
# Quote depths: [1, 1]

# Nested quotes (different types)
tokens = tokenizer.tokenize('"outer `inner` text"', '"outer `inner` text"')
# Quote depths: [1, 2, 2, 1]

# Multiple separate pairs
tokens = tokenizer.tokenize('"first" and "second"', '"first" and "second"')
# Quote depths: [1, 1, 1, 1]

# Triple nesting (different types)
tokens = tokenizer.tokenize('"a `b \'c\' d` e"', '"a `b \'c\' d` e"')
# Quote depths: [1, 2, 3, 3, 2, 1]
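The stack-based depth tracking described above can be sketched in a few lines. This is an illustration of the idea, not the RegexTokenizer implementation: quote_depths is a hypothetical function that looks only at the three quote characters and does not handle apostrophes inside words such as "can't".

```python
QUOTE_CHARS = {'"', "`", "'"}


def quote_depths(text: str) -> list[int]:
    """Return the nesting depth of each quote character in text.

    A quote closes the innermost open quote when it is the same
    character; otherwise it opens a new nesting level.
    """
    stack: list[str] = []
    depths: list[int] = []
    for ch in text:
        if ch not in QUOTE_CHARS:
            continue
        if stack and stack[-1] == ch:
            depths.append(len(stack))  # closing quote reports its own level
            stack.pop()
        else:
            stack.append(ch)
            depths.append(len(stack))  # opening quote starts a new level
    return depths


print(quote_depths('"outer `inner` text"'))
print(quote_depths('"level1 "level2""'))
```

Because a repeated character always closes the innermost open quote, identical quote characters can only alternate between opening and closing, which is why same-type nesting stays at depth 1.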

Simple Alternation Mode:

For simpler use cases without nesting support:

from kokorog2p.pipeline.tokenizer import RegexTokenizer

# Disable bracket matching for simple alternation
tokenizer = RegexTokenizer(use_bracket_matching=False)

# First quote opens (depth 1), second closes (depth 0)
tokens = tokenizer.tokenize('"hello" world', '"hello" world')
# Quote depths: [1, 0, 0]

Curly Quote Conversion

The tokenizer automatically converts straight quotes to curly quotes based on nesting depth:

from kokorog2p import get_g2p

g2p = get_g2p("en-us")

# Straight quotes converted to curly quotes
tokens = g2p('She said "hello"')

# First quote becomes left curly (“), last becomes right curly (”)
quote_chars = [t.text for t in tokens if t.text in ('“', '”')]
print(quote_chars)  # ['“', '”']

Conversion Rules:

  • Opening quotes (depth increases) → left curly quote “ (U+201C)

  • Closing quotes (depth decreases) → right curly quote ” (U+201D)

  • Backticks follow the same pattern as double quotes

  • Single quotes use standard apostrophe ' (U+0027)
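For straight double quotes, the opening/closing rule can be illustrated with a simple alternation. The sketch below is a simplified stand-in, not the tokenizer's code: curl_double_quotes is a hypothetical function that ignores backticks and nesting and assumes quotes are balanced.

```python
def curl_double_quotes(text: str) -> str:
    """Replace straight double quotes with curly ones by alternation."""
    out = []
    is_open = False
    for ch in text:
        if ch == '"':
            # U+201C opens a quotation, U+201D closes it.
            out.append("\u201d" if is_open else "\u201c")
            is_open = not is_open
        else:
            out.append(ch)
    return "".join(out)


print(curl_double_quotes('She said "hello"'))
```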

Quote Depth in Custom Processing

Access quote depth for custom processing:

from kokorog2p import get_g2p

g2p = get_g2p("en-us")
tokens = g2p('He said "She whispered `quietly`"')

# Analyze quote nesting
for token in tokens:
    if token.quote_depth > 0:
        indent = "  " * (token.quote_depth - 1)
        print(f"{indent}[{token.quote_depth}] {token.text}")

Output shows nesting structure:

[1] "
[1] She
[1] whispered
  [2] `
  [2] quietly
  [2] `
[1] "

Punctuation Handling

Automatic Normalization

kokorog2p automatically normalizes punctuation variants to ensure consistency with Kokoro TTS vocabulary:

from kokorog2p import get_g2p

g2p = get_g2p("en-us")

# Ellipsis variants → single ellipsis character (…)
tokens = g2p("Wait... really?")      # ... → …
tokens = g2p("Wait. . . really?")    # . . . → …
tokens = g2p("Wait.. really?")       # .. → …
tokens = g2p("Wait…really?")         # … preserved

# Dash variants → em dash (—)
tokens = g2p("Wait - what?")         # spaced hyphen → em dash
tokens = g2p("Wait -- what?")        # double hyphen → em dash
tokens = g2p("Wait – what?")         # en dash → em dash
tokens = g2p("Wait — what?")         # em dash preserved
tokens = g2p("Wait ― what?")         # horizontal bar → em dash
tokens = g2p("Wait ‒ what?")         # figure dash → em dash
tokens = g2p("Wait − what?")         # minus sign → em dash

# Hyphens inside compound words are not converted to dashes;
# they are kept during tokenization and dropped from the phoneme output
tokens = g2p("well-known")           # hyphen removed, words joined
tokens = g2p("state-of-the-art")     # hyphens removed, words joined

Normalization Rules:

  • Ellipsis: All variants (..., . . ., .., ....) → … (U+2026)

  • Em dash: All dash types when spaced (-, --, –, —, ―, ‒, −) → — (U+2014)

  • Hyphens in compound words: Preserved during tokenization, then removed in phoneme output

  • Apostrophes: All variants (', ’, ‘, `, ´, etc.) → ' (U+0027)
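The rules above can be approximated with a few regular expressions. This is a rough sketch under stated assumptions, not the library's normalizer: normalize_punct_sketch is a hypothetical name, the dash rule keeps a single space on each side of the em dash, and only three apostrophe variants are covered.

```python
import re

# "..", "...", ". . .", "...." and similar runs of dots
ELLIPSIS_RE = re.compile(r"\.(?: ?\.)+")
# A double hyphen or any single dash character surrounded by whitespace
SPACED_DASH_RE = re.compile(r"\s+(?:--|[-\u2013\u2014\u2015\u2012\u2212])\s+")
# Right/left single quotation marks and the acute accent
APOSTROPHE_RE = re.compile("[\u2019\u2018\u00b4]")


def normalize_punct_sketch(text: str) -> str:
    """Apply ellipsis, spaced-dash, and apostrophe normalization."""
    text = ELLIPSIS_RE.sub("\u2026", text)
    text = SPACED_DASH_RE.sub(" \u2014 ", text)
    return APOSTROPHE_RE.sub("'", text)


for sample in ["Wait... really?", "Wait -- what?", "well-known"]:
    print(normalize_punct_sketch(sample))
```

Hyphens without surrounding whitespace never match the dash pattern, so compound words such as "well-known" pass through unchanged.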

Manual Normalization

Control punctuation normalization manually:

from kokorog2p import normalize_punctuation, filter_punctuation

# Normalize to Kokoro punctuation
text = "Hello... world!!!"
normalized = normalize_punctuation(text)
print(normalized)  # Hello. world!

# Filter out non-Kokoro punctuation
phonemes = "hˈɛlO… wˈɜɹld‼"
filtered = filter_punctuation(phonemes)
print(filtered)  # hˈɛlO. wˈɜɹld!

# Check if punctuation is valid
from kokorog2p import is_kokoro_punctuation
print(is_kokoro_punctuation("!"))   # True
print(is_kokoro_punctuation("…"))   # True (normalized automatically)
print(is_kokoro_punctuation("‼"))   # False

Word Mismatch Detection

Detect mismatches between input text and phoneme output:

from kokorog2p import detect_mismatches

text = "Hello world!"
phonemes = "hɛlO wɜɹld !"

mismatches = detect_mismatches(text, phonemes)

for mismatch in mismatches:
    print(f"Position {mismatch.position}:")
    print(f"  Input word: {mismatch.input_word}")
    print(f"  Output word: {mismatch.output_word}")
    print(f"  Type: {mismatch.type}")

Number Expansion

Customize number handling:

English

from kokorog2p.en.numbers import EnglishNumberConverter

converter = EnglishNumberConverter()

# Cardinals
print(converter.convert_cardinal("42"))
# → forty-two

# Ordinals
print(converter.convert_ordinal("42"))
# → forty-second

# Years
print(converter.convert_year("1984"))
# → nineteen eighty-four

# Currency
print(converter.convert_currency("12.50", "$"))
# → twelve dollars and fifty cents

# Decimals
print(converter.convert_decimal("3.14"))
# → three point one four

German

from kokorog2p.de.numbers import GermanNumberConverter

converter = GermanNumberConverter()

# Cardinals
print(converter.convert_cardinal("42"))
# → zweiundvierzig

# Ordinals
print(converter.convert_ordinal("42"))
# → zweiundvierzigste

# Years
print(converter.convert_year("1984"))
# → neunzehnhundertvierundachtzig

# Currency
print(converter.convert_currency("12,50", "€"))
# → zwölf Euro fünfzig

Custom Backend Selection

Choose specific backends:

from kokorog2p import get_g2p

# Use espeak backend
g2p_espeak = get_g2p("en-us", backend="espeak")

# Use goruut backend (if installed)
g2p_goruut = get_g2p("en-us", backend="goruut")

Direct Backend Access

from kokorog2p.backends.espeak import EspeakBackend

# Create espeak backend
backend = EspeakBackend(language="en-us")

# Phonemize a word
phonemes = backend.phonemize("hello")
print(phonemes)

Caching and Performance

Managing Cache

from kokorog2p import get_g2p, clear_cache

# G2P instances are cached by language and settings
g2p1 = get_g2p("en-us", use_spacy=True)
g2p2 = get_g2p("en-us", use_spacy=True)
assert g2p1 is g2p2  # Same instance

# Different settings = different cache entry
g2p3 = get_g2p("en-us", use_spacy=False)
assert g2p1 is not g2p3  # Different instance

# load_silver and load_gold also affect caching
g2p4 = get_g2p("en-us", load_silver=False)
assert g2p1 is not g2p4  # Different instance (different silver setting)

g2p5 = get_g2p("en-us", load_gold=False)
assert g2p1 is not g2p5  # Different instance (different gold setting)

# Clear cache when needed
clear_cache()
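The caching behavior shown above is consistent with a dictionary keyed by language and settings. The sketch below illustrates that idea only; FakeG2P, get_cached, and clear are hypothetical stand-ins, and the actual cache inside kokorog2p may be organized differently.

```python
_cache: dict = {}


class FakeG2P:
    """Stand-in for an expensive-to-build G2P object."""

    def __init__(self, language: str, **options):
        self.language = language
        self.options = options


def get_cached(language: str, **options) -> FakeG2P:
    """Return one shared instance per (language, settings) combination."""
    # Sort the options so keyword order does not change the key.
    key = (language, tuple(sorted(options.items())))
    if key not in _cache:
        _cache[key] = FakeG2P(language, **options)
    return _cache[key]


def clear() -> None:
    """Drop all cached instances."""
    _cache.clear()


a = get_cached("en-us", use_spacy=True)
b = get_cached("en-us", use_spacy=True)
c = get_cached("en-us", use_spacy=False)
print(a is b, a is c)
```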

Batch Processing

For best performance when processing many texts:

from kokorog2p import get_g2p

# Create instance once
g2p = get_g2p("en-us")

texts = ["Hello", "World", "This", "Is", "Fast"]

# Process many texts with same instance
all_tokens = []
for text in texts:
    tokens = g2p(text)
    all_tokens.append(tokens)
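When a batch contains many repeated strings, it may also help to convert each distinct text only once. The following sketch assumes nothing about kokorog2p beyond "g2p is callable on a string"; phonemize_unique is a hypothetical helper and fake_g2p is a stand-in used to keep the example self-contained.

```python
def phonemize_unique(g2p, texts):
    """Run g2p once per distinct text and reuse the result for repeats."""
    results = {}
    for text in texts:
        if text not in results:
            results[text] = g2p(text)
    # Preserve the original order, including duplicates.
    return [results[text] for text in texts]


calls = []


def fake_g2p(text):
    """Stand-in converter that records how often it is called."""
    calls.append(text)
    return text.upper()


out = phonemize_unique(fake_g2p, ["Hello", "World", "Hello"])
print(out)
print(len(calls))
```

Repeated texts share the same result object, so copy the result first if you intend to modify it per occurrence.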

Custom Phoneme Filtering

Filter phonemes for specific use cases:

from kokorog2p import get_g2p, validate_for_kokoro, filter_for_kokoro

g2p = get_g2p("en-us")
tokens = g2p("Hello world!")

phoneme_str = " ".join(t.phonemes for t in tokens if t.phonemes)

# Validate for Kokoro
is_valid = validate_for_kokoro(phoneme_str)

# Filter to keep only valid Kokoro phonemes
filtered = filter_for_kokoro(phoneme_str)
print(filtered)

Multilang Preprocessing

Use preprocess_multilang to get language override spans for mixed-language text. This integrates with the span-based phonemization API.

from kokorog2p import phonemize
from kokorog2p.multilang import preprocess_multilang

text = "Hello, mein Freund! Bonjour!"
overrides = preprocess_multilang(
    text,
    default_language="de",
    allowed_languages=["de", "en-us", "fr"],
    confidence_threshold=0.6,
)

result = phonemize(text, language="de", overrides=overrides)

Confidence Tuning

Adjust detection sensitivity based on your use case:

from kokorog2p.multilang import preprocess_multilang

text = "Das Meeting ist wichtig"

conservative = preprocess_multilang(
    text,
    default_language="de",
    allowed_languages=["de", "en-us"],
    confidence_threshold=0.9,
)

aggressive = preprocess_multilang(
    text,
    default_language="de",
    allowed_languages=["de", "en-us"],
    confidence_threshold=0.5,
)
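The effect of the threshold can be pictured as a filter over per-word detection results. The sketch below is illustrative only: the confidence scores are invented, select_overrides is a hypothetical function, and the internals of preprocess_multilang are not described in this guide.

```python
def select_overrides(detections, default_language, threshold):
    """Keep words detected as a non-default language with enough confidence.

    detections: list of (word, language, confidence) tuples.
    """
    return [
        (word, lang)
        for word, lang, confidence in detections
        if lang != default_language and confidence >= threshold
    ]


# Made-up detection results for "Das Meeting ist wichtig".
detections = [
    ("Das", "de", 0.98),
    ("Meeting", "en-us", 0.70),
    ("ist", "de", 0.95),
    ("wichtig", "de", 0.97),
]

conservative_sketch = select_overrides(detections, "de", 0.9)
aggressive_sketch = select_overrides(detections, "de", 0.5)
print(conservative_sketch)
print(aggressive_sketch)
```

With these invented scores, the higher threshold leaves "Meeting" in the default language, while the lower threshold marks it as English.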

Integration with Span API

Combine language detection with other span overrides:

from kokorog2p import phonemize, OverrideSpan
from kokorog2p.multilang import preprocess_multilang

text = "Das Meeting ist wichtig"

# Get language overrides
lang_overrides = preprocess_multilang(
    text,
    default_language="de",
    allowed_languages=["de", "en-us"],
)

# Add custom phoneme override
all_overrides = lang_overrides + [
    OverrideSpan(4, 11, {"ph": "ˈmiːtɪŋ"})  # Custom pronunciation for "Meeting"
]

result = phonemize(text, language="de", overrides=all_overrides)
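Hard-coded offsets such as 4 and 11 are easy to get wrong when the text changes. A small helper can compute them instead; find_span is a hypothetical function, and the sketch assumes the span uses character offsets with an exclusive end, which matches the (4, 11) span used for "Meeting" above.

```python
def find_span(text: str, word: str) -> tuple[int, int]:
    """Return (start, end) offsets of the first occurrence of word."""
    start = text.find(word)
    if start == -1:
        raise ValueError(f"{word!r} not found in text")
    return start, start + len(word)


start, end = find_span("Das Meeting ist wichtig", "Meeting")
print(start, end)  # 4 11
```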

Error Handling

kokorog2p provides robust error handling to help you debug issues, especially in CI/CD environments.

Strict Mode (Default)

By default, kokorog2p uses strict mode (strict=True), which raises clear exceptions when backend initialization or phonemization fails:

from kokorog2p import get_g2p

# Strict mode is the default
g2p = get_g2p("en-us", backend="espeak", strict=True)

try:
    result = g2p.phonemize("test")
except RuntimeError as e:
    # Get detailed error message about what went wrong
    print(f"Error: {e}")
    # Example: "Espeak backend validation failed. Please ensure espeak-ng
    # is properly installed and voice 'en-us' is available."

Benefits of strict mode:

  • Catches configuration issues immediately

  • Provides actionable error messages

  • Prevents silent failures in CI/CD pipelines

  • Recommended for production use

Lenient Mode (Backward Compatible)

For backward compatibility with older versions (< 0.4.0) that silently failed, you can use lenient mode (strict=False):

from kokorog2p import get_g2p

# Lenient mode logs errors but doesn't raise exceptions
g2p = get_g2p("en-us", backend="espeak", strict=False)

result = g2p.phonemize("test")
# If espeak fails:
# - Error is logged to Python's logging system
# - Returns empty string "" instead of raising exception
# - Allows your application to continue running

When to use lenient mode:

  • Migrating from older versions (< 0.4.0)

  • Non-critical applications where empty results are acceptable

  • When you have your own error handling logic
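The difference between the two modes comes down to what happens when a backend call fails. The sketch below mirrors the documented behavior (raise in strict mode, log and return an empty string in lenient mode) with stand-in names; run_backend and broken_backend are hypothetical and do not reflect kokorog2p internals.

```python
import logging

logger = logging.getLogger("g2p_sketch")


def run_backend(backend, text: str, strict: bool = True) -> str:
    """Call backend(text); raise in strict mode, log and return "" otherwise."""
    try:
        return backend(text)
    except Exception as exc:
        if strict:
            raise RuntimeError(f"Backend failed for {text!r}: {exc}") from exc
        logger.error("Backend failed for %r: %s", text, exc)
        return ""


def broken_backend(text: str) -> str:
    """Stand-in backend that always fails."""
    raise OSError("espeak-ng not found")


print(repr(run_backend(broken_backend, "test", strict=False)))
```

If you rely on lenient mode, checking for an empty result before passing it on keeps failures from propagating silently.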

Common Error Scenarios

espeak-ng not installed:

from kokorog2p import get_g2p

# Strict mode (default)
g2p = get_g2p("en-us", backend="espeak")
# RuntimeError: Espeak backend validation failed. Please ensure espeak-ng
# is properly installed...

# Solution: Install espeak-ng
# Ubuntu/Debian: sudo apt-get install espeak-ng
# macOS: brew install espeak-ng
# Windows: Download from https://github.com/espeak-ng/espeak-ng/releases

Invalid voice:

from kokorog2p.espeak_g2p import EspeakOnlyG2P

g2p = EspeakOnlyG2P(language="xx-invalid")
# RuntimeError: Espeak backend validation failed...voice 'xx-invalid' is unavailable

CI/CD Best Practices:

import logging

from kokorog2p import get_g2p

# Configure logging to see error details
logging.basicConfig(level=logging.INFO)

# Use strict mode in CI to catch issues early (this is the default)
g2p = get_g2p("en-us", backend="espeak", strict=True)

# Your CI will fail with clear error messages if there are issues

Handling missing dependencies:

from kokorog2p import get_g2p

try:
    # This might fail if Chinese dependencies not installed
    g2p = get_g2p("zh")
    tokens = g2p("你好")
except ImportError as e:
    print(f"Missing dependency: {e}")
    print("Install with: pip install kokorog2p[zh]")

try:
    # This might fail if spaCy model not downloaded
    g2p = get_g2p("en-us", use_spacy=True)
except OSError as e:
    print(f"spaCy model not found: {e}")
    print("Download with: python -m spacy download en_core_web_md")
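Instead of waiting for an ImportError, optional dependencies can be probed up front with the standard library. This sketch uses importlib.util.find_spec, which locates a top-level module without importing it; is_installed is a hypothetical helper, and the module names checked here are examples only.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


for name in ("spacy", "json"):
    status = "available" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

Note that this only confirms the package is importable; a spaCy model such as en_core_web_md still has to be downloaded separately.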

Configuring with Different Backends

The strict parameter works with all backends:

from kokorog2p import get_g2p

# Espeak backend with strict mode
g2p_espeak = get_g2p("en-us", backend="espeak", strict=True)

# Goruut backend with strict mode
g2p_goruut = get_g2p("en-us", backend="goruut", strict=True)

# Dictionary-based with fallback (strict controls fallback/init errors)
g2p_dict = get_g2p(
    "en-us",
    backend="kokorog2p",
    use_espeak_fallback=True,
    strict=True  # Affects fallback initialization and errors
)

Next Steps