Advanced Usage
This guide covers advanced features and usage patterns for kokorog2p.
Custom G2P Configuration
Memory-Efficient Loading
Control dictionary loading to optimize memory and initialization time:
from kokorog2p import get_g2p
# Default: Gold + Silver dictionaries (~365k entries, ~57 MB)
# Provides maximum vocabulary coverage
g2p = get_g2p("en-us")
# Memory-optimized: Gold dictionary only (~179k entries, ~35 MB)
# Saves ~22-31 MB memory and ~400-470 ms initialization time
g2p_fast = get_g2p("en-us", load_silver=False)
# Ultra-fast initialization: No dictionaries (~7 MB, espeak fallback only)
# Saves ~50+ MB memory, fastest initialization
g2p_minimal = get_g2p("en-us", load_silver=False, load_gold=False)
# Check dictionary size
print(f"Gold entries: {len(g2p.lexicon.golds):,}")
print(f"Silver entries: {len(g2p.lexicon.silvers):,}")
Dictionary loading configurations:
load_gold=True, load_silver=True: Maximum coverage (default, ~365k entries)
load_gold=True, load_silver=False: Common words only (~179k entries, saves ~22-31 MB)
load_gold=False, load_silver=True: Extended vocabulary only (unusual, ~187k entries)
load_gold=False, load_silver=False: Ultra-fast (espeak only, saves ~50+ MB)
When to disable dictionaries:
Disable silver (load_silver=False):
* Resource-constrained environments (limited memory)
* Real-time applications (faster initialization)
* You only need common vocabulary
* Production deployments where performance is critical
Disable both (load_gold=False, load_silver=False):
* Ultra-fast initialization is critical
* You’re fine with espeak-only fallback
* Minimal memory footprint required
* Testing or prototyping
Default (both enabled) provides:
Maximum vocabulary coverage (~365k total entries)
Best phoneme quality from curated dictionaries
Backward compatibility with existing code
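The trade-offs above can be folded into a small helper that picks the loading flags from a memory budget. This is only a sketch: choose_dictionary_flags is a hypothetical name, and its thresholds are the approximate footprints quoted in this guide (~57 MB, ~35 MB), not measurements.

```python
def choose_dictionary_flags(memory_budget_mb: float) -> dict:
    """Pick load_gold/load_silver from a rough memory budget (hypothetical helper).

    Thresholds reuse the approximate sizes quoted in this guide.
    """
    if memory_budget_mb >= 57:   # room for gold + silver
        return {"load_gold": True, "load_silver": True}
    if memory_budget_mb >= 35:   # room for gold only
        return {"load_gold": True, "load_silver": False}
    # Below that, rely on the espeak fallback alone
    return {"load_gold": False, "load_silver": False}


flags = choose_dictionary_flags(40)
print(flags)  # {'load_gold': True, 'load_silver': False}
```

The returned dictionary is meant to be unpacked into the call shown earlier, for example get_g2p("en-us", **flags).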
Disabling Features
You can disable specific features for better performance or control:
from kokorog2p.en import EnglishG2P
# Disable espeak fallback
g2p = EnglishG2P(
    language="en-us",
    use_espeak_fallback=False,  # Unknown words will have no phonemes
    use_spacy=True,
    spacy_model="en_core_web_md",  # default
)
# Disable spaCy (faster but no POS tagging)
g2p = EnglishG2P(
    language="en-us",
    use_espeak_fallback=True,
    use_spacy=False,  # Faster tokenization
)
# Minimal configuration (fastest)
g2p = EnglishG2P(
    language="en-us",
    use_espeak_fallback=False,
    use_spacy=False,
    load_silver=False,
    load_gold=False,  # No dictionaries, ultra-fast
)
spaCy Model Selection (English)
English G2P lets you choose which spaCy model to use for POS tagging. This affects
homograph and heteronym disambiguation quality (for example, lives noun vs verb).
from kokorog2p.en import EnglishG2P
# Default (recommended balance)
g2p_md = EnglishG2P(use_spacy=True, spacy_model="en_core_web_md")
# Smaller model (lower memory / faster downloads)
g2p_sm = EnglishG2P(use_spacy=True, spacy_model="en_core_web_sm")
# Largest model (highest spaCy English accuracy, highest memory)
g2p_lg = EnglishG2P(use_spacy=True, spacy_model="en_core_web_lg")
The same option is also available through get_g2p():
from kokorog2p import get_g2p
g2p = get_g2p("en-us", use_spacy=True, spacy_model="en_core_web_md")
Stress Control
Control stress marker output:
from kokorog2p.de import GermanG2P
# Strip stress markers from output
g2p = GermanG2P(
    language="de-de",
    strip_stress=True,  # Remove ˈ and ˌ markers
)
Token Inspection
Tokens contain detailed information:
from kokorog2p import get_g2p
g2p = get_g2p("en-us", use_spacy=True)
tokens = g2p("I can't believe it!")
for token in tokens:
    # Basic attributes
    print(f"Text: {token.text}")
    print(f"Phonemes: {token.phonemes}")
    print(f"POS tag: {token.tag}")
    print(f"Whitespace: '{token.whitespace}'")
    # Additional metadata
    rating = token.get("rating")  # 5=dictionary, 2=espeak, 0=unknown
    print(f"Rating: {rating}")
    # Check token type
    is_punct = not any(c.isalnum() for c in token.text)
    print(f"Is punctuation: {is_punct}")
Rating System
Tokens have a rating indicating the source of phonemes:
5: User-provided (via OverrideSpan) or gold dictionary (highest quality)
4: Punctuation
3: Silver dictionary or rule-based conversion
2: From espeak-ng fallback
1: From goruut backend
0: Unknown/failed
from kokorog2p import get_g2p
g2p = get_g2p("en-us")
tokens = g2p("Hello xyznotaword!")
for token in tokens:
    rating = token.get("rating", 0)
    if rating == 5:
        print(f"{token.text}: High quality (gold dictionary)")
    elif rating == 3:
        print(f"{token.text}: Silver dictionary")
    elif rating == 2:
        print(f"{token.text}: Fallback (espeak)")
    elif rating == 0:
        print(f"{token.text}: Unknown")
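To get an overview of where a whole utterance's phonemes came from, the ratings can be tallied by source. The sketch below is not part of kokorog2p: summarize_ratings is a hypothetical helper, the labels are copied from the rating list above, and the sample ratings are stand-ins for real tokens.

```python
from collections import Counter

# Rating-to-source labels, copied from the rating list in this guide
RATING_SOURCES = {
    5: "gold dictionary or user override",
    4: "punctuation",
    3: "silver dictionary or rule-based",
    2: "espeak-ng fallback",
    1: "goruut backend",
    0: "unknown/failed",
}


def summarize_ratings(ratings):
    """Count how many tokens came from each source (hypothetical helper)."""
    counts = Counter(RATING_SOURCES.get(r, "unrecognized rating") for r in ratings)
    return dict(counts)


# Stand-in ratings; with real tokens use [t.get("rating", 0) for t in tokens]
sample = [5, 2, 4, 5]
print(summarize_ratings(sample))
```

A high share of fallback or unknown ratings is a hint that the text contains vocabulary outside the loaded dictionaries.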
Dictionary Lookup
Direct dictionary access:
from kokorog2p.en import EnglishG2P
# Load with or without silver dataset
g2p_gold = EnglishG2P(language="en-us", load_silver=False)
g2p_full = EnglishG2P(language="en-us", load_silver=True)
# Simple lookup
phonemes = g2p_gold.lexicon.lookup("hello")
print(phonemes) # həlˈO
# Check if word is in dictionary
if g2p_gold.lexicon.is_known("hello"):
    print("Word is in gold dictionary")
# Get dictionary sizes
print(f"Gold: {len(g2p_gold.lexicon.golds):,} entries")
print(f"Silver: {len(g2p_full.lexicon.silvers):,} entries")
# POS-aware lookup
phonemes_verb = g2p_gold.lexicon.lookup("read", tag="VB") # ɹˈid (present)
phonemes_past = g2p_gold.lexicon.lookup("read", tag="VBD") # ɹˈɛd (past)
German Lexicon
from kokorog2p.de import GermanLexicon
lexicon = GermanLexicon(strip_stress=False)
phonemes = lexicon.lookup("Haus")
print(phonemes) # haʊ̯s
print(f"Dictionary has {len(lexicon):,} entries") # 738,427
Phoneme Utilities
Validation
Validate phonemes against Kokoro vocabulary:
from kokorog2p import validate_phonemes, get_vocab
# Check if phonemes are valid
valid = validate_phonemes("hˈɛlO")
print(valid) # True
invalid = validate_phonemes("xyz123")
print(invalid) # False
# Get the full vocabulary
vocab = get_vocab("us")
print(f"US vocabulary: {len(vocab)} phonemes")
Conversion
Convert between different phoneme formats:
from kokorog2p import from_espeak, to_espeak
# Convert espeak IPA to Kokoro
espeak_ipa = "həlˈəʊ"
kokoro_phonemes = from_espeak(espeak_ipa, variant="us")
print(kokoro_phonemes) # hˈɛlO
# Convert Kokoro to espeak IPA
kokoro = "hˈɛlO"
espeak = to_espeak(kokoro, variant="us")
print(espeak)
Vocabulary Encoding
Convert phonemes to IDs for model input:
from kokorog2p import phonemes_to_ids, ids_to_phonemes
# Encode phonemes
phonemes = "hˈɛlO wˈɜɹld"
ids = phonemes_to_ids(phonemes)
print(ids) # [12, 45, 23, ...]
# Decode back
decoded = ids_to_phonemes(ids)
print(decoded) # hˈɛlO wˈɜɹld
# Get Kokoro vocabulary
from kokorog2p import get_kokoro_vocab
vocab = get_kokoro_vocab()
print(f"Kokoro has {len(vocab)} tokens")
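The idea behind phoneme-to-ID encoding can be illustrated with a toy vocabulary. This is a minimal sketch, not the Kokoro vocabulary: the symbols, their IDs, and the choice to skip unknown symbols are all assumptions made for illustration, and build_vocab, encode and decode are hypothetical names.

```python
def build_vocab(symbols):
    """Map each symbol to an integer ID, reserving 0 for padding (toy convention)."""
    return {sym: i for i, sym in enumerate(symbols, start=1)}


def encode(phonemes, vocab):
    # One ID per character; symbols missing from the vocabulary are skipped
    return [vocab[ch] for ch in phonemes if ch in vocab]


def decode(ids, vocab):
    # Invert the mapping to turn IDs back into symbols
    inverse = {i: sym for sym, i in vocab.items()}
    return "".join(inverse[i] for i in ids)


toy_vocab = build_vocab(["h", "ˈ", "ɛ", "l", "O", " "])
ids = encode("hˈɛlO", toy_vocab)
print(ids)                     # [1, 2, 3, 4, 5]
print(decode(ids, toy_vocab))  # hˈɛlO
```

The round trip is lossless only when every symbol is in the vocabulary, which is why validating phonemes before encoding is useful.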
Quote Handling
kokorog2p provides sophisticated quote handling with support for nested quotes and automatic conversion to curly quotes.
Nested Quote Detection
The tokenizer supports two modes for handling quotes:
from kokorog2p import get_g2p
# Default: Bracket-matching mode (supports nesting)
g2p = get_g2p("en-us")
tokens = g2p('He said "She used `backticks` here"')
# Check quote depths
for token in tokens:
    depth = token.quote_depth
    print(f"{token.text}: depth={depth}")
# Output shows nesting: "=1, `=2, `=2, "=1
Bracket-Matching Mode (default):
Supports nested quotes when using different quote characters
Maintains a stack to track nesting depth
Supported quote characters: " (double quote), ` (backtick), ' (single quote)
Depth increases with each level of nesting (1 = outermost, 2 = nested once, etc.)
Important: Nesting only works with different quote types:
✅ Supported: "outer `inner` text" → depths [1, 2, 2, 1] (different quotes)
❌ NOT supported: "level1 "level2"" → depths [1, 1, 1, 1] (same quotes alternate)
Examples:
from kokorog2p.pipeline.tokenizer import RegexTokenizer
# Create tokenizer with bracket matching (default)
tokenizer = RegexTokenizer(use_bracket_matching=True)
# Simple pair
tokens = tokenizer.tokenize('"hello"', '"hello"')
# Quote depths: [1, 1]
# Nested quotes (different types)
tokens = tokenizer.tokenize('"outer `inner` text"', '"outer `inner` text"')
# Quote depths: [1, 2, 2, 1]
# Multiple separate pairs
tokens = tokenizer.tokenize('"first" and "second"', '"first" and "second"')
# Quote depths: [1, 1, 1, 1]
# Triple nesting (different types)
tokens = tokenizer.tokenize('"a `b \'c\' d` e"', '"a `b \'c\' d` e"')
# Quote depths: [1, 2, 3, 3, 2, 1]
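The stack-based behavior described above can be reproduced in a few lines of plain Python. This is a sketch of the general technique rather than the RegexTokenizer implementation: quote_depths is a hypothetical function, and it does not special-case apostrophes inside words such as "can't".

```python
QUOTE_CHARS = {'"', "`", "'"}


def quote_depths(text):
    """Return the nesting depth of each quote character found in text.

    A quote that matches the top of the stack closes that level;
    any other quote character opens a new one.
    """
    stack, depths = [], []
    for ch in text:
        if ch not in QUOTE_CHARS:
            continue
        if stack and stack[-1] == ch:
            depths.append(len(stack))  # closing quote reports the level it closes
            stack.pop()
        else:
            stack.append(ch)
            depths.append(len(stack))  # opening quote reports the new level
    return depths


print(quote_depths('"outer `inner` text"'))  # [1, 2, 2, 1]
print(quote_depths('"level1 "level2""'))     # [1, 1, 1, 1]
```

The second call shows why identical quote characters cannot nest: the second " matches the top of the stack, so it is read as a closing quote rather than a new level.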
Simple Alternation Mode:
For simpler use cases without nesting support:
from kokorog2p.pipeline.tokenizer import RegexTokenizer
# Disable bracket matching for simple alternation
tokenizer = RegexTokenizer(use_bracket_matching=False)
# First quote opens (depth 1), second closes (depth 0)
tokens = tokenizer.tokenize('"hello" world', '"hello" world')
# Quote depths: [1, 0, 0]
Curly Quote Conversion
The tokenizer automatically converts straight quotes to curly quotes based on nesting depth:
from kokorog2p import get_g2p
g2p = get_g2p("en-us")
# Straight quotes converted to curly quotes
tokens = g2p('She said "hello"')
# First quote becomes left curly (“), last becomes right curly (”)
quote_chars = [t.text for t in tokens if t.text in ('“', '”')]
print(quote_chars) # ['“', '”']
Conversion Rules:
Opening quotes (depth increases) → left curly quote “ (U+201C)
Closing quotes (depth decreases) → right curly quote ” (U+201D)
Backticks follow the same pattern as double quotes
Single quotes use standard apostrophe ' (U+0027)
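For double quotes, the opening and closing rules amount to alternating between the two curly characters. The sketch below assumes simple, non-nested double quotes and leaves backticks and single quotes untouched; curl_double_quotes is a hypothetical name, not a kokorog2p function.

```python
def curl_double_quotes(text):
    """Replace straight double quotes with curly ones, alternating open/close."""
    out, is_open = [], False
    for ch in text:
        if ch == '"':
            # U+201C opens a quotation, U+201D closes it
            out.append("\u201d" if is_open else "\u201c")
            is_open = not is_open
        else:
            out.append(ch)  # single quotes and everything else pass through
    return "".join(out)


print(curl_double_quotes('She said "hello"'))  # She said “hello”
```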
Quote Depth in Custom Processing
Access quote depth for custom processing:
from kokorog2p import get_g2p
g2p = get_g2p("en-us")
tokens = g2p('He said "She whispered `quietly`"')
# Analyze quote nesting
for token in tokens:
    if token.quote_depth > 0:
        indent = " " * (token.quote_depth - 1)
        print(f"{indent}[{token.quote_depth}] {token.text}")
Output shows nesting structure:
[1] "
[1] She
[1] whispered
[2] `
[2] quietly
[2] `
[1] "
Punctuation Handling
Automatic Normalization
kokorog2p automatically normalizes punctuation variants to ensure consistency with Kokoro TTS vocabulary:
from kokorog2p import get_g2p
g2p = get_g2p("en-us")
# Ellipsis variants → single ellipsis character (…)
tokens = g2p("Wait... really?") # ... → …
tokens = g2p("Wait. . . really?") # . . . → …
tokens = g2p("Wait.. really?") # .. → …
tokens = g2p("Wait…really?") # … preserved
# Dash variants → em dash (—)
tokens = g2p("Wait - what?") # spaced hyphen → em dash
tokens = g2p("Wait -- what?") # double hyphen → em dash
tokens = g2p("Wait – what?") # en dash → em dash
tokens = g2p("Wait — what?") # em dash preserved
tokens = g2p("Wait ― what?") # horizontal bar → em dash
tokens = g2p("Wait ‒ what?") # figure dash → em dash
tokens = g2p("Wait − what?") # minus sign → em dash
# Hyphens inside compound words are not converted to em dashes
tokens = g2p("well-known") # hyphen kept during tokenization, dropped from phoneme output
tokens = g2p("state-of-the-art") # same for each hyphen
Normalization Rules:
Ellipsis: All variants (..., . . ., .., ....) → … (U+2026)
Em dash: All dash types when spaced (-, --, –, —, ―, ‒, −) → — (U+2014)
Hyphens in compound words: Preserved during tokenization, then removed in phoneme output
Apostrophes: All variants (', ’, ‘, `, ´, etc.) → ' (U+0027)
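The ellipsis and spaced-dash rules can be approximated with two regular expressions. This is a rough sketch under stated assumptions, not the library's normalizer: normalize_sketch is a hypothetical name, the whitespace around a dash is kept as-is, and apostrophes are not handled.

```python
import re

# Two or more dots, optionally separated by single spaces: "..", "...", ". . ."
ELLIPSIS_RE = re.compile(r"\.(?:\s?\.)+")
# A dash-like run with whitespace on both sides; "--" is tried before single characters.
# Covers hyphen, en dash, em dash, horizontal bar, figure dash and minus sign.
SPACED_DASH_RE = re.compile(r"(?<=\s)(?:--|[-\u2013\u2014\u2015\u2012\u2212])(?=\s)")


def normalize_sketch(text):
    """Collapse dot runs to U+2026 and spaced dashes to U+2014."""
    text = ELLIPSIS_RE.sub("\u2026", text)
    return SPACED_DASH_RE.sub("\u2014", text)


print(normalize_sketch("Wait. . . really?"))  # Wait… really?
print(normalize_sketch("well-known"))         # well-known
```

Requiring whitespace on both sides of the dash is what keeps compound words such as well-known out of the em dash rule.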
Manual Normalization
Control punctuation normalization manually:
from kokorog2p import normalize_punctuation, filter_punctuation
# Normalize to Kokoro punctuation
text = "Hello... world!!!"
normalized = normalize_punctuation(text)
print(normalized) # Hello. world!
# Filter out non-Kokoro punctuation
phonemes = "hˈɛlO… wˈɜɹld‼"
filtered = filter_punctuation(phonemes)
print(filtered) # hˈɛlO. wˈɜɹld!
# Check if punctuation is valid
from kokorog2p import is_kokoro_punctuation
print(is_kokoro_punctuation("!")) # True
print(is_kokoro_punctuation("…")) # True (normalized automatically)
print(is_kokoro_punctuation("‼")) # False
Word Mismatch Detection
Detect mismatches between input text and phoneme output:
from kokorog2p import detect_mismatches
text = "Hello world!"
phonemes = "hɛlO wɜɹld !"
mismatches = detect_mismatches(text, phonemes)
for mismatch in mismatches:
    print(f"Position {mismatch.position}:")
    print(f"  Input word: {mismatch.input_word}")
    print(f"  Output word: {mismatch.output_word}")
    print(f"  Type: {mismatch.type}")
Number Expansion
Customize number handling:
English
from kokorog2p.en.numbers import EnglishNumberConverter
converter = EnglishNumberConverter()
# Cardinals
print(converter.convert_cardinal("42"))
# → forty-two
# Ordinals
print(converter.convert_ordinal("42"))
# → forty-second
# Years
print(converter.convert_year("1984"))
# → nineteen eighty-four
# Currency
print(converter.convert_currency("12.50", "$"))
# → twelve dollars and fifty cents
# Decimals
print(converter.convert_decimal("3.14"))
# → three point one four
German
from kokorog2p.de.numbers import GermanNumberConverter
converter = GermanNumberConverter()
# Cardinals
print(converter.convert_cardinal("42"))
# → zweiundvierzig
# Ordinals
print(converter.convert_ordinal("42"))
# → zweiundvierzigste
# Years
print(converter.convert_year("1984"))
# → neunzehnhundertvierundachtzig
# Currency
print(converter.convert_currency("12,50", "€"))
# → zwölf Euro fünfzig
Custom Backend Selection
Choose specific backends:
from kokorog2p import get_g2p
# Use espeak backend
g2p_espeak = get_g2p("en-us", backend="espeak")
# Use goruut backend (if installed)
g2p_goruut = get_g2p("en-us", backend="goruut")
Direct Backend Access
from kokorog2p.backends.espeak import EspeakBackend
# Create espeak backend
backend = EspeakBackend(language="en-us")
# Phonemize a word
phonemes = backend.phonemize("hello")
print(phonemes)
Caching and Performance
Managing Cache
from kokorog2p import get_g2p, clear_cache
# G2P instances are cached by language and settings
g2p1 = get_g2p("en-us", use_spacy=True)
g2p2 = get_g2p("en-us", use_spacy=True)
assert g2p1 is g2p2 # Same instance
# Different settings = different cache entry
g2p3 = get_g2p("en-us", use_spacy=False)
assert g2p1 is not g2p3 # Different instance
# load_silver and load_gold also affect caching
g2p4 = get_g2p("en-us", load_silver=False)
assert g2p1 is not g2p4 # Different instance (different silver setting)
g2p5 = get_g2p("en-us", load_gold=False)
assert g2p1 is not g2p5 # Different instance (different gold setting)
# Clear cache when needed
clear_cache()
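The caching behavior shown above is what you would expect from a cache keyed on the language plus the keyword settings. The sketch below illustrates that pattern with stand-in objects; FakeG2P, get_cached and clear are hypothetical names, and the real cache key used by get_g2p() may differ.

```python
_cache = {}


class FakeG2P:
    """Stand-in for a real G2P object; it only stores its settings."""

    def __init__(self, language, **settings):
        self.language = language
        self.settings = settings


def get_cached(language, **settings):
    # Sort the settings so keyword order does not change the key
    key = (language, tuple(sorted(settings.items())))
    if key not in _cache:
        _cache[key] = FakeG2P(language, **settings)
    return _cache[key]


def clear():
    """Drop every cached instance."""
    _cache.clear()


a = get_cached("en-us", use_spacy=True)
b = get_cached("en-us", use_spacy=True)
c = get_cached("en-us", use_spacy=False)
print(a is b, a is c)  # True False
```

After clear(), the next call with the same settings builds a fresh instance, which is useful when you want to release the memory held by loaded dictionaries.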
Batch Processing
For best performance when processing many texts:
from kokorog2p import get_g2p
# Create instance once
g2p = get_g2p("en-us")
texts = ["Hello", "World", "This", "Is", "Fast"]
# Process many texts with same instance
all_tokens = []
for text in texts:
    tokens = g2p(text)
    all_tokens.append(tokens)
Custom Phoneme Filtering
Filter phonemes for specific use cases:
from kokorog2p import get_g2p, validate_for_kokoro, filter_for_kokoro
g2p = get_g2p("en-us")
tokens = g2p("Hello world!")
phoneme_str = " ".join(t.phonemes for t in tokens if t.phonemes)
# Validate for Kokoro
is_valid = validate_for_kokoro(phoneme_str)
# Filter to keep only valid Kokoro phonemes
filtered = filter_for_kokoro(phoneme_str)
print(filtered)
Multilang Preprocessing
Use preprocess_multilang to get language override spans for mixed-language text.
This integrates with the span-based phonemization API.
from kokorog2p import phonemize
from kokorog2p.multilang import preprocess_multilang
text = "Hello, mein Freund! Bonjour!"
overrides = preprocess_multilang(
    text,
    default_language="de",
    allowed_languages=["de", "en-us", "fr"],
    confidence_threshold=0.6,
)
result = phonemize(text, language="de", overrides=overrides)
Confidence Tuning
Adjust detection sensitivity based on your use case:
from kokorog2p.multilang import preprocess_multilang
text = "Das Meeting ist wichtig"
conservative = preprocess_multilang(
    text,
    default_language="de",
    allowed_languages=["de", "en-us"],
    confidence_threshold=0.9,
)
aggressive = preprocess_multilang(
    text,
    default_language="de",
    allowed_languages=["de", "en-us"],
    confidence_threshold=0.5,
)
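The effect of the threshold can be pictured as a filter over per-word detections. The sketch below uses made-up confidence scores and a hypothetical select_overrides function; how preprocess_multilang actually scores words is not described in this guide.

```python
def select_overrides(detections, default_language, threshold):
    """Keep detections that differ from the default language and clear the threshold."""
    return [
        (word, lang)
        for word, lang, confidence in detections
        if lang != default_language and confidence >= threshold
    ]


# Made-up detector output: (word, detected language, confidence)
detections = [
    ("Das", "de", 0.98),
    ("Meeting", "en-us", 0.72),
    ("ist", "de", 0.95),
    ("wichtig", "de", 0.97),
]
print(select_overrides(detections, "de", 0.9))  # []
print(select_overrides(detections, "de", 0.5))  # [('Meeting', 'en-us')]
```

A high threshold leaves borderline words in the default language, while a low one switches languages more readily and risks false positives on short or ambiguous words.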
Integration with Span API
Combine language detection with other span overrides:
from kokorog2p import phonemize, OverrideSpan
from kokorog2p.multilang import preprocess_multilang
text = "Das Meeting ist wichtig"
# Get language overrides
lang_overrides = preprocess_multilang(
    text,
    default_language="de",
    allowed_languages=["de", "en-us"],
)
# Add custom phoneme override
all_overrides = lang_overrides + [
    OverrideSpan(4, 11, {"ph": "ˈmiːtɪŋ"})  # Custom pronunciation for "Meeting"
]
result = phonemize(text, language="de", overrides=all_overrides)
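Hard-coded offsets such as 4 and 11 are easy to get wrong, so it can help to compute them from the text. The helper below is a hypothetical convenience, not part of kokorog2p, and it assumes the two numbers are a start offset and an exclusive end offset in characters, which is consistent with (4, 11) covering "Meeting" in the example above.

```python
def span_of(text, word, start=0):
    """Return (start, end) character offsets of the first occurrence of word.

    end is exclusive, so text[start:end] == word.
    Raises ValueError if word does not occur in text.
    """
    begin = text.index(word, start)
    return begin, begin + len(word)


text = "Das Meeting ist wichtig"
print(span_of(text, "Meeting"))  # (4, 11)
```

The resulting pair can then be passed as the first two arguments when building the override for that word.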
Error Handling
kokorog2p provides robust error handling to help you debug issues, especially in CI/CD environments.
Strict Mode (Default)
By default, kokorog2p uses strict mode (strict=True), which raises clear exceptions when backend initialization or phonemization fails:
from kokorog2p import get_g2p
# Strict mode is the default
g2p = get_g2p("en-us", backend="espeak", strict=True)
try:
    result = g2p.phonemize("test")
except RuntimeError as e:
    # Get detailed error message about what went wrong
    print(f"Error: {e}")
    # Example: "Espeak backend validation failed. Please ensure espeak-ng
    # is properly installed and voice 'en-us' is available."
Benefits of strict mode:
Catches configuration issues immediately
Provides actionable error messages
Prevents silent failures in CI/CD pipelines
Recommended for production use
Lenient Mode (Backward Compatible)
For backward compatibility with older versions (< 0.4.0) that silently failed, you can use lenient mode (strict=False):
from kokorog2p import get_g2p
# Lenient mode logs errors but doesn't raise exceptions
g2p = get_g2p("en-us", backend="espeak", strict=False)
result = g2p.phonemize("test")
# If espeak fails:
# - Error is logged to Python's logging system
# - Returns empty string "" instead of raising exception
# - Allows your application to continue running
When to use lenient mode:
Migrating from older versions (< 0.4.0)
Non-critical applications where empty results are acceptable
When you have your own error handling logic
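The difference between the two modes comes down to what happens when the backend call fails. The sketch below shows that general pattern with a simulated backend; run_backend and broken_backend are hypothetical names, and the real library's internals may be organized differently.

```python
import logging

logger = logging.getLogger("g2p_sketch")


def run_backend(backend, text, strict=True):
    """Call backend(text); raise RuntimeError on failure if strict, else log and return ""."""
    try:
        return backend(text)
    except Exception as exc:
        if strict:
            raise RuntimeError(f"Backend failed: {exc}") from exc
        logger.error("Backend failed: %s", exc)
        return ""


def broken_backend(text):
    raise OSError("espeak-ng not found")  # simulated failure


print(repr(run_backend(broken_backend, "test", strict=False)))  # ''
```

With strict=True the same call raises RuntimeError, so a misconfigured environment fails fast instead of producing empty phoneme strings.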
Common Error Scenarios
espeak-ng not installed:
from kokorog2p import get_g2p
# Strict mode (default)
g2p = get_g2p("en-us", backend="espeak")
# RuntimeError: Espeak backend validation failed. Please ensure espeak-ng
# is properly installed...
# Solution: Install espeak-ng
# Ubuntu/Debian: sudo apt-get install espeak-ng
# macOS: brew install espeak-ng
# Windows: Download from https://github.com/espeak-ng/espeak-ng/releases
Invalid voice:
from kokorog2p.espeak_g2p import EspeakOnlyG2P
g2p = EspeakOnlyG2P(language="xx-invalid")
# RuntimeError: Espeak backend validation failed...voice 'xx-invalid' is unavailable
CI/CD Best Practices:
import logging
from kokorog2p import get_g2p
# Configure logging to see error details
logging.basicConfig(level=logging.INFO)
# Use strict mode in CI to catch issues early (this is the default)
g2p = get_g2p("en-us", backend="espeak", strict=True)
# Your CI will fail with clear error messages if there are issues
Handling missing dependencies:
from kokorog2p import get_g2p
try:
    # This might fail if Chinese dependencies not installed
    g2p = get_g2p("zh")
    tokens = g2p("你好")
except ImportError as e:
    print(f"Missing dependency: {e}")
    print("Install with: pip install kokorog2p[zh]")
try:
    # This might fail if spaCy model not downloaded
    g2p = get_g2p("en-us", use_spacy=True)
except OSError as e:
    print("spaCy model not found")
    print("Download with: python -m spacy download en_core_web_md")
Configuring with Different Backends
The strict parameter works with all backends:
from kokorog2p import get_g2p
# Espeak backend with strict mode
g2p_espeak = get_g2p("en-us", backend="espeak", strict=True)
# Goruut backend with strict mode
g2p_goruut = get_g2p("en-us", backend="goruut", strict=True)
# Dictionary-based with fallback (strict controls fallback/init errors)
g2p_dict = get_g2p(
    "en-us",
    backend="kokorog2p",
    use_espeak_fallback=True,
    strict=True,  # Affects fallback initialization and errors
)
Next Steps
See Core API for detailed API reference
Check Language Support for language-specific features
Read Phoneme Inventory for the full set of supported phonemes