Language Support
================

kokorog2p supports multiple languages with varying levels of functionality.

.. list-table:: Language Support Overview
   :header-rows: 1
   :widths: 15 15 20 20 30

   * - Language
     - Code
     - Dictionary
     - Fallback
     - Special Features
   * - English (US)
     - en-us
     - 100k+ entries
     - espeak-ng
     - POS tagging, stress, numbers
   * - English (GB)
     - en-gb
     - 100k+ entries
     - espeak-ng
     - POS tagging, stress, numbers
   * - German
     - de
     - 738k+ entries
     - espeak-ng
     - Phonological rules, numbers
   * - French
     - fr
     - Gold dictionary
     - espeak-ng
     - Numbers, liaison rules
   * - Spanish
     - es
     - Rule-based
     - espeak-ng/goruut
     - Phonological rules, numbers
   * - Italian
     - it
     - Rule-based
     - espeak-ng/goruut
     - Phonological rules, gemination
   * - Portuguese
     - pt
     - Rule-based
     - —
     - Phonological rules, nasalization
   * - Czech
     - cs
     - Rule-based
     - espeak-ng/goruut
     - Phonological rules
   * - Chinese
     - zh
     - —
     - pypinyin
     - Tone sandhi, pinyin
   * - Japanese
     - ja
     - —
     - pyopenjtalk
     - Mora-based, pitch accent
   * - Korean
     - ko
     - —
     - MeCab
     - Phonological rules, liaison
   * - Hebrew
     - he
     - —
     - phonikud
     - Nikud handling, stress
   * - Mixed
     - multilingual
     - Auto-detect
     - lingua-py
     - 17+ languages, word-level detection

English (en-us, en-gb)
----------------------

English G2P uses a two-tier dictionary system with spaCy for POS tagging.

Features
~~~~~~~~

* **Gold dictionary**: 50k+ high-confidence entries
* **Silver dictionary**: Additional 50k+ entries
* **POS-aware pronunciation**: Different pronunciations based on part of speech
* **Stress assignment**: Primary and secondary stress markers
* **Number handling**: Cardinals, ordinals, currency
* **Contraction support**: Proper handling of "can't", "won't", etc.

Usage
~~~~~

.. code-block:: python

   from kokorog2p.en import EnglishG2P

   # US English
   g2p_us = EnglishG2P(
       language="en-us",
       use_espeak_fallback=True,
       use_spacy=True,
       spacy_model="en_core_web_md",  # default
   )

   # British English
   g2p_gb = EnglishG2P(
       language="en-gb",
       use_espeak_fallback=True,
       use_spacy=True,
       spacy_model="en_core_web_md",  # default
   )

   # Optional: select a different spaCy English model
   g2p_sm = EnglishG2P(language="en-us", use_spacy=True, spacy_model="en_core_web_sm")

Examples
~~~~~~~~

.. code-block:: python

   from kokorog2p import phonemize

   # Context-dependent pronunciation
   print(phonemize("I read a book.", language="en-us"))
   # → ˈaɪ ɹˈɛd ə bˈʊk.

   print(phonemize("I will read tomorrow.", language="en-us"))
   # → ˈaɪ wɪl ɹˈid təmˈɑɹO.

   # Numbers and currency
   print(phonemize("I paid $1,234.56 for it.", language="en-us"))
   # → aɪ pˈeɪd wʌn θˈaʊzənd tˈu hˈʌndɹəd...

German (de)
-----------

German G2P uses a large dictionary (738k+ entries from Olaph) with rule-based fallback.

Features
~~~~~~~~

* **Large dictionary**: 738k+ entries with stress markers
* **Phonological rules**:

  - Final obstruent devoicing (Auslautverhärtung)
  - ich-Laut [ç] vs ach-Laut [x] alternation
  - Word-initial sp/st → [ʃp]/[ʃt]
  - Vowel length rules
  - Schwa in unstressed syllables

* **Number handling**: Cardinals, ordinals, years, currency
* **Regional variants**: de-de, de-at, de-ch

Usage
~~~~~

.. code-block:: python

   from kokorog2p.de import GermanG2P

   g2p = GermanG2P(
       language="de-de",
       use_espeak_fallback=True,
       strip_stress=True
   )

Examples
~~~~~~~~

.. code-block:: python

   from kokorog2p import phonemize

   # Basic phonemization
   print(str(phonemize("Guten Tag", language="de")))  # → ɡuːtn̩ taːk

   # Phonological rules
   print(str(phonemize("ich", language="de")))  # → ɪç (ich-Laut)
   print(str(phonemize("ach", language="de")))  # → ax (ach-Laut)
   print(str(phonemize("Tag", language="de")))  # → taːk (final devoicing)

   # Numbers
   print(str(phonemize("Ich habe 42 Euro.", language="de")))
   # → ɪç haːbə t͡svaɪ̯ʊntfɪɐ̯t͡sɪç ɔɪ̯ʁo.

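
The final obstruent devoicing rule listed among the German phonological rules can be illustrated with a minimal sketch. The helper ``devoice_final`` and its mapping table are hypothetical names written only for this example; they are not part of the kokorog2p API. The sketch covers only a single word-final consonant, whereas a real implementation would presumably also handle syllable-final positions and consonant clusters.

.. code-block:: python

   # Hypothetical illustration only; not kokorog2p code.
   # Voiced obstruents mapped to their voiceless counterparts (IPA symbols).
   # Both the IPA "ɡ" (U+0261) and the ASCII "g" are accepted as input.
   DEVOICE = {"b": "p", "d": "t", "ɡ": "k", "g": "k", "v": "f", "z": "s"}


   def devoice_final(ipa_word: str) -> str:
       """Devoice a word-final voiced obstruent in an IPA string."""
       if ipa_word and ipa_word[-1] in DEVOICE:
           return ipa_word[:-1] + DEVOICE[ipa_word[-1]]
       return ipa_word


   print(devoice_final("taːɡ"))  # → taːk
   print(devoice_final("hʊnd"))  # → hʊnt
   print(devoice_final("haʊs"))  # → haʊs (already voiceless, unchanged)

Applying the rule to the last segment only keeps the sketch short; it is meant to show why "Tag" surfaces with a final [k], not to reproduce the library's rule set.
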
French (fr)
-----------

French G2P uses a gold dictionary with espeak-ng fallback.

Features
~~~~~~~~

* **Gold dictionary**: High-quality French pronunciations
* **Number handling**: Cardinals, ordinals, currency
* **espeak-ng fallback**: For out-of-vocabulary words

Usage
~~~~~

.. code-block:: python

   from kokorog2p.fr import FrenchG2P

   g2p = FrenchG2P(
       language="fr-fr",
       use_espeak_fallback=True
   )

Examples
~~~~~~~~

.. code-block:: python

   from kokorog2p import phonemize

   print(phonemize("Bonjour le monde", language="fr"))  # → bɔ̃ʒuʁ lə mɔ̃d
   print(phonemize("J'ai vingt et un ans.", language="fr"))  # → ʒɛ vɛ̃t e œ̃ ɑ̃.

Czech (cs)
----------

Czech G2P is entirely rule-based with comprehensive phonological rules.

Features
~~~~~~~~

* **Rule-based phonology**:

  - Palatalization (d+i → ɟ, t+i → c, n+i → ɲ)
  - Long vowels (á → aː, í → iː, etc.)
  - ř phoneme [r̝]
  - ch digraph → [x]
  - Final devoicing
  - Voicing assimilation

* **No dictionary required**: Works with any Czech text

Usage
~~~~~

.. code-block:: python

   from kokorog2p.cs import CzechG2P

   g2p = CzechG2P(language="cs-cz")

Examples
~~~~~~~~

.. code-block:: python

   from kokorog2p import phonemize

   print(phonemize("Dobrý den", language="cs"))  # → dobriː dɛn
   print(phonemize("Praha", language="cs"))  # → praɦa

   # Palatalization
   print(phonemize("děti", language="cs"))  # → ɟɛcɪ

   # ř phoneme
   print(phonemize("řeka", language="cs"))  # → r̝ɛka

Spanish (es)
------------

Spanish G2P is rule-based with comprehensive phonological rules for both European and Latin American dialects.

Features
~~~~~~~~

* **Rule-based phonology**:

  - 5 pure vowels (a, e, i, o, u)
  - Stress prediction (penultimate for vowel-ending, final for consonant-ending)
  - Palatal sounds: ñ [ɲ], ll [ʎ] or [j]
  - Jota: j/g+e/i [x]
  - Theta: z/c+e/i [θ] (European) or [s] (Latin American)
  - Tap vs trill: r [ɾ] vs rr [r]

* **Dialect support**: es (European), la (Latin American)
* **Number handling**: Cardinals, ordinals, currency

Usage
~~~~~

.. code-block:: python

   from kokorog2p.es import SpanishG2P

   g2p = SpanishG2P(
       language="es",
       dialect="es"  # or "la" for Latin American
   )

Examples
~~~~~~~~

.. code-block:: python

   from kokorog2p import phonemize

   print(phonemize("Hola mundo", language="es"))  # → ola mundo

   # Phonological features
   print(phonemize("año", language="es"))  # → aɲo
   print(phonemize("calle", language="es"))  # → kaʎe or kaje
   print(phonemize("perro", language="es"))  # → pero (trilled r)

Italian (it)
------------

Italian G2P uses rule-based phonology with predictable stress and gemination handling.

Features
~~~~~~~~

* **Rule-based phonology**:

  - 5 pure vowels (a, e, i, o, u) - no reduction
  - Predictable stress (usually penultimate)
  - Gemination (double consonants) preservation
  - Palatals: gn [ɲ], gli [ʎ]
  - Affricates: z [ʦ/ʣ], c/ci [ʧ], g/gi [ʤ]
  - Context-sensitive c/g pronunciation

* **Stress marking**: Automatic stress detection from accents
* **Number handling**: Cardinals, ordinals

Usage
~~~~~

.. code-block:: python

   from kokorog2p.it import ItalianG2P

   g2p = ItalianG2P(
       language="it-it",
       mark_stress=True,
       mark_gemination=True
   )

Examples
~~~~~~~~

.. code-block:: python

   from kokorog2p import phonemize

   print(phonemize("Ciao mondo", language="it"))  # → ʧao mondo

   # Gemination
   print(phonemize("anno", language="it"))  # → anːo
   print(phonemize("fatto", language="it"))  # → fatːo

   # Palatals
   print(phonemize("gnocchi", language="it"))  # → ɲɔkːi
   print(phonemize("figlio", language="it"))  # → fiʎo

Portuguese (pt)
---------------

Portuguese G2P supports Brazilian Portuguese with comprehensive phonological rules.

Features
~~~~~~~~

* **Rule-based phonology**:

  - 7 oral vowels (a, e, ɛ, i, o, ɔ, u)
  - 5 nasal vowels (ã, ẽ, ĩ, õ, ũ)
  - Nasal diphthongs
  - Palatalization: lh [ʎ], nh [ɲ], x/ch [ʃ]
  - Affrication: t+i [ʧ], d+i [ʤ] (Brazilian)
  - Sibilants: s [s/z], x [ʃ], z [z]
  - Liquids: r [ʁ/x/h], rr [ʁ/x], single r [ɾ]

* **Dialect**: Brazilian Portuguese (pt-br)
* **Stress marking**: Automatic stress assignment

Usage
~~~~~

.. code-block:: python

   from kokorog2p.pt import PortugueseG2P

   g2p = PortugueseG2P(
       language="pt-br",
       mark_stress=True,
       affricate_ti_di=True  # Brazilian feature
   )

Examples
~~~~~~~~

.. code-block:: python

   from kokorog2p import phonemize

   print(phonemize("Olá mundo", language="pt"))  # → ola mundo

   # Nasal vowels
   print(phonemize("mãe", language="pt"))  # → mãj̃
   print(phonemize("pão", language="pt"))  # → pãw̃

   # Affrication (Brazilian)
   print(phonemize("tia", language="pt"))  # → ʧia
   print(phonemize("dia", language="pt"))  # → ʤia

Chinese (zh)
------------

Chinese G2P uses jieba for tokenization and pypinyin for phoneme conversion.

Features
~~~~~~~~

* **Jieba tokenization**: Chinese word segmentation
* **Pypinyin conversion**: Pinyin to IPA
* **Tone sandhi**: Automatic tone changes
* **cn2an**: Number to Chinese conversion
* **Punctuation mapping**: Chinese to Western punctuation

Usage
~~~~~

.. code-block:: python

   from kokorog2p.zh import ChineseG2P

   g2p = ChineseG2P(
       language="zh",
       version="1.1"
   )

Examples
~~~~~~~~

.. code-block:: python

   from kokorog2p import phonemize

   print(phonemize("你好世界", language="zh"))
   # → nǐ hǎo shì jiè (with tone markers)

Japanese (ja)
-------------

Japanese G2P uses pyopenjtalk for text analysis and mora-based phoneme generation.

Features
~~~~~~~~

* **pyopenjtalk**: Full Japanese text analysis
* **Mora-based**: Phonemes aligned with mora structure
* **Pitch accent**: Automatic pitch accent assignment
* **Number handling**: Japanese numerals

Usage
~~~~~

.. code-block:: python

   from kokorog2p.ja import JapaneseG2P

   g2p = JapaneseG2P(
       language="ja",
       version="pyopenjtalk"
   )

Examples
~~~~~~~~

.. code-block:: python

   from kokorog2p import phonemize

   print(phonemize("こんにちは", language="ja"))  # → koɴɲit͡ɕiha
   print(phonemize("世界", language="ja"))  # → sekai

Korean (ko)
-----------

Korean G2P uses MeCab-based morphological analysis with comprehensive phonological rules.

Features
~~~~~~~~

* **MeCab integration**: Korean morphological analysis
* **Phonological rules**:

  - Consonant assimilation
  - Palatalization
  - Tensification
  - Aspiration
  - Liaison (연음)
  - Final consonant neutralization

* **Hanja support**: Sino-Korean character handling
* **Number handling**: Korean numerals

Usage
~~~~~

.. code-block:: python

   from kokorog2p.ko import KoreanG2P

   g2p = KoreanG2P(
       language="ko-kr",
       use_mecab=True
   )

Examples
~~~~~~~~

.. code-block:: python

   from kokorog2p import phonemize

   print(phonemize("안녕하세요", language="ko"))  # → annjʌŋhasejo

   # Phonological rules
   print(phonemize("학교", language="ko"))  # → hakk͈jo (tensification)
   print(phonemize("받침", language="ko"))  # → patʃʰim (palatalization)

Hebrew (he)
-----------

Hebrew G2P uses phonikud for nikud-based phonemization.

Features
~~~~~~~~

* **phonikud integration**: Hebrew nikud to IPA conversion
* **Nikud handling**: Processes diacritical marks for vowels
* **Stress prediction**: Automatic stress assignment
* **Modern Hebrew**: Optimized for contemporary pronunciation

Usage
~~~~~

.. code-block:: python

   from kokorog2p.he import HebrewG2P

   g2p = HebrewG2P(
       language="he-il",
       preserve_punctuation=True,
       preserve_stress=True
   )

Examples
~~~~~~~~

.. code-block:: python

   from kokorog2p import phonemize

   # Requires nikud (diacritical marks)
   print(phonemize("שָׁלוֹם", language="he"))  # → ʃalom
   print(phonemize("עִבְרִית", language="he"))  # → ivʁit

Mixed-Language Support
----------------------

kokorog2p can automatically detect and handle texts that mix multiple languages, routing each word to the appropriate G2P engine.

Features
~~~~~~~~

* **Automatic detection**: Word-level language detection using lingua-py
* **High accuracy**: >90% accuracy for words with 5+ characters
* **Caching**: Detection results cached for performance
* **Configurable threshold**: Control detection sensitivity
* **Graceful degradation**: Falls back to primary language without lingua-py
* **17+ languages**: Support for major world languages

Supported Languages
~~~~~~~~~~~~~~~~~~~

* English (en-us, en-gb)
* German (de)
* French (fr)
* Spanish (es)
* Italian (it)
* Portuguese (pt)
* Japanese (ja)
* Chinese (zh)
* Korean (ko)
* Hebrew (he)
* Czech (cs)
* Dutch (nl)
* Polish (pl)
* Russian (ru)
* Arabic (ar)
* Hindi (hi)
* Turkish (tr)

Usage
~~~~~

.. code-block:: python

   from kokorog2p import phonemize
   from kokorog2p.multilang import preprocess_multilang

   text = "Das Meeting war great!"
   overrides = preprocess_multilang(
       text,
       default_language="de",
       allowed_languages=["de", "en-us"],
   )
   result = phonemize(text, lang="de", overrides=overrides, result_type="result")

Examples
~~~~~~~~

**German with English:**

.. code-block:: python

   from kokorog2p import phonemize
   from kokorog2p.multilang import preprocess_multilang

   text = "Ich gehe zum Meeting. Let's discuss the Roadmap!"
   overrides = preprocess_multilang(
       text,
       default_language="de",
       allowed_languages=["de", "en-us"],
   )
   result = phonemize(text, lang="de", overrides=overrides, result_type="result")
   print(result.phonemes)

**English with German:**

.. code-block:: python

   from kokorog2p import phonemize
   from kokorog2p.multilang import preprocess_multilang

   text = "Hello, mein Freund! This is wunderbar."
   overrides = preprocess_multilang(
       text,
       default_language="en-us",
       allowed_languages=["en-us", "de"],
   )
   result = phonemize(
       text,
       language="en-us",
       overrides=overrides,
   )
   print(result.phonemes)

**Multiple languages:**

.. code-block:: python

   from kokorog2p import phonemize
   from kokorog2p.multilang import preprocess_multilang

   text = "Bonjour! The Meeting ist wichtig."
   overrides = preprocess_multilang(
       text,
       default_language="fr",
       allowed_languages=["fr", "en-us", "de"],
   )
   result = phonemize(
       text,
       language="fr",
       overrides=overrides,
   )
   print(result.phonemes)

Configuration
~~~~~~~~~~~~~

**Confidence threshold:**

.. code-block:: python

   from kokorog2p.multilang import preprocess_multilang

   # Conservative (higher confidence required)
   overrides = preprocess_multilang(
       "Das Meeting ist wichtig",
       default_language="de",
       allowed_languages=["de", "en-us"],
       confidence_threshold=0.9,  # Default: 0.7
   )

   # Aggressive (lower confidence required)
   overrides = preprocess_multilang(
       "Das Meeting ist wichtig",
       default_language="de",
       allowed_languages=["de", "en-us"],
       confidence_threshold=0.5,
   )

How It Works
~~~~~~~~~~~~

1. The text is tokenized into words.
2. Each word is sent to the language detector.
3. The detector returns a language and a confidence score.
4. If the confidence is at or above the threshold and the language is allowed,
   an ``OverrideSpan`` is created with ``{"lang": "..."}``.
5. Short words (<3 chars) keep the default language.

Performance
~~~~~~~~~~~

* **Memory**: ~100 MB for lingua models (loaded once)
* **Speed**: ~0.1-0.5 ms per word
* **Accuracy**: >90% for words with 5+ characters

Limitations
~~~~~~~~~~~

* Short words (<3 characters) use the default language only
* Proper nouns may be misdetected
* Requires ``lingua-language-detector`` installation
* Detection quality varies by word distinctiveness

Installation
~~~~~~~~~~~~

.. code-block:: bash

   pip install kokorog2p[mixed]

Language-Specific Number Handling
----------------------------------

English
~~~~~~~

.. code-block:: python

   from kokorog2p.en.numbers import expand_number

   print(expand_number("I have $42.50"))
   # → I have forty-two dollars and fifty cents

German
~~~~~~

.. code-block:: python

   from kokorog2p.de.numbers import expand_number

   print(expand_number("Ich habe 42 Euro."))
   # → Ich habe zweiundvierzig Euro.

French
~~~~~~

.. code-block:: python

   from kokorog2p.fr.numbers import expand_number

   print(expand_number("J'ai 42 euros."))
   # → J'ai quarante-deux euros.

Fallback Languages
------------------

For languages not explicitly supported, kokorog2p falls back to espeak-ng:

.. code-block:: python

   from kokorog2p import get_g2p

   # Spanish (uses espeak-ng)
   g2p_es = get_g2p("es-es")

   # Italian (uses espeak-ng)
   g2p_it = get_g2p("it-it")

   # Portuguese (uses espeak-ng)
   g2p_pt = get_g2p("pt-br")

This provides basic support for 100+ languages via espeak-ng.

Next Steps
----------

* See :doc:`advanced` for advanced usage patterns
* Check language-specific API docs:

  - :doc:`api/english`
  - :doc:`api/german`
  - :doc:`api/french`
  - :doc:`api/czech`
  - :doc:`api/spanish`
  - :doc:`api/italian`
  - :doc:`api/portuguese`
  - :doc:`api/chinese`
  - :doc:`api/japanese`
  - :doc:`api/korean`
  - :doc:`api/hebrew`
  - :doc:`api/mixed`