Language Support
kokorog2p supports multiple languages with varying levels of functionality.
Language |
Code |
Dictionary |
Fallback |
Special Features |
|---|---|---|---|---|
English (US) |
en-us |
100k+ entries |
espeak-ng |
POS tagging, stress, numbers |
English (GB) |
en-gb |
100k+ entries |
espeak-ng |
POS tagging, stress, numbers |
German |
de |
738k+ entries |
espeak-ng |
Phonological rules, numbers |
French |
fr |
Gold dictionary |
espeak-ng |
Numbers, liaison rules |
Spanish |
es |
Rule-based |
espeak-ng/goruut |
Phonological rules, numbers |
Italian |
it |
Rule-based |
espeak-ng/goruut |
Phonological rules, gemination |
Portuguese |
pt |
Rule-based |
— |
Phonological rules, nasalization |
Czech |
cs |
Rule-based |
espeak-ng/goruut |
Phonological rules |
Chinese |
zh |
— |
pypinyin |
Tone sandhi, pinyin |
Japanese |
ja |
— |
pyopenjtalk |
Mora-based, pitch accent |
Korean |
ko |
— |
MeCab |
Phonological rules, liaison |
Hebrew |
he |
— |
phonikud |
Nikud handling, stress |
Mixed |
multilingual |
Auto-detect |
lingua-py |
17+ languages, word-level detection |
English (en-us, en-gb)
English G2P uses a two-tier dictionary system with spaCy for POS tagging.
Features
Gold dictionary: 50k+ high-confidence entries
Silver dictionary: Additional 50k+ entries
POS-aware pronunciation: Different pronunciations based on part of speech
Stress assignment: Primary and secondary stress markers
Number handling: Cardinals, ordinals, currency
Contraction support: Proper handling of “can’t”, “won’t”, etc.
Usage
from kokorog2p.en import EnglishG2P
# US English
g2p_us = EnglishG2P(
language="en-us",
use_espeak_fallback=True,
use_spacy=True,
spacy_model="en_core_web_md", # default
)
# British English
g2p_gb = EnglishG2P(
language="en-gb",
use_espeak_fallback=True,
use_spacy=True,
spacy_model="en_core_web_md", # default
)
# Optional: select a different spaCy English model
g2p_sm = EnglishG2P(language="en-us", use_spacy=True, spacy_model="en_core_web_sm")
Examples
from kokorog2p import phonemize
# Context-dependent pronunciation
print(phonemize("I read a book.", language="en-us"))
# → ˈaɪ ɹˈɛd ə bˈʊk.
print(phonemize("I will read tomorrow.", language="en-us"))
# → ˈaɪ wɪl ɹˈid təmˈɑɹO.
# Numbers and currency
print(phonemize("I paid $1,234.56 for it.", language="en-us"))
# → aɪ pˈeɪd wʌn θˈaʊzənd tˈu hˈʌndɹəd...
German (de)
German G2P uses a large dictionary (738k+ entries from Olaph) with rule-based fallback.
Features
Large dictionary: 738k+ entries with stress markers
Phonological rules:
Final obstruent devoicing (Auslautverhärtung)
ich-Laut [ç] vs ach-Laut [x] alternation
Word-initial sp/st → [ʃp]/[ʃt]
Vowel length rules
Schwa in unstressed syllables
Number handling: Cardinals, ordinals, years, currency
Regional variants: de-de, de-at, de-ch
Usage
from kokorog2p.de import GermanG2P
g2p = GermanG2P(
language="de-de",
use_espeak_fallback=True,
strip_stress=True
)
Examples
from kokorog2p import phonemize
# Basic phonemization
print(str(phonemize("Guten Tag", language="de")))
# → ɡuːtn̩ taːk
# Phonological rules
print(str(phonemize("ich", language="de"))) # → ɪç (ich-Laut)
print(str(phonemize("ach", language="de"))) # → ax (ach-Laut)
print(str(phonemize("Tag", language="de"))) # → taːk (final devoicing)
# Numbers
print(str(phonemize("Ich habe 42 Euro.", language="de")))
# → ɪç haːbə t͡svaɪ̯ʊntfɪɐ̯t͡sɪç ɔɪ̯ʁo.
French (fr)
French G2P uses a gold dictionary with espeak-ng fallback.
Features
Gold dictionary: High-quality French pronunciations
Number handling: Cardinals, ordinals, currency
espeak-ng fallback: For out-of-vocabulary words
Usage
from kokorog2p.fr import FrenchG2P
g2p = FrenchG2P(
language="fr-fr",
use_espeak_fallback=True
)
Examples
from kokorog2p import phonemize
print(phonemize("Bonjour le monde", language="fr"))
# → bɔ̃ʒuʁ lə mɔ̃d
print(phonemize("J'ai vingt et un ans.", language="fr"))
# → ʒɛ vɛ̃t e œ̃ ɑ̃.
Czech (cs)
Czech G2P is entirely rule-based with comprehensive phonological rules.
Features
Rule-based phonology:
Palatalization (d+i → ɟ, t+i → c, n+i → ɲ)
Long vowels (á → aː, í → iː, etc.)
ř phoneme [r̝]
ch digraph → [x]
Final devoicing
Voicing assimilation
No dictionary required: Works with any Czech text
Usage
from kokorog2p.cs import CzechG2P
g2p = CzechG2P(language="cs-cz")
Examples
from kokorog2p import phonemize
print(phonemize("Dobrý den", language="cs"))
# → dobriː dɛn
print(phonemize("Praha", language="cs"))
# → praɦa
# Palatalization
print(phonemize("děti", language="cs"))
# → ɟɛcɪ
# ř phoneme
print(phonemize("řeka", language="cs"))
# → r̝ɛka
Spanish (es)
Spanish G2P is rule-based with comprehensive phonological rules for both European and Latin American dialects.
Features
Rule-based phonology:
5 pure vowels (a, e, i, o, u)
Stress prediction (penultimate for vowel-ending, final for consonant-ending)
Palatal sounds: ñ [ɲ], ll [ʎ] or [j]
Jota: j/g+e/i [x]
Theta: z/c+e/i [θ] (European) or [s] (Latin American)
Tap vs trill: r [ɾ] vs rr [r]
Dialect support: es (European), la (Latin American)
Number handling: Cardinals, ordinals, currency
Usage
from kokorog2p.es import SpanishG2P
g2p = SpanishG2P(
language="es",
dialect="es" # or "la" for Latin American
)
Examples
from kokorog2p import phonemize
print(phonemize("Hola mundo", language="es"))
# → ola mundo
# Phonological features
print(phonemize("año", language="es")) # → aɲo
print(phonemize("calle", language="es")) # → kaʎe or kaje
print(phonemize("perro", language="es")) # → pero (trilled r)
Italian (it)
Italian G2P uses rule-based phonology with predictable stress and gemination handling.
Features
Rule-based phonology:
5 pure vowels (a, e, i, o, u) - no reduction
Predictable stress (usually penultimate)
Gemination (double consonants) preservation
Palatals: gn [ɲ], gli [ʎ]
Affricates: z [ʦ/ʣ], c/ci [ʧ], g/gi [ʤ]
Context-sensitive c/g pronunciation
Stress marking: Automatic stress detection from accents
Number handling: Cardinals, ordinals
Usage
from kokorog2p.it import ItalianG2P
g2p = ItalianG2P(
language="it-it",
mark_stress=True,
mark_gemination=True
)
Examples
from kokorog2p import phonemize
print(phonemize("Ciao mondo", language="it"))
# → ʧao mondo
# Gemination
print(phonemize("anno", language="it")) # → anːo
print(phonemize("fatto", language="it")) # → fatːo
# Palatals
print(phonemize("gnocchi", language="it")) # → ɲɔkːi
print(phonemize("figlio", language="it")) # → fiʎo
Portuguese (pt)
Portuguese G2P supports Brazilian Portuguese with comprehensive phonological rules.
Features
Rule-based phonology:
7 oral vowels (a, e, ɛ, i, o, ɔ, u)
5 nasal vowels (ã, ẽ, ĩ, õ, ũ)
Nasal diphthongs
Palatalization: lh [ʎ], nh [ɲ], x/ch [ʃ]
Affrication: t+i [ʧ], d+i [ʤ] (Brazilian)
Sibilants: s [s/z], x [ʃ], z [z]
Liquids: r [ʁ/x/h], rr [ʁ/x], single r [ɾ]
Dialect: Brazilian Portuguese (pt-br)
Stress marking: Automatic stress assignment
Usage
from kokorog2p.pt import PortugueseG2P
g2p = PortugueseG2P(
language="pt-br",
mark_stress=True,
affricate_ti_di=True # Brazilian feature
)
Examples
from kokorog2p import phonemize
print(phonemize("Olá mundo", language="pt"))
# → ola mundo
# Nasal vowels
print(phonemize("mãe", language="pt")) # → mãj̃
print(phonemize("pão", language="pt")) # → pãw̃
# Affrication (Brazilian)
print(phonemize("tia", language="pt")) # → ʧia
print(phonemize("dia", language="pt")) # → ʤia
Chinese (zh)
Chinese G2P uses jieba for tokenization and pypinyin for phoneme conversion.
Features
Jieba tokenization: Chinese word segmentation
Pypinyin conversion: Pinyin to IPA
Tone sandhi: Automatic tone changes
cn2an: Number to Chinese conversion
Punctuation mapping: Chinese to Western punctuation
Usage
from kokorog2p.zh import ChineseG2P
g2p = ChineseG2P(
language="zh",
version="1.1"
)
Examples
from kokorog2p import phonemize
print(phonemize("你好世界", language="zh"))
# → nǐ hǎo shì jiè (with tone markers)
Japanese (ja)
Japanese G2P uses pyopenjtalk for text analysis and mora-based phoneme generation.
Features
pyopenjtalk: Full Japanese text analysis
Mora-based: Phonemes aligned with mora structure
Pitch accent: Automatic pitch accent assignment
Number handling: Japanese numerals
Usage
from kokorog2p.ja import JapaneseG2P
g2p = JapaneseG2P(
language="ja",
version="pyopenjtalk"
)
Examples
from kokorog2p import phonemize
print(phonemize("こんにちは", language="ja"))
# → koɴɲit͡ɕiha
print(phonemize("世界", language="ja"))
# → sekai
Korean (ko)
Korean G2P uses MeCab-based morphological analysis with comprehensive phonological rules.
Features
MeCab integration: Korean morphological analysis
Phonological rules:
Consonant assimilation
Palatalization
Tensification
Aspiration
Liaison (연음)
Final consonant neutralization
Hanja support: Sino-Korean character handling
Number handling: Korean numerals
Usage
from kokorog2p.ko import KoreanG2P
g2p = KoreanG2P(
language="ko-kr",
use_mecab=True
)
Examples
from kokorog2p import phonemize
print(phonemize("안녕하세요", language="ko"))
# → annjʌŋhasejo
# Phonological rules
print(phonemize("학교", language="ko")) # → hakk͈jo (tensification)
print(phonemize("받침", language="ko")) # → patʃʰim (palatalization)
Hebrew (he)
Hebrew G2P uses phonikud for nikud-based phonemization.
Features
phonikud integration: Hebrew nikud to IPA conversion
Nikud handling: Processes diacritical marks for vowels
Stress prediction: Automatic stress assignment
Modern Hebrew: Optimized for contemporary pronunciation
Usage
from kokorog2p.he import HebrewG2P
g2p = HebrewG2P(
language="he-il",
preserve_punctuation=True,
preserve_stress=True
)
Examples
from kokorog2p import phonemize
# Requires nikud (diacritical marks)
print(phonemize("שָׁלוֹם", language="he"))
# → ʃalom
print(phonemize("עִבְרִית", language="he"))
# → ivʁit
Mixed-Language Support
kokorog2p can automatically detect and handle texts that mix multiple languages, routing each word to the appropriate G2P engine.
Features
Automatic detection: Word-level language detection using lingua-py
High accuracy: >90% accuracy for words with 5+ characters
Caching: Detection results cached for performance
Configurable threshold: Control detection sensitivity
Graceful degradation: Falls back to primary language without lingua-py
17+ languages: Support for major world languages
Supported Languages
English (en-us, en-gb)
German (de)
French (fr)
Spanish (es)
Italian (it)
Portuguese (pt)
Japanese (ja)
Chinese (zh)
Korean (ko)
Hebrew (he)
Czech (cs)
Dutch (nl)
Polish (pl)
Russian (ru)
Arabic (ar)
Hindi (hi)
Turkish (tr)
Usage
from kokorog2p import phonemize
from kokorog2p.multilang import preprocess_multilang
text = "Das Meeting war great!"
overrides = preprocess_multilang(
text,
default_language="de",
allowed_languages=["de", "en-us"],
)
result = phonemize(text, lang="de", overrides=overrides, result_type="result")
Examples
German with English:
from kokorog2p import phonemize
from kokorog2p.multilang import preprocess_multilang
text = "Ich gehe zum Meeting. Let's discuss the Roadmap!"
overrides = preprocess_multilang(
text,
default_language="de",
allowed_languages=["de", "en-us"],
)
result = phonemize(text, lang="de", overrides=overrides, result_type="result")
print(result.phonemes)
English with German:
overrides = preprocess_multilang(
"Hello, mein Freund! This is wunderbar.",
default_language="en-us",
allowed_languages=["en-us", "de"],
)
result = phonemize(
"Hello, mein Freund! This is wunderbar.",
language="en-us",
overrides=overrides)
)
print(result.phonemes)
Multiple languages:
overrides = preprocess_multilang(
"Bonjour! The Meeting ist wichtig.",
default_language="fr",
allowed_languages=["fr", "en-us", "de"],
)
result = phonemize(
"Bonjour! The Meeting ist wichtig.",
language="fr",
overrides=overrides,
)
print(result.phonemes)
Configuration
Confidence threshold:
from kokorog2p.multilang import preprocess_multilang
# Conservative (higher confidence required)
overrides = preprocess_multilang(
"Das Meeting ist wichtig",
default_language="de",
allowed_languages=["de", "en-us"],
confidence_threshold=0.9, # Default: 0.7
)
# Aggressive (lower confidence required)
overrides = preprocess_multilang(
"Das Meeting ist wichtig",
default_language="de",
allowed_languages=["de", "en-us"],
confidence_threshold=0.5,
)
How It Works
Text is tokenized into words
Each word is sent to the language detector
Detector returns language + confidence score
If confidence ≥ threshold and language is allowed:
An
OverrideSpanis created with{"lang": "..."}Short words (<3 chars) keep the default language
Performance
Memory: ~100 MB for lingua models (loaded once)
Speed: ~0.1-0.5 ms per word
Accuracy: >90% for words with 5+ characters
Limitations
Short words (<3 characters) use the default language only
Proper nouns may be misdetected
Requires
lingua-language-detectorinstallationDetection quality varies by word distinctiveness
Installation
pip install kokorog2p[mixed]
Language-Specific Number Handling
English
from kokorog2p.en.numbers import expand_number
print(expand_number("I have $42.50"))
# → I have forty-two dollars and fifty cents
German
from kokorog2p.de.numbers import expand_number
print(expand_number("Ich habe 42 Euro."))
# → Ich habe zweiundvierzig Euro.
French
from kokorog2p.fr.numbers import expand_number
print(expand_number("J'ai 42 euros."))
# → J'ai quarante-deux euros.
Fallback Languages
For languages not explicitly supported, kokorog2p falls back to espeak-ng:
from kokorog2p import get_g2p
# Spanish (uses espeak-ng)
g2p_es = get_g2p("es-es")
# Italian (uses espeak-ng)
g2p_it = get_g2p("it-it")
# Portuguese (uses espeak-ng)
g2p_pt = get_g2p("pt-br")
This provides basic support for 100+ languages via espeak-ng.
Next Steps
See Advanced Usage for advanced usage patterns
Check language-specific API docs: