Language Support

kokorog2p supports multiple languages with varying levels of functionality.

Language Support Overview

Language        Code           Dictionary        Fallback           Special Features
--------------  -------------  ----------------  -----------------  -----------------------------------
English (US)    en-us          100k+ entries     espeak-ng          POS tagging, stress, numbers
English (GB)    en-gb          100k+ entries     espeak-ng          POS tagging, stress, numbers
German          de             738k+ entries     espeak-ng          Phonological rules, numbers
French          fr             Gold dictionary   espeak-ng          Numbers, liaison rules
Spanish         es             Rule-based        espeak-ng/goruut   Phonological rules, numbers
Italian         it             Rule-based        espeak-ng/goruut   Phonological rules, gemination
Portuguese      pt             Rule-based        -                  Phonological rules, nasalization
Czech           cs             Rule-based        espeak-ng/goruut   Phonological rules
Chinese         zh             -                 pypinyin           Tone sandhi, pinyin
Japanese        ja             -                 pyopenjtalk        Mora-based, pitch accent
Korean          ko             -                 MeCab              Phonological rules, liaison
Hebrew          he             -                 phonikud           Nikud handling, stress
Mixed           multilingual   Auto-detect       lingua-py          17+ languages, word-level detection

English (en-us, en-gb)

English G2P uses a two-tier dictionary system with spaCy for POS tagging.

Features

  • Gold dictionary: 50k+ high-confidence entries

  • Silver dictionary: Additional 50k+ entries

  • POS-aware pronunciation: Different pronunciations based on part of speech

  • Stress assignment: Primary and secondary stress markers

  • Number handling: Cardinals, ordinals, currency

  • Contraction support: Proper handling of “can’t”, “won’t”, etc.
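
The lookup order implied by the two-tier design can be sketched as below. This is only an illustration under stated assumptions, not kokorog2p's actual implementation: the `GOLD` and `SILVER` tables, their entry format, and the `lookup` helper are hypothetical, and the pronunciations are borrowed from the examples in this section.

```python
from typing import Optional

# Hypothetical miniature dictionaries; the real data format may differ.
# An entry is either one pronunciation or a mapping from POS tag to pronunciation.
GOLD = {
    "book": "bˈʊk",
    "read": {"VBD": "ɹˈɛd", "DEFAULT": "ɹˈid"},  # past tense vs. all other tags
}
SILVER = {"tomorrow": "təmˈɑɹO"}


def lookup(word: str, pos: Optional[str] = None) -> Optional[str]:
    """Check the gold tier, then the silver tier; None means out of vocabulary."""
    key = word.lower()
    for tier in (GOLD, SILVER):
        entry = tier.get(key)
        if entry is None:
            continue
        if isinstance(entry, dict):
            # POS-aware entry: use the tag-specific form when one exists.
            return entry.get(pos, entry["DEFAULT"])
        return entry
    return None  # the caller would hand this word to the fallback (e.g. espeak-ng)


print(lookup("read", pos="VBD"))  # → ɹˈɛd
print(lookup("read", pos="VB"))   # → ɹˈid
print(lookup("tomorrow"))         # → təmˈɑɹO
print(lookup("xyzzy"))            # → None
```

Checking the higher-confidence tier first means a silver entry can never shadow a gold one, and a `None` result gives the caller a clear signal to try the fallback.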

Usage

from kokorog2p.en import EnglishG2P

# US English
g2p_us = EnglishG2P(
    language="en-us",
    use_espeak_fallback=True,
    use_spacy=True,
    spacy_model="en_core_web_md",  # default
)

# British English
g2p_gb = EnglishG2P(
    language="en-gb",
    use_espeak_fallback=True,
    use_spacy=True,
    spacy_model="en_core_web_md",  # default
)

# Optional: select a different spaCy English model
g2p_sm = EnglishG2P(language="en-us", use_spacy=True, spacy_model="en_core_web_sm")

Examples

from kokorog2p import phonemize

# Context-dependent pronunciation
print(phonemize("I read a book.", language="en-us"))
# → ˈaɪ ɹˈɛd ə bˈʊk.

print(phonemize("I will read tomorrow.", language="en-us"))
# → ˈaɪ wɪl ɹˈid təmˈɑɹO.

# Numbers and currency
print(phonemize("I paid $1,234.56 for it.", language="en-us"))
# → aɪ pˈeɪd wʌn θˈaʊzənd tˈu hˈʌndɹəd...

German (de)

German G2P uses a large dictionary (738k+ entries from Olaph) with rule-based fallback.

Features

  • Large dictionary: 738k+ entries with stress markers

  • Phonological rules:

    • Final obstruent devoicing (Auslautverhärtung)

    • ich-Laut [ç] vs ach-Laut [x] alternation

    • Word-initial sp/st → [ʃp]/[ʃt]

    • Vowel length rules

    • Schwa in unstressed syllables

  • Number handling: Cardinals, ordinals, years, currency

  • Regional variants: de-de, de-at, de-ch
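
Three of the rules listed above can be sketched as toy functions. The helper names (`devoice_final`, `ch_sound`, `initial_cluster`) are hypothetical and are not part of kokorog2p; the sketch ignores many cases a real system must handle, such as the trigraph "sch" and loanwords.

```python
# Toy versions of three rules; real German G2P needs far more context.
DEVOICE = {"b": "p", "d": "t", "g": "k", "ɡ": "k", "v": "f", "z": "s"}


def devoice_final(phonemes: str) -> str:
    """Final obstruent devoicing: a word-final voiced obstruent becomes voiceless."""
    if phonemes and phonemes[-1] in DEVOICE:
        return phonemes[:-1] + DEVOICE[phonemes[-1]]
    return phonemes


def ch_sound(word: str) -> str:
    """Return 'x' (ach-Laut) after a back vowel, otherwise 'ç' (ich-Laut)."""
    lower = word.lower()
    idx = lower.find("ch")
    if idx == -1:
        raise ValueError(f"no 'ch' in {word!r}")
    before = lower[:idx]
    if before.endswith(("eu", "äu")):
        return "ç"  # these diphthongs end in a front vowel despite the letter 'u'
    if before and before[-1] in "aou":
        return "x"
    return "ç"


def initial_cluster(word: str) -> str:
    """Word-initial sp/st start with [ʃ]; return '' for any other onset."""
    lower = word.lower()
    if lower.startswith(("sp", "st")):
        return "ʃ" + lower[1]
    return ""


print(devoice_final("taːg"))     # → taːk
print(ch_sound("ich"))           # → ç
print(ch_sound("ach"))           # → x
print(initial_cluster("Stein"))  # → ʃt
```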

Usage

from kokorog2p.de import GermanG2P

g2p = GermanG2P(
    language="de-de",
    use_espeak_fallback=True,
    strip_stress=True
)

Examples

from kokorog2p import phonemize

# Basic phonemization
print(str(phonemize("Guten Tag", language="de")))
# → ɡuːtn̩ taːk

# Phonological rules
print(str(phonemize("ich", language="de")))      # → ɪç (ich-Laut)
print(str(phonemize("ach", language="de")))      # → ax (ach-Laut)
print(str(phonemize("Tag", language="de")))      # → taːk (final devoicing)

# Numbers
print(str(phonemize("Ich habe 42 Euro.", language="de")))
# → ɪç haːbə t͡svaɪ̯ʊntfɪɐ̯t͡sɪç ɔɪ̯ʁo.

French (fr)

French G2P uses a gold dictionary with espeak-ng fallback.

Features

  • Gold dictionary: High-quality French pronunciations

  • Number handling: Cardinals, ordinals, currency

  • espeak-ng fallback: For out-of-vocabulary words

Usage

from kokorog2p.fr import FrenchG2P

g2p = FrenchG2P(
    language="fr-fr",
    use_espeak_fallback=True
)

Examples

from kokorog2p import phonemize

print(phonemize("Bonjour le monde", language="fr"))
# → bɔ̃ʒuʁ lə mɔ̃d

print(phonemize("J'ai vingt et un ans.", language="fr"))
# → ʒɛ vɛ̃t e œ̃ ɑ̃.

Czech (cs)

Czech G2P is entirely rule-based with comprehensive phonological rules.

Features

  • Rule-based phonology:

    • Palatalization (d+i → ɟ, t+i → c, n+i → ɲ)

    • Long vowels (á → aː, í → iː, etc.)

    • ř phoneme [r̝]

    • ch digraph → [x]

    • Final devoicing

    • Voicing assimilation

  • No dictionary required: Works with any Czech text
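
A few of these rules fit into a short letter-by-letter converter. The sketch below is a simplified illustration with a hypothetical `toy_czech_g2p` helper, not the library's code; it covers only palatalization before i/í/ě, long vowels, ř, h, and the ch digraph, and omits final devoicing and voicing assimilation.

```python
import unicodedata

LONG_VOWELS = {"á": "aː", "é": "ɛː", "í": "iː", "ó": "oː", "ú": "uː", "ů": "uː", "ý": "iː"}
SHORT_VOWELS = {"e": "ɛ", "ě": "ɛ", "i": "ɪ", "y": "ɪ"}
PALATALIZED = {"d": "ɟ", "t": "c", "n": "ɲ"}
CONSONANTS = {"ř": "r\u031d", "h": "ɦ"}  # r + combining up tack renders as [r̝]


def toy_czech_g2p(word: str) -> str:
    """Map one lowercase Czech word to a rough IPA string, letter by letter."""
    # Normalize so accented letters are single code points, then fold "ch" to x.
    text = unicodedata.normalize("NFC", word.lower()).replace("ch", "x")
    out = []
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in PALATALIZED and nxt in ("i", "í", "ě"):
            out.append(PALATALIZED[ch])  # d/t/n soften before i, í, ě
        elif ch in LONG_VOWELS:
            out.append(LONG_VOWELS[ch])
        elif ch in SHORT_VOWELS:
            out.append(SHORT_VOWELS[ch])
        else:
            out.append(CONSONANTS.get(ch, ch))
    return "".join(out)


print(toy_czech_g2p("děti"))   # → ɟɛcɪ
print(toy_czech_g2p("dobrý"))  # → dobriː
print(toy_czech_g2p("Praha"))  # → praɦa
```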

Usage

from kokorog2p.cs import CzechG2P

g2p = CzechG2P(language="cs-cz")

Examples

from kokorog2p import phonemize

print(phonemize("Dobrý den", language="cs"))
# → dobriː dɛn

print(phonemize("Praha", language="cs"))
# → praɦa

# Palatalization
print(phonemize("děti", language="cs"))
# → ɟɛcɪ

# ř phoneme
print(phonemize("řeka", language="cs"))
# → r̝ɛka

Spanish (es)

Spanish G2P is rule-based with comprehensive phonological rules for both European and Latin American dialects.

Features

  • Rule-based phonology:

    • 5 pure vowels (a, e, i, o, u)

    • Stress prediction (penultimate for vowel-ending, final for consonant-ending)

    • Palatal sounds: ñ [ɲ], ll [ʎ] or [j]

    • Jota: j/g+e/i [x]

    • Theta: z/c+e/i [θ] (European) or [s] (Latin American)

    • Tap vs trill: r [ɾ] vs rr [r]

  • Dialect support: es (European), la (Latin American)

  • Number handling: Cardinals, ordinals, currency
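
The dialect switch for z and c+e/i can be illustrated in isolation. The helper below is hypothetical and rewrites only those letters, leaving the rest of the word as spelled; it reuses the dialect codes "es" and "la" from the list above and is not how kokorog2p necessarily implements the rule.

```python
import re


def apply_sibilant_rule(word: str, dialect: str = "es") -> str:
    """Rewrite z and c+e/i as θ (dialect "es") or s (dialect "la")."""
    if dialect not in ("es", "la"):
        raise ValueError(f"unknown dialect: {dialect!r}")
    sound = "θ" if dialect == "es" else "s"
    lowered = word.lower()
    lowered = re.sub(r"c(?=[eiéí])", sound, lowered)  # soft c only before e/i
    return lowered.replace("z", sound)


print(apply_sibilant_rule("cielo", dialect="es"))   # → θielo
print(apply_sibilant_rule("cielo", dialect="la"))   # → sielo
print(apply_sibilant_rule("zapato", dialect="la"))  # → sapato
print(apply_sibilant_rule("casa", dialect="es"))    # → casa
```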

Usage

from kokorog2p.es import SpanishG2P

g2p = SpanishG2P(
    language="es",
    dialect="es"  # or "la" for Latin American
)

Examples

from kokorog2p import phonemize

print(phonemize("Hola mundo", language="es"))
# → ola mundo

# Phonological features
print(phonemize("año", language="es"))      # → aɲo
print(phonemize("calle", language="es"))    # → kaʎe or kaje
print(phonemize("perro", language="es"))    # → pero (trilled r)

Italian (it)

Italian G2P uses rule-based phonology with predictable stress and gemination handling.

Features

  • Rule-based phonology:

    • 5 pure vowels (a, e, i, o, u) - no reduction

    • Predictable stress (usually penultimate)

    • Gemination (double consonants) preservation

    • Palatals: gn [ɲ], gli [ʎ]

    • Affricates: z [ʦ/ʣ], c/ci [ʧ], g/gi [ʤ]

    • Context-sensitive c/g pronunciation

  • Stress marking: Automatic stress detection from accents

  • Number handling: Cardinals, ordinals
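
Gemination marking for simple cases amounts to collapsing a doubled consonant letter into the consonant plus a length mark. The function below is a hypothetical sketch that works on spelling only; digraph interactions such as "cch" in "gnocchi" need the context-sensitive rules listed above and are not handled here.

```python
import re


def mark_geminates(word: str) -> str:
    """Replace a doubled consonant letter with the consonant followed by ː."""
    return re.sub(r"([bcdfglmnpqrstvz])\1", r"\1ː", word.lower())


print(mark_geminates("anno"))   # → anːo
print(mark_geminates("fatto"))  # → fatːo
print(mark_geminates("mondo"))  # → mondo
```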

Usage

from kokorog2p.it import ItalianG2P

g2p = ItalianG2P(
    language="it-it",
    mark_stress=True,
    mark_gemination=True
)

Examples

from kokorog2p import phonemize

print(phonemize("Ciao mondo", language="it"))
# → ʧao mondo

# Gemination
print(phonemize("anno", language="it"))     # → anːo
print(phonemize("fatto", language="it"))    # → fatːo

# Palatals
print(phonemize("gnocchi", language="it"))  # → ɲɔkːi
print(phonemize("figlio", language="it"))   # → fiʎo

Portuguese (pt)

Portuguese G2P supports Brazilian Portuguese with comprehensive phonological rules.

Features

  • Rule-based phonology:

    • 7 oral vowels (a, e, ɛ, i, o, ɔ, u)

    • 5 nasal vowels (ã, ẽ, ĩ, õ, ũ)

    • Nasal diphthongs

    • Palatalization: lh [ʎ], nh [ɲ], x/ch [ʃ]

    • Affrication: t+i [ʧ], d+i [ʤ] (Brazilian)

    • Sibilants: s [s/z], x [ʃ], z [z]

    • Liquids: word-initial r [ʁ/x/h], rr [ʁ/x], single r between vowels [ɾ]

  • Dialect: Brazilian Portuguese (pt-br)

  • Stress marking: Automatic stress assignment

Usage

from kokorog2p.pt import PortugueseG2P

g2p = PortugueseG2P(
    language="pt-br",
    mark_stress=True,
    affricate_ti_di=True  # Brazilian feature
)

Examples

from kokorog2p import phonemize

print(phonemize("Olá mundo", language="pt"))
# → ola mundo

# Nasal vowels
print(phonemize("mãe", language="pt"))      # → mãj̃
print(phonemize("pão", language="pt"))      # → pãw̃

# Affrication (Brazilian)
print(phonemize("tia", language="pt"))      # → ʧia
print(phonemize("dia", language="pt"))      # → ʤia

Chinese (zh)

Chinese G2P uses jieba for tokenization and pypinyin for phoneme conversion.

Features

  • Jieba tokenization: Chinese word segmentation

  • Pypinyin conversion: Pinyin to IPA

  • Tone sandhi: Automatic tone changes

  • cn2an: Number to Chinese conversion

  • Punctuation mapping: Chinese to Western punctuation
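
Punctuation mapping is the simplest of these steps to show on its own. The table below is a hypothetical subset chosen for illustration; the mapping kokorog2p actually uses may contain different or additional characters.

```python
# Hypothetical subset of a full-width-to-ASCII punctuation table.
PUNCT_MAP = str.maketrans({
    "，": ",", "。": ".", "！": "!", "？": "?",
    "：": ":", "；": ";", "、": ",", "（": "(", "）": ")",
})


def map_punctuation(text: str) -> str:
    """Replace Chinese punctuation marks with their Western counterparts."""
    return text.translate(PUNCT_MAP)


print(map_punctuation("你好，世界！"))  # → 你好,世界!
```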

Usage

from kokorog2p.zh import ChineseG2P

g2p = ChineseG2P(
    language="zh",
    version="1.1"
)

Examples

from kokorog2p import phonemize

print(phonemize("你好世界", language="zh"))
# → nǐ hǎo shì jiè (with tone markers)

Japanese (ja)

Japanese G2P uses pyopenjtalk for text analysis and mora-based phoneme generation.

Features

  • pyopenjtalk: Full Japanese text analysis

  • Mora-based: Phonemes aligned with mora structure

  • Pitch accent: Automatic pitch accent assignment

  • Number handling: Japanese numerals

Usage

from kokorog2p.ja import JapaneseG2P

g2p = JapaneseG2P(
    language="ja",
    version="pyopenjtalk"
)

Examples

from kokorog2p import phonemize

print(phonemize("こんにちは", language="ja"))
# → koɴɲit͡ɕiha

print(phonemize("世界", language="ja"))
# → sekai

Korean (ko)

Korean G2P uses MeCab-based morphological analysis with comprehensive phonological rules.

Features

  • MeCab integration: Korean morphological analysis

  • Phonological rules:

    • Consonant assimilation

    • Palatalization

    • Tensification

    • Aspiration

    • Liaison (연음)

    • Final consonant neutralization

  • Hanja support: Sino-Korean character handling

  • Number handling: Korean numerals

Usage

from kokorog2p.ko import KoreanG2P

g2p = KoreanG2P(
    language="ko-kr",
    use_mecab=True
)

Examples

from kokorog2p import phonemize

print(phonemize("안녕하세요", language="ko"))
# → annjʌŋhasejo

# Phonological rules
print(phonemize("학교", language="ko"))     # → hakk͈jo (tensification)
print(phonemize("받침", language="ko"))     # → patʃʰim (palatalization)

Hebrew (he)

Hebrew G2P uses phonikud for nikud-based phonemization.

Features

  • phonikud integration: Hebrew nikud to IPA conversion

  • Nikud handling: Processes diacritical marks for vowels

  • Stress prediction: Automatic stress assignment

  • Modern Hebrew: Optimized for contemporary pronunciation

Usage

from kokorog2p.he import HebrewG2P

g2p = HebrewG2P(
    language="he-il",
    preserve_punctuation=True,
    preserve_stress=True
)

Examples

from kokorog2p import phonemize

# Requires nikud (diacritical marks)
print(phonemize("שָׁלוֹם", language="he"))
# → ʃalom

print(phonemize("עִבְרִית", language="he"))
# → ivʁit

Mixed-Language Support

kokorog2p can automatically detect and handle texts that mix multiple languages, routing each word to the appropriate G2P engine.

Features

  • Automatic detection: Word-level language detection using lingua-py

  • High accuracy: >90% accuracy for words with 5+ characters

  • Caching: Detection results cached for performance

  • Configurable threshold: Control detection sensitivity

  • Graceful degradation: Falls back to primary language without lingua-py

  • 17+ languages: Support for major world languages

Supported Languages

  • English (en-us, en-gb)

  • German (de)

  • French (fr)

  • Spanish (es)

  • Italian (it)

  • Portuguese (pt)

  • Japanese (ja)

  • Chinese (zh)

  • Korean (ko)

  • Hebrew (he)

  • Czech (cs)

  • Dutch (nl)

  • Polish (pl)

  • Russian (ru)

  • Arabic (ar)

  • Hindi (hi)

  • Turkish (tr)

Usage

from kokorog2p import phonemize
from kokorog2p.multilang import preprocess_multilang

text = "Das Meeting war great!"
overrides = preprocess_multilang(
    text,
    default_language="de",
    allowed_languages=["de", "en-us"],
)

result = phonemize(text, language="de", overrides=overrides, result_type="result")

Examples

German with English:

from kokorog2p import phonemize
from kokorog2p.multilang import preprocess_multilang

text = "Ich gehe zum Meeting. Let's discuss the Roadmap!"
overrides = preprocess_multilang(
    text,
    default_language="de",
    allowed_languages=["de", "en-us"],
)
result = phonemize(text, language="de", overrides=overrides, result_type="result")
print(result.phonemes)

English with German:

from kokorog2p import phonemize
from kokorog2p.multilang import preprocess_multilang

text = "Hello, mein Freund! This is wunderbar."
overrides = preprocess_multilang(
    text,
    default_language="en-us",
    allowed_languages=["en-us", "de"],
)
result = phonemize(text, language="en-us", overrides=overrides, result_type="result")
print(result.phonemes)

Multiple languages:

from kokorog2p import phonemize
from kokorog2p.multilang import preprocess_multilang

text = "Bonjour! The Meeting ist wichtig."
overrides = preprocess_multilang(
    text,
    default_language="fr",
    allowed_languages=["fr", "en-us", "de"],
)
result = phonemize(text, language="fr", overrides=overrides, result_type="result")
print(result.phonemes)

Configuration

Confidence threshold:

from kokorog2p.multilang import preprocess_multilang

# Conservative (higher confidence required)
overrides = preprocess_multilang(
    "Das Meeting ist wichtig",
    default_language="de",
    allowed_languages=["de", "en-us"],
    confidence_threshold=0.9,  # Default: 0.7
)

# Aggressive (lower confidence required)
overrides = preprocess_multilang(
    "Das Meeting ist wichtig",
    default_language="de",
    allowed_languages=["de", "en-us"],
    confidence_threshold=0.5,
)

How It Works

  1. Text is tokenized into words

  2. Each word is sent to the language detector

  3. Detector returns language + confidence score

  4. If confidence ≥ threshold and the language is allowed, an OverrideSpan is created with {"lang": "..."}; otherwise the word keeps the default language

  5. Short words (<3 chars) always keep the default language
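
The steps above can be sketched without any external dependency. Everything in this block is an assumption made for illustration: `Span` stands in for the library's override span, `toy_detector` replaces lingua-py with a small lookup table, and skipping words already detected as the default language is a simplification rather than documented behavior.

```python
import re
from dataclasses import dataclass
from functools import lru_cache
from typing import Callable, List, Sequence, Tuple


@dataclass
class Span:
    """Hypothetical stand-in for an override span: character range plus attributes."""
    start: int
    end: int
    attrs: dict


@lru_cache(maxsize=None)  # cache per-word results, since words repeat often
def toy_detector(word: str) -> Tuple[str, float]:
    """Hypothetical detector returning (language, confidence); not lingua-py."""
    known = {"meeting": ("en-us", 0.95), "great": ("en-us", 0.92), "war": ("en-us", 0.6)}
    return known.get(word.lower(), ("de", 0.8))


def detect_spans(
    text: str,
    default_language: str,
    allowed_languages: Sequence[str],
    confidence_threshold: float = 0.7,
    detector: Callable[[str], Tuple[str, float]] = toy_detector,
    min_length: int = 3,
) -> List[Span]:
    spans = []
    for match in re.finditer(r"\w+", text):  # tokenize into words
        word = match.group()
        if len(word) < min_length:
            continue  # short words keep the default language
        language, confidence = detector(word)
        if (
            language != default_language  # assumption: no span for the default
            and language in allowed_languages
            and confidence >= confidence_threshold
        ):
            spans.append(Span(match.start(), match.end(), {"lang": language}))
    return spans


text = "Das Meeting war great!"
for span in detect_spans(text, "de", ["de", "en-us"]):
    print(text[span.start:span.end], span.attrs)
```

With the toy detector, "war" is scored as English at 0.6, so it is overridden only when `confidence_threshold` is lowered to 0.6 or below; this mirrors the conservative versus aggressive settings shown under Configuration.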

Performance

  • Memory: ~100 MB for lingua models (loaded once)

  • Speed: ~0.1-0.5 ms per word

  • Accuracy: >90% for words with 5+ characters

Limitations

  • Short words (<3 characters) use the default language only

  • Proper nouns may be misdetected

  • Requires lingua-language-detector installation

  • Detection quality varies by word distinctiveness

Installation

pip install "kokorog2p[mixed]"

Language-Specific Number Handling

English

from kokorog2p.en.numbers import expand_number

print(expand_number("I have $42.50"))
# → I have forty-two dollars and fifty cents

German

from kokorog2p.de.numbers import expand_number

print(expand_number("Ich habe 42 Euro."))
# → Ich habe zweiundvierzig Euro.

French

from kokorog2p.fr.numbers import expand_number

print(expand_number("J'ai 42 euros."))
# → J'ai quarante-deux euros.

Fallback Languages

For languages not explicitly supported, kokorog2p falls back to espeak-ng:

from kokorog2p import get_g2p

# Spanish (uses espeak-ng)
g2p_es = get_g2p("es-es")

# Italian (uses espeak-ng)
g2p_it = get_g2p("it-it")

# Portuguese (uses espeak-ng)
g2p_pt = get_g2p("pt-br")

This provides basic support for 100+ languages via espeak-ng.

Next Steps