Czech API
Czech G2P provides rule-based phoneme conversion with comprehensive phonological rules.
Main Class
- class kokorog2p.cs.CzechG2P(language: str = 'cs-cz', use_espeak_fallback: bool = False, use_goruut_fallback: bool = False, unk: str = '?', load_silver: bool = True, load_gold: bool = True, version: str = '1.0', expand_abbreviations: bool = True, enable_context_detection: bool = True, **kwargs: Any)
Bases: G2PBase
Czech G2P converter using rule-based phoneme conversion with fallback options.
This class provides grapheme-to-phoneme conversion for Czech text using phonological rules for voicing assimilation, palatalization, and other Czech-specific features, with optional fallback to espeak or goruut.
- Example:
>>> g2p = CzechG2P()
>>> tokens = g2p("Dobrý den")
>>> for token in tokens:
...     print(f"{token.text} -> {token.phonemes}")
- __init__(language: str = 'cs-cz', use_espeak_fallback: bool = False, use_goruut_fallback: bool = False, unk: str = '?', load_silver: bool = True, load_gold: bool = True, version: str = '1.0', expand_abbreviations: bool = True, enable_context_detection: bool = True, **kwargs: Any) -> None
Initialize the Czech G2P converter.
- Args:
language: Language code (default: 'cs-cz').
use_espeak_fallback: Whether to use espeak for out-of-vocabulary (OOV) words.
use_goruut_fallback: Whether to use goruut for out-of-vocabulary (OOV) words.
unk: Character to use for unknown characters.
load_silver: If True, load the silver tier dictionary if available. Czech currently uses rule-based G2P, so this parameter is reserved for future use and kept for consistency. Defaults to True.
load_gold: If True, load the gold tier dictionary if available. Czech currently uses rule-based G2P, so this parameter is reserved for future use and kept for consistency. Defaults to True.
expand_abbreviations: If True, expand common abbreviations (e.g., "Dr." → "Doktor"). Defaults to True.
enable_context_detection: If True, use context-aware expansion for ambiguous abbreviations. Defaults to True.
- Raises:
ValueError: If both use_espeak_fallback and use_goruut_fallback are True.
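The mutual-exclusion constraint on the two fallback flags can be pictured with a small stand-in helper. This is only a sketch of the documented contract: `pick_fallback` and its return values are hypothetical and are not part of kokorog2p, and the library's own check may be written differently.

```python
from typing import Optional


def pick_fallback(
    use_espeak_fallback: bool = False,
    use_goruut_fallback: bool = False,
) -> Optional[str]:
    """Hypothetical helper mirroring the documented ValueError rule."""
    # Both flags at once is the one combination the docs reject.
    if use_espeak_fallback and use_goruut_fallback:
        raise ValueError(
            "Enable at most one of use_espeak_fallback and use_goruut_fallback."
        )
    if use_espeak_fallback:
        return "espeak"
    if use_goruut_fallback:
        return "goruut"
    # Neither flag set: rule-based conversion only.
    return None
```

In practice this means a caller chooses at most one fallback backend when constructing the converter and leaves the other flag at its default of False.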
- __call__(text: str) -> list[GToken]
Convert text to a list of tokens with phonemes.
- Args:
text: Input text to convert.
- Returns:
List of GToken objects with phonemes assigned.
Examples
from kokorog2p.cs import CzechG2P
g2p = CzechG2P(language="cs-cz")
tokens = g2p("Dobrý den, jak se máte?")
for token in tokens:
    print(f"{token.text} -> {token.phonemes}")
Phonological Rules
Czech G2P implements the following phonological rules:
Palatalization: d+i → ɟ, t+i → c, n+i → ɲ
Long vowels: á → aː, í → iː, ú/ů → uː, é → eː, ó → oː
ř phoneme: ř → [r̝] (raised alveolar trill)
CH digraph: ch → [x]
Final devoicing: Voiced consonants become voiceless at the end of a word (e.g., word-final d is pronounced [t])
Voicing assimilation: Consonants within a cluster assimilate in voicing, so the cluster is pronounced either fully voiced or fully voiceless
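The rules listed above can be illustrated with a toy, self-contained sketch. It is not the kokorog2p implementation: every name in it (`SIMPLE`, `PALATAL`, `DEVOICE`, `toy_czech_g2p`) is hypothetical, it covers only a handful of letters, and it assumes that assimilation runs right to left and that [v] does not voice a preceding consonant, which is a simplification of standard Czech phonology rather than something this documentation states.

```python
import unicodedata

# Toy illustration only: NOT the kokorog2p implementation.
# Letter-to-phoneme substitutions (long vowels, ř, and a few consonants).
SIMPLE = {
    "á": "aː", "í": "iː", "ú": "uː", "ů": "uː", "é": "eː", "ó": "oː",
    "ř": "r̝", "h": "ɦ", "c": "ts", "č": "tʃ", "š": "ʃ", "ž": "ʒ",
}
# Palatalization of d, t, n before i or í.
PALATAL = {"d": "ɟ", "t": "c", "n": "ɲ"}
# Voiced -> voiceless pairs used for devoicing, and the inverse for voicing.
DEVOICE = {"b": "p", "d": "t", "g": "k", "v": "f", "z": "s", "ʒ": "ʃ", "ɟ": "c"}
VOICE = {voiceless: voiced for voiced, voiceless in DEVOICE.items()}
VOICELESS = set(VOICE) | {"x", "ts", "tʃ"}
# Assumption: [v] undergoes devoicing but does not trigger voicing.
VOICING_TRIGGERS = set(DEVOICE) - {"v"}


def toy_czech_g2p(word):
    """Return a list of phoneme strings for a single lowercase-able word."""
    word = unicodedata.normalize("NFC", word.lower())
    phones = []
    i = 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch == "c" and nxt == "h":
            # CH digraph: two letters, one phoneme.
            phones.append("x")
            i += 2
            continue
        if ch in PALATAL and nxt in ("i", "í"):
            phones.append(PALATAL[ch])
        else:
            phones.append(SIMPLE.get(ch, ch))
        i += 1

    # Final devoicing: only the last phoneme of the word is affected here.
    if phones and phones[-1] in DEVOICE:
        phones[-1] = DEVOICE[phones[-1]]

    # Voicing assimilation, scanning right to left so changes propagate
    # through the whole cluster.
    for j in range(len(phones) - 2, -1, -1):
        cur, following = phones[j], phones[j + 1]
        if following in VOICELESS and cur in DEVOICE:
            phones[j] = DEVOICE[cur]
        elif following in VOICING_TRIGGERS and cur in VOICE:
            phones[j] = VOICE[cur]
    return phones


for example in ("led", "chléb", "kdo", "dík"):
    print(example, "->", "".join(toy_czech_g2p(example)))
```

Applying final devoicing before cluster assimilation lets a devoiced final consonant pull the consonant before it along with it. The sketch ignores stress, vowel quality of short i, glottal stops, and abbreviation expansion, all of which a real converter would need to handle.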