Chinese API
Chinese G2P uses jieba for tokenization and supports two phoneme output formats.
Main Class
- class kokorog2p.zh.ChineseG2P(language: str = 'zh', use_espeak_fallback: bool = True, use_spacy: bool = False, spacy_model: str = 'zh_core_web_sm', version: str = '1.1', unk: str = '', en_callable=None, load_silver: bool = True, load_gold: bool = True, **kwargs)
Bases: G2PBase
Chinese G2P using pypinyin and IPA transcription.
With version "1.0", this class converts Chinese text to IPA phonemes using:
1. Jieba for word segmentation
2. pypinyin for pinyin extraction
3. Custom pinyin-to-IPA mapping
With version "1.1" (the default), ZHFrontend produces Zhuyin output instead.
- Example:
>>> g2p = ChineseG2P()
>>> tokens = g2p("你好世界")
- __init__(language: str = 'zh', use_espeak_fallback: bool = True, use_spacy: bool = False, spacy_model: str = 'zh_core_web_sm', version: str = '1.1', unk: str = '', en_callable=None, load_silver: bool = True, load_gold: bool = True, **kwargs) -> None
Initialize the Chinese G2P.
- Args:
language: Language code (e.g., 'zh', 'zh-cn').
use_espeak_fallback: Whether to use espeak for English words.
use_spacy: Reserved for API consistency. Chinese uses jieba/pypinyin and custom frontend pipelines for tokenization and phonemization.
spacy_model: Reserved for API consistency when use_spacy is enabled.
version: Version of the G2P ("1.0" for base model, "1.1" for ZHFrontend multilingual). Default: "1.1".
unk: Unknown token placeholder.
en_callable: Callable for English word phonemization.
load_silver: If True, load the silver tier dictionary if available. Chinese currently uses the pypinyin system, so this parameter is reserved for future use. Defaults to True for consistency.
load_gold: If True, load the gold tier dictionary if available. Chinese currently uses the pypinyin system, so this parameter is reserved for future use. Defaults to True for consistency.
**kwargs: Additional arguments.
- property frontend
Lazy initialization of ZHFrontend for version 1.1.
- property jieba
Lazy import of jieba.
- property cn2an
Lazy import of cn2an.
- property pypinyin
Lazy import of pypinyin.
- property transcription
Lazy import of transcription module.
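The lazy properties above defer importing optional dependencies until they are first accessed. The pattern can be sketched roughly as follows; the class name `LazyDeps` is hypothetical, and the standard-library `json` module stands in for jieba or pypinyin. This illustrates the general technique and is not the library's actual implementation.

```python
import importlib
from types import ModuleType
from typing import Optional


class LazyDeps:
    """Hypothetical holder that imports a module on first access only."""

    def __init__(self) -> None:
        self._json: Optional[ModuleType] = None  # nothing imported yet

    @property
    def json(self) -> ModuleType:
        # Import on first access, then reuse the cached module object.
        if self._json is None:
            self._json = importlib.import_module("json")
        return self._json


deps = LazyDeps()
first = deps.json    # triggers the import
second = deps.json   # served from the cache
print(first is second)
```

Constructing the object stays cheap, and an optional dependency that is never used is never imported.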
- __call__(text: str) -> list[GToken]
Convert text to tokens with phonemes.
- Args:
text: Input text to convert.
- Returns:
List of GToken objects with phonemes.
- lookup(word: str, tag: str | None = None) -> str | None
Look up a word’s phonemes.
- Args:
word: The word to look up.
tag: Optional POS tag (ignored for Chinese).
- Returns:
Phoneme string or None.
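Because lookup can return None, callers usually need a fallback. A minimal sketch, assuming only the `lookup(word, tag=None)` signature documented above; the helper `lookup_or_default` and the stub class are hypothetical and exist only so the example runs without the library installed.

```python
from typing import Optional, Protocol


class SupportsLookup(Protocol):
    def lookup(self, word: str, tag: Optional[str] = None) -> Optional[str]: ...


def lookup_or_default(g2p: SupportsLookup, word: str, default: str = "") -> str:
    """Return the phonemes for `word`, or `default` when lookup yields None."""
    phonemes = g2p.lookup(word)
    return default if phonemes is None else phonemes


class StubG2P:
    """Hypothetical stand-in for ChineseG2P with a one-entry table."""

    def lookup(self, word: str, tag: Optional[str] = None) -> Optional[str]:
        return {"你好": "ㄋㄧ2ㄏㄠ3"}.get(word)


stub = StubG2P()
print(lookup_or_default(stub, "你好"))
print(lookup_or_default(stub, "??", default="<unk>"))
```

In real use, a ChineseG2P instance would take the place of the stub.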
Examples
Basic Usage
from kokorog2p.zh import ChineseG2P
g2p = ChineseG2P(language="zh")
tokens = g2p("你好世界")
for token in tokens:
    print(f"{token.text} -> {token.phonemes}")
Model Versions
The Chinese G2P supports two versions with different output formats:
Legacy Version (version="1.0")
Uses pypinyin + IPA transcription
Output format: IPA with arrow tone markers (↓ ↗ ↘ →)
Compatible with base Kokoro model
Example:
"你好" → "ni↓xau↓"
from kokorog2p import get_g2p
# Create legacy Chinese G2P
g2p = get_g2p("zh", version="1.0")
phonemes = g2p.phonemize("你好")
# Output: 'ni↓xau↓'
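The legacy format pairs an IPA rendering of each syllable with an arrow for its tone. Below is a toy sketch of that final mapping step, reproducing the example above. The tables cover only the two syllables needed, and the assignment of arrows to tones 1, 2, and 4 is an assumption (the example above only shows ↓ for the third tone). The library's own tables are much larger and may differ.

```python
# Toy tables: just enough to cover "ni3" and "hao3". Illustrative only.
INITIALS = {"n": "n", "h": "x"}
FINALS = {"i": "i", "ao": "au"}
# Assumed tone-to-arrow assignment; only tone 3 -> "↓" appears in the docs.
TONE_ARROWS = {"1": "→", "2": "↗", "3": "↓", "4": "↘"}


def syllable_to_ipa(syllable: str) -> str:
    """Convert one tone-numbered pinyin syllable, e.g. 'hao3', to IPA + arrow."""
    body, tone = syllable[:-1], syllable[-1]
    initial, final = body[0], body[1:]
    return INITIALS[initial] + FINALS[final] + TONE_ARROWS[tone]


def pinyin_to_ipa(syllables: list[str]) -> str:
    """Concatenate the IPA rendering of each syllable."""
    return "".join(syllable_to_ipa(s) for s in syllables)


print(pinyin_to_ipa(["ni3", "hao3"]))  # ni↓xau↓
```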
Version 1.1 (version="1.1")
Uses ZHFrontend with Zhuyin (Bopomofo) notation
Output format: Zhuyin characters + tone numbers (1-5)
Requires Kokoro-82M-v1.1-zh model
Example:
"你好" → "ㄋㄧ2ㄏㄠ3"
from kokorog2p import get_g2p
from kokorog2p.vocab import validate_for_kokoro
# Create v1.1 Chinese G2P
g2p = get_g2p("zh", version="1.1")
phonemes = g2p.phonemize("你好")
# Output: 'ㄋㄧ2ㄏㄠ3'
# Validate against v1.1-zh model
is_valid, invalid = validate_for_kokoro(phonemes, model="1.1")
assert is_valid
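In the example above, 你 is a third-tone syllable in isolation but appears with tone 2. This is the standard Mandarin third-tone sandhi: a third tone before another third tone is pronounced as a second tone. A minimal sketch of that one rule on tone-numbered syllables follows. The function name is hypothetical, and real sandhi handling (word boundaries, 一 and 不, longer third-tone runs) is more involved than this.

```python
def apply_third_tone_sandhi(syllables: list[str]) -> list[str]:
    """Change tone 3 to tone 2 when the next syllable is also tone 3.

    Each adjacent pair is checked against the original tones, a
    simplification that ignores prosodic grouping.
    """
    result = list(syllables)
    for i in range(len(syllables) - 1):
        if syllables[i].endswith("3") and syllables[i + 1].endswith("3"):
            result[i] = syllables[i][:-1] + "2"
    return result


print(apply_third_tone_sandhi(["ㄋㄧ3", "ㄏㄠ3"]))    # ['ㄋㄧ2', 'ㄏㄠ3']
print(apply_third_tone_sandhi(["ㄕ4", "ㄐㄧㄝ4"]))  # unchanged
```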
Model Selection for Validation
When validating phonemes, specify the target model:
from kokorog2p import get_g2p
from kokorog2p.vocab import validate_for_kokoro
ipa = get_g2p("zh", version="1.0").phonemize("你好")
zhuyin = get_g2p("zh", version="1.1").phonemize("你好")
# For base model (IPA output from legacy version)
is_valid, invalid = validate_for_kokoro(ipa, model="1.0")
# For v1.1-zh model (Zhuyin output from version 1.1)
is_valid, invalid = validate_for_kokoro(zhuyin, model="1.1")
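Since the two formats use different scripts, the target model can often be inferred from the phoneme string itself. This is a heuristic sketch, assuming that any character in the Unicode Bopomofo block (U+3100 to U+312F) indicates version 1.1 output. The helper `guess_model` is hypothetical and not part of kokorog2p.

```python
def contains_bopomofo(text: str) -> bool:
    """True if any character lies in the Unicode Bopomofo block."""
    return any("\u3100" <= ch <= "\u312f" for ch in text)


def guess_model(phonemes: str) -> str:
    """Return "1.1" for Zhuyin-looking strings, otherwise "1.0"."""
    return "1.1" if contains_bopomofo(phonemes) else "1.0"


print(guess_model("ㄋㄧ2ㄏㄠ3"))  # 1.1
print(guess_model("ni↓xau↓"))    # 1.0
```

The returned string could then be passed as the `model` argument when validating.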
Features
Jieba tokenization for Chinese word segmentation
pypinyin for pinyin extraction, followed by conversion to IPA (legacy version)
ZHFrontend with Zhuyin notation (version 1.1)
Tone sandhi rules
cn2an for number handling
Chinese to Western punctuation mapping
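The punctuation mapping in the last item can be approximated with `str.translate`. The table below is an illustrative subset chosen for this sketch, not the library's actual mapping.

```python
# Illustrative subset of full-width Chinese punctuation and ASCII targets.
ZH_TO_WESTERN = str.maketrans({
    "，": ",",
    "。": ".",
    "！": "!",
    "？": "?",
    "；": ";",
    "：": ":",
    "、": ",",
    "（": "(",
    "）": ")",
})


def normalize_punctuation(text: str) -> str:
    """Replace mapped Chinese punctuation marks with Western equivalents."""
    return text.translate(ZH_TO_WESTERN)


print(normalize_punctuation("你好，世界！"))  # 你好,世界!
```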