Chinese API

Chinese G2P uses jieba for tokenization and supports two phoneme output formats: IPA with arrow tone markers (version 1.0) and Zhuyin with tone numbers (version 1.1, the default).

Main Class

class kokorog2p.zh.ChineseG2P(language: str = 'zh', use_espeak_fallback: bool = True, use_spacy: bool = False, spacy_model: str = 'zh_core_web_sm', version: str = '1.1', unk: str = '', en_callable=None, load_silver: bool = True, load_gold: bool = True, **kwargs)

Bases: G2PBase

Chinese G2P using pypinyin and IPA transcription.

This class converts Chinese text to IPA phonemes using:

  1. Jieba for word segmentation

  2. pypinyin for pinyin extraction

  3. Custom pinyin-to-IPA mapping

Example:
>>> g2p = ChineseG2P()
>>> tokens = g2p("你好世界")

__init__(language: str = 'zh', use_espeak_fallback: bool = True, use_spacy: bool = False, spacy_model: str = 'zh_core_web_sm', version: str = '1.1', unk: str = '', en_callable=None, load_silver: bool = True, load_gold: bool = True, **kwargs) -> None

Initialize the Chinese G2P.

Args:

language: Language code (e.g., 'zh', 'zh-cn').

use_espeak_fallback: Whether to use espeak for English words.

use_spacy: Reserved for API consistency. Chinese uses jieba/pypinyin and custom frontend pipelines for tokenization and phonemization.

spacy_model: Reserved for API consistency when use_spacy is enabled.

version: Version of the G2P ("1.0" for base model, "1.1" for ZHFrontend multilingual). Default: "1.1".

unk: Unknown token placeholder.

en_callable: Callable for English word phonemization.

load_silver: If True, load the silver tier dictionary if available. Chinese currently uses the pypinyin system, so this parameter is reserved for future use and API consistency. Defaults to True.

load_gold: If True, load the gold tier dictionary if available. Chinese currently uses the pypinyin system, so this parameter is reserved for future use and API consistency. Defaults to True.

**kwargs: Additional arguments.

property frontend

Lazy initialization of ZHFrontend for version 1.1.

property jieba

Lazy import of jieba.

property cn2an

Lazy import of cn2an.

property pypinyin

Lazy import of pypinyin.

property transcription

Lazy import of transcription module.

static retone(p: str) -> str

Convert tone markers to simpler format.
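
As an illustration of what such a conversion might look like, the sketch below rewrites Chao tone letters as the arrow markers listed for version 1.0 output (↓ ↗ ↘ →). The name `retone_sketch` and the letter-to-arrow table are assumptions made for this example; the mapping actually used by `retone` is not shown on this page and may differ.

```python
# Hypothetical mapping from Chao tone letters to arrow markers (an assumption,
# not taken from the kokorog2p source). Multi-letter contours come first so
# that the single-letter rule for "˥" does not split "˧˥" or "˥˩".
TONE_ARROWS = [
    ("˧˩˧", "↓"),  # dipping contour (third tone)
    ("˧˥", "↗"),   # rising contour (second tone)
    ("˥˩", "↘"),   # falling contour (fourth tone)
    ("˥", "→"),    # high level (first tone)
]


def retone_sketch(ipa: str) -> str:
    """Replace each known tone-letter sequence with its arrow marker."""
    for letters, arrow in TONE_ARROWS:
        ipa = ipa.replace(letters, arrow)
    return ipa
```

The ordered list, rather than a dict, makes the longest-match-first requirement explicit.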

py2ipa(py: str) -> str

Convert pinyin to IPA.

word2ipa(w: str) -> str

Convert a Chinese word to IPA via pinyin.

static map_punctuation(text: str) -> str

Convert Chinese punctuation to ASCII equivalents.
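
A minimal sketch of this kind of mapping, using only the standard library. The table below is an illustrative subset chosen for this example, and `map_punctuation_sketch` is a hypothetical name; the characters actually handled by `map_punctuation` are not listed on this page.

```python
# Illustrative subset of full-width punctuation (an assumption); the table
# used by ChineseG2P.map_punctuation may cover more characters or differ.
PUNCT_TABLE = str.maketrans({
    "，": ",",
    "。": ".",
    "！": "!",
    "？": "?",
    "；": ";",
    "：": ":",
})


def map_punctuation_sketch(text: str) -> str:
    """Translate each full-width mark to its ASCII counterpart."""
    return text.translate(PUNCT_TABLE)
```

Characters outside the table, including the Chinese text itself, pass through unchanged.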

legacy_call(text: str) -> str

Legacy phonemization using jieba and pypinyin directly.

__call__(text: str) -> list[GToken]

Convert text to tokens with phonemes.

Args:

text: Input text to convert.

Returns:

List of GToken objects with phonemes.

lookup(word: str, tag: str | None = None) -> str | None

Look up a word’s phonemes.

Args:

word: The word to look up.

tag: Optional POS tag (ignored for Chinese).

Returns:

Phoneme string or None.

phonemize(text: str) -> str

Convert text to phonemes.

Args:

text: Input text to convert.

Returns:

Phoneme string.

get_target_model() -> str

Get the target Kokoro model variant for this G2P instance.

Returns:

Model identifier: “1.1” for version 1.1, “1.0” otherwise.

Examples

Basic Usage

from kokorog2p.zh import ChineseG2P

g2p = ChineseG2P(language="zh")
tokens = g2p("你好世界")

for token in tokens:
    print(f"{token.text} -> {token.phonemes}")

Model Versions

The Chinese G2P supports two versions with different output formats:

Legacy Version (version="1.0")

Uses pypinyin + IPA transcription

  • Output format: IPA with arrow tone markers (↓ ↗ ↘ →)

  • Compatible with base Kokoro model

  • Example: "你好" -> "ni↓xau↓"

from kokorog2p import get_g2p

# Create legacy Chinese G2P
g2p = get_g2p("zh", version="1.0")
phonemes = g2p.phonemize("你好")
# Output: 'ni↓xau↓'

Version 1.1 (version="1.1")

Uses ZHFrontend with Zhuyin (Bopomofo) notation

  • Output format: Zhuyin characters + tone numbers (1-5)

  • Requires Kokoro-82M-v1.1-zh model

  • Example: "你好" -> "ㄋㄧ2ㄏㄠ3"

from kokorog2p import get_g2p
from kokorog2p.vocab import validate_for_kokoro

# Create v1.1 Chinese G2P
g2p = get_g2p("zh", version="1.1")
phonemes = g2p.phonemize("你好")
# Output: 'ㄋㄧ2ㄏㄠ3'

# Validate against v1.1-zh model
is_valid, invalid = validate_for_kokoro(phonemes, model="1.1")
assert is_valid
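
For downstream processing it can help to split the version 1.1 output back into syllables. The sketch below assumes, based on the format described above, that every syllable is a run of Zhuyin characters followed by a single tone digit from 1 to 5. `split_zhuyin` is a hypothetical helper, not part of kokorog2p.

```python
import re

# One syllable: a run of non-digit, non-space characters, then a tone digit 1-5.
SYLLABLE = re.compile(r"([^\d\s]+)([1-5])")


def split_zhuyin(phonemes: str) -> list[tuple[str, int]]:
    """Return (zhuyin_symbols, tone_number) pairs in order of appearance."""
    return [(symbols, int(tone)) for symbols, tone in SYLLABLE.findall(phonemes)]
```

With the example above, `split_zhuyin("ㄋㄧ2ㄏㄠ3")` yields the pairs `("ㄋㄧ", 2)` and `("ㄏㄠ", 3)`. Punctuation in the phoneme string would be grouped with the syllable that follows it, so strip it first if that matters.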

Model Selection for Validation

When validating phonemes, specify the target model:

from kokorog2p.vocab import validate_for_kokoro

# For base model (IPA output from legacy version)
is_valid, invalid = validate_for_kokoro(phonemes, model="1.0")

# For v1.1-zh model (Zhuyin output from version 1.1)
is_valid, invalid = validate_for_kokoro(phonemes, model="1.1")
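
Rather than hard-coding the model string, the target can be read from the G2P instance via `get_target_model()`. The helper below is a sketch, not library code: `validate_for_instance` is a hypothetical name, and the validator is passed in as a callable, assumed to accept `(phonemes, model=...)` as `validate_for_kokoro` does in the examples above.

```python
def validate_for_instance(g2p, text, validate):
    """Phonemize `text` and validate it against the model this G2P targets.

    `g2p` is any object exposing phonemize() and get_target_model();
    `validate` is a callable taking (phonemes, model=...), passed in so that
    this helper has no hard dependency on kokorog2p.
    """
    phonemes = g2p.phonemize(text)
    model = g2p.get_target_model()  # "1.1" or "1.0", per the API reference
    return validate(phonemes, model=model)
```

A call would then look like `validate_for_instance(get_g2p("zh", version="1.1"), "你好", validate_for_kokoro)`, returning whatever the validator returns.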

Features

  • Jieba tokenization for Chinese word segmentation

  • pypinyin for pinyin extraction, followed by a custom pinyin-to-IPA mapping (legacy version)

  • ZHFrontend with Zhuyin notation (version 1.1)

  • Tone sandhi rules

  • cn2an for number handling

  • Chinese to Western punctuation mapping
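
To make the tone sandhi item concrete: 你 and 好 both carry the third tone in isolation, yet the version 1.1 example reads ㄋㄧ2ㄏㄠ3, which is consistent with the standard third-tone sandhi rule (a third tone becomes a second tone before another third tone). The sketch below implements only that one rule over (syllable, tone) pairs. It is a simplification for illustration, not the rule set used by ZHFrontend, and `third_tone_sandhi` is a hypothetical name.

```python
def third_tone_sandhi(syllables: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Change tone 3 to tone 2 when the next syllable originally has tone 3."""
    result = []
    for i, (symbols, tone) in enumerate(syllables):
        # Look at the original tone of the following syllable, if there is one.
        next_is_third = i + 1 < len(syllables) and syllables[i + 1][1] == 3
        result.append((symbols, 2 if tone == 3 and next_is_third else tone))
    return result
```

Real sandhi also depends on word boundaries and prosodic grouping, which this pairwise check ignores.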