Chinese API
===========

Chinese G2P uses jieba for tokenization and supports two phoneme output formats.

Main Class
----------

.. autoclass:: kokorog2p.zh.ChineseG2P
   :members:
   :undoc-members:
   :show-inheritance:

Examples
--------

Basic Usage
^^^^^^^^^^^

.. code-block:: python

   from kokorog2p.zh import ChineseG2P

   g2p = ChineseG2P(language="zh")
   tokens = g2p("你好世界")

   for token in tokens:
       print(f"{token.text} -> {token.phonemes}")

Model Versions
--------------

The Chinese G2P supports two versions with different output formats:

Legacy Version (version="1.0")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Uses pypinyin + IPA transcription

* Output format: IPA with arrow tone markers (↓ ↗ ↘ →)
* Compatible with base Kokoro model
* Example: ``"你好"`` → ``"ni↓xau↓"``

.. code-block:: python

   from kokorog2p import get_g2p

   # Create legacy Chinese G2P
   g2p = get_g2p("zh", version="1.0")
   phonemes = g2p.phonemize("你好")
   # Output: 'ni↓xau↓'

Version 1.1 (version="1.1")
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Uses ZHFrontend with Zhuyin (Bopomofo) notation

* Output format: Zhuyin characters + tone numbers (1-5)
* Requires Kokoro-82M-v1.1-zh model
* Example: ``"你好"`` → ``"ㄋㄧ2ㄏㄠ3"``

.. code-block:: python

   from kokorog2p import get_g2p
   from kokorog2p.vocab import validate_for_kokoro

   # Create v1.1 Chinese G2P
   g2p = get_g2p("zh", version="1.1")
   phonemes = g2p.phonemize("你好")
   # Output: 'ㄋㄧ2ㄏㄠ3'

   # Validate against v1.1-zh model
   is_valid, invalid = validate_for_kokoro(phonemes, model="1.1")
   assert is_valid

Model Selection for Validation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When validating phonemes, specify the target model:

.. code-block:: python

   from kokorog2p.vocab import validate_for_kokoro

   # For base model (IPA output from legacy version)
   is_valid, invalid = validate_for_kokoro(phonemes, model="1.0")

   # For v1.1-zh model (Zhuyin output from version 1.1)
   is_valid, invalid = validate_for_kokoro(phonemes, model="1.1")

Features
--------

* Jieba tokenization for Chinese word segmentation
* Pypinyin for pinyin conversion to IPA (legacy version)
* ZHFrontend with Zhuyin notation (version 1.1)
* Tone sandhi rules
* cn2an for number handling
* Chinese to Western punctuation mapping