Core API

This module contains the core functionality of kokorog2p.

Main Functions

kokorog2p.phonemize(text: str, language: str = 'en-us', *, overrides: list[OverrideSpan] | None = None, return_ids: bool = True, return_phonemes: bool = True, alignment: Literal['span', 'legacy'] = 'span', overlap: Literal['snap', 'strict'] = 'snap', use_normalizer_rules: bool = True, use_espeak_fallback: bool = True, use_goruut_fallback: bool = False, use_cli: bool = False, use_spacy: bool = True, spacy_model: str | None = None, backend: Literal['kokorog2p', 'espeak', 'goruut'] = 'kokorog2p', g2p: G2PBase | None = None) → PhonemizeResult[source]

Phonemize text using the unified kokorog2p pipeline.

This is the primary public entry point for turning text into phonemes (and optionally tokens or token IDs) in a consistent way.

Internally, this function delegates to the same implementation used by span-based override phonemization (the former phonemize_to_result path), ensuring that:

The phoneme string returned here is identical to the one in the returned PhonemizeResult.
Tokenization and character offsets are deterministic and match the phoneme output.
Kokoro-model vocabulary validation/filtering is applied when producing token IDs (and when necessary to make the phoneme string ID-safe).

Args:

text:

Input text to phonemize. This should be plain text (no markup). Punctuation may be normalized (e.g. ... → …, - → —) to match Kokoro-compatible forms.

language:

Language code (e.g. "en-us", "en-gb", "de", "fr"). Used both for tokenization/alignment and for constructing a default G2P instance when g2p is not provided.

overrides:

Optional span-based overrides applied by character offsets. Overrides can inject phonemes ({"ph": "…"}) and/or change the language of a span ({"lang": "de"}) for that region.

return_ids:

Whether to include token IDs in the returned result.

return_phonemes:

Whether to include the phoneme string in the returned result.

alignment:

Alignment mode for applying overrides and token offsets:

"span" (default): deterministic offset-based alignment using tokenize().
"legacy": backward-compatible alignment based on the backend’s own tokenization. This may differ slightly across backends and languages.

overlap:

How to handle overrides that partially overlap a token boundary:

"snap" (default): apply to intersecting tokens and emit a warning when boundaries only partially overlap.
"strict": skip partial overlaps and emit a warning.

use_normalizer_rules:

Whether to apply language normalizer rules when building the internal alignment text used for span mapping.

use_espeak_fallback:

When constructing a G2P instance for the dictionary-based "kokorog2p" backend, fall back to eSpeak for out-of-vocabulary words. Ignored if g2p is provided.

use_goruut_fallback:

When constructing a G2P instance for the dictionary-based "kokorog2p" backend, fall back to goruut for out-of-vocabulary words. Ignored if g2p is provided.

use_spacy:

When constructing a G2P instance, whether to use spaCy for tokenization/POS tagging (English/French and optionally German). For Chinese/Japanese/Korean, this flag is accepted for API consistency but currently does not alter backend behavior. Ignored if g2p is provided.

spacy_model:

Optional spaCy model package to use when constructing a G2P instance (e.g. "en_core_web_sm", "fr_core_news_md", "de_core_news_md"). If omitted, language defaults are used. For Chinese/Japanese/Korean, accepted for API consistency but not currently used by native backends.

backend:

When constructing a G2P instance, select the backend: "kokorog2p", "espeak", or "goruut". Ignored if g2p is provided.

g2p:

Optional pre-created G2P instance to reuse across calls (useful for caching/performance). If provided, this function will use it directly and will NOT call get_g2p() (so backend and the fallback/spaCy construction flags are ignored for this call).

Returns:

A PhonemizeResult containing tokens, phonemes, token_ids, and warnings (depending on return_* flags).

Examples:

Basic phonemization:

>>> phonemize("Hello world!", language="en-us").phonemes
'h…'

Token IDs (model-ready):

>>> phonemize("Hello world!").token_ids
[ ... ]

Reusing a cached G2P instance:

>>> g2p = get_g2p(language="en-us")
>>> phonemize("Hello world!", g2p=g2p).phonemes
'h…'

Full traceable result (tokens + warnings):

>>> span = [OverrideSpan(6, 10, {"lang": "de"})]
>>> r = phonemize("Hello Welt!", overrides=span)
>>> r.tokens[1].lang
'de'
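
The overlap modes described in the Args above decide what happens when an override span covers only part of a token. The following is a standalone sketch of that decision, not kokorog2p's implementation: `Tok`, `resolve_override`, and the warning strings are hypothetical, and tokens are reduced to plain (text, start, end) records.

```python
from typing import NamedTuple


class Tok(NamedTuple):
    text: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def resolve_override(tokens, span_start, span_end, overlap="snap"):
    """Return (indices of affected tokens, warnings) for one override span."""
    hit, warnings = [], []
    for i, tok in enumerate(tokens):
        # Skip tokens that do not intersect the span at all.
        if tok.end <= span_start or tok.start >= span_end:
            continue
        fully_covered = span_start <= tok.start and tok.end <= span_end
        if fully_covered:
            hit.append(i)
        elif overlap == "snap":
            # Partial overlap: apply to the whole token, but warn.
            hit.append(i)
            warnings.append(f"span snapped to token {tok.text!r}")
        else:
            # "strict": leave partially covered tokens untouched, but warn.
            warnings.append(f"partial overlap with {tok.text!r} skipped")
    return hit, warnings


tokens = [Tok("Hello", 0, 5), Tok("Welt", 6, 10), Tok("!", 10, 11)]
print(resolve_override(tokens, 6, 10))            # span matches "Welt" exactly
print(resolve_override(tokens, 7, 10, "snap"))    # partial overlap, applied
print(resolve_override(tokens, 7, 10, "strict"))  # partial overlap, skipped
```

Both modes emit a warning on a partial overlap; they differ only in whether the override is still applied to the intersecting token.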

kokorog2p.tokenize(text: str, language: str = 'en-us', *, keep_punct: bool = True) → list[TokenSpan][source]

Convert text to a list of tokens with phonemes.

Args:

text: Input text to convert.
language: Language code (e.g., "en-us", "en-gb").
keep_punct: Whether to include punctuation tokens.

Returns:

List of TokenSpan objects with char offsets.

Example:

>>> tokens = tokenize("Hello world!", language="en-us")
>>> for t in tokens:
...     print(f"{t.text} [{t.char_start}:{t.char_end}]")
Hello [0:5]
world [6:11]
! [11:12]
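
To make the offset convention in the example above concrete, here is a minimal regex-based sketch that produces the same (text, start, end) triples for simple input. It is an illustration only: `Span`, `simple_tokenize`, and the token pattern are assumptions, and the real tokenizer is language-aware and will differ on anything non-trivial.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    text: str
    char_start: int  # inclusive
    char_end: int    # exclusive


# A word (letters, digits, underscore, apostrophe) or one punctuation character.
_TOKEN_RE = re.compile(r"(?P<word>[\w']+)|(?P<punct>[^\w\s])")


def simple_tokenize(text, keep_punct=True):
    spans = []
    for m in _TOKEN_RE.finditer(text):
        if m.lastgroup == "punct" and not keep_punct:
            continue
        spans.append(Span(m.group(), m.start(), m.end()))
    return spans


for t in simple_tokenize("Hello world!"):
    print(f"{t.text} [{t.char_start}:{t.char_end}]")
```

Because end offsets are exclusive, `text[t.char_start:t.char_end]` always recovers the token text.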

kokorog2p.get_g2p(language: str = 'en-us', use_espeak_fallback: bool = True, use_goruut_fallback: bool = False, use_cli: bool = False, use_spacy: bool = True, backend: Literal['kokorog2p', 'espeak', 'goruut'] = 'kokorog2p', load_silver: bool = True, load_gold: bool = True, version: str = '1.0', phoneme_quotes: str = 'curly', strict: bool = True, spacy_model: str | None = None, **kwargs: Any) → G2PBase[source]

Get a G2P instance for the specified language.

This factory function returns an appropriate G2P instance based on the language code. Results are cached for efficiency. For mixed-language text, use preprocess_multilang to generate OverrideSpan objects and pass them to phonemize via overrides (the former phonemize_to_result path).

Args:

language: Language code (e.g., "en-us", "en-gb", "zh", "ja", "fr").

use_espeak_fallback: Whether to use espeak for out-of-vocabulary words when using the dictionary-based "kokorog2p" backend. Ignored when backend is set to "espeak" (espeak is the primary backend).

use_goruut_fallback: Whether to use goruut for out-of-vocabulary words when using the dictionary-based "kokorog2p" backend. Ignored when backend is set to "goruut" (goruut is the primary backend).

use_spacy: Whether to use spaCy for tokenization and POS tagging (applies to English/French and optionally German). For Chinese/Japanese/Korean, this flag is currently accepted for API consistency but not used by their native pipelines. Used by the "kokorog2p" backend.

spacy_model: Optional spaCy model package name to override the language default (e.g., "en_core_web_sm", "en_core_web_md", "fr_core_news_md", "de_core_news_md"). If None, each language G2P class uses its built-in default. For Chinese/Japanese/Korean, this parameter is accepted for API consistency but currently does not alter backend behavior.

use_cli: If True, force use of the CLI espeak phonemizer instead of library bindings. Only applies when backend="espeak".

backend: Phonemization backend to use: "kokorog2p", "espeak", or "goruut". The goruut backend requires pygoruut to be installed.

load_silver: If True, load the silver tier dictionary (~100k extra entries). Defaults to True for backward compatibility and maximum coverage. Set to False to save memory (~22-31 MB) and initialization time. Only applies to English (en-us, en-gb). Other languages reserve this parameter for future use.

load_gold: If True, load the gold tier dictionary (~170k common words). Defaults to True for maximum quality and coverage. Set to False when only the silver tier or no dictionaries are needed. Only applies to languages with dictionaries (English, French, German).

version: Model version to use. Default: "1.0" (base model).

"1.0": Base model
"1.1": Chinese/English model

Different languages may behave differently across versions. Chinese: "1.0" = IPA output, "1.1" = Zhuyin output.

phoneme_quotes: Quote character style in phoneme output. Only applies to English currently. Options:

"curly": Use curly quotes (“ ”); default, backward compatible
"ascii": Use ASCII double quotes (")
"none": Remove quote characters from phoneme output

strict: If True (default), raise exceptions when backend initialization or phonemization fails. If False, log errors and return empty results for backward compatibility with older versions that silently failed. Recommended: True for production use to catch configuration issues.

**kwargs: Additional arguments passed to the G2P constructor.

Returns:

A G2PBase instance for the specified language.

Raises:

ValueError: If the language is not supported and no fallback is available, or if version is not "1.0" or "1.1".

ImportError: If backend="goruut" but pygoruut is not installed.

Example:

>>> g2p = get_g2p("en-us")
>>> tokens = g2p("Hello world!")
>>> # Disable silver for better performance
>>> g2p_fast = get_g2p("en-us", load_silver=False)
>>> # Ultra-fast initialization with no dictionaries
>>> g2p_minimal = get_g2p("en-us", load_silver=False, load_gold=False)
>>> # Chinese
>>> g2p_zh = get_g2p("zh")
>>> # Japanese
>>> g2p_ja = get_g2p("ja")
>>> # French (uses espeak fallback)
>>> g2p_fr = get_g2p("fr")
>>> # Using goruut backend
>>> g2p_goruut = get_g2p("en-us", backend="goruut")

kokorog2p.clear_cache() → None[source]

Clear the G2P instance cache.

This can be useful when you need to free memory or reset state.
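
The source states only that G2P instances are cached and that the cache can be cleared; it does not say how. As a rough mental model, a cached factory can be sketched with `functools.lru_cache`. Everything here (`DummyG2P`, `get_converter`, `clear_converters`, and keying the cache on the call arguments) is an assumption for illustration, not the library's mechanism.

```python
from functools import lru_cache


class DummyG2P:
    """Stand-in for an expensive-to-build converter (hypothetical)."""

    def __init__(self, language, backend):
        self.language = language
        self.backend = backend


@lru_cache(maxsize=None)
def get_converter(language="en-us", backend="kokorog2p"):
    # Built once per distinct argument combination, then reused.
    return DummyG2P(language, backend)


def clear_converters():
    # Drop every cached instance so the next call rebuilds from scratch.
    get_converter.cache_clear()


a = get_converter("en-us")
b = get_converter("en-us")
print(a is b)                       # True: second call hits the cache
clear_converters()
print(get_converter("en-us") is a)  # False: a fresh instance was built
```

Under this model, clearing the cache matters mainly after changing global state (dictionaries, abbreviations) that a previously built instance may have captured.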

kokorog2p.reset_abbreviations() → None[source]: Reset abbreviation expanders to their default state.

Base Classes

G2PBase

class kokorog2p.G2PBase(language: str = 'en-us', use_espeak_fallback: bool = True, use_goruut_fallback: bool = False, use_cli: bool = False, strict: bool = True)[source]

Bases: ABC

Abstract base class for grapheme-to-phoneme converters.

Subclasses must implement the __call__ method to convert text to phonemes.

__init__(language: str = 'en-us', use_espeak_fallback: bool = True, use_goruut_fallback: bool = False, use_cli: bool = False, strict: bool = True) → None[source]

Initialize the G2P converter.

Args:

language: Language code (e.g., "en-us", "en-gb").
use_espeak_fallback: Whether to use espeak for OOV words.
use_goruut_fallback: Whether to use goruut for OOV words.
use_cli: If True, use CLI phonemizer instead of library bindings.
strict: If True, raise exceptions on errors. If False, log warnings and return empty results (backward compatible mode).

load_silver: bool | None

load_gold: bool | None

property is_british: bool: Check if this is British English.

abstractmethod __call__(text: str) → list[GToken][source]

Convert text to a list of tokens with phonemes.

Args:

text: Input text to convert.

Returns:

List of GToken objects with phonemes assigned.

phonemize(text: str) → str[source]

Convert text to a phoneme string.

This is a convenience method that calls __call__ and joins the results.

Args:

text: Input text to convert.

Returns:

Phoneme string with word boundaries.

word_to_phonemes(word: str, tag: str | None = None) → str | None[source]

Convert a single word to phonemes.

Args:

word: The word to convert.
tag: Optional POS tag for disambiguation.

Returns:

Phoneme string or None if conversion failed.

abstractmethod lookup(word: str, tag: str | None = None) → str | None[source]

Look up a word in the dictionary.

Args:

word: The word to look up.
tag: Optional POS tag for disambiguation.

Returns:

Phoneme string or None if not found.

add_abbreviation(abbreviation: str, expansion: str | dict[str, str], description: str = '', case_sensitive: bool = False) → None[source]: Add or update a custom abbreviation (if supported).

remove_abbreviation(abbreviation: str, case_sensitive: bool = False) → bool[source]: Remove an abbreviation (if supported).

has_abbreviation(abbreviation: str, case_sensitive: bool = False) → bool[source]: Check if an abbreviation exists (if supported).

list_abbreviations() → list[str][source]: List abbreviations (if supported).

__repr__() → str[source]: Return a string representation.
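
The division of labour described above (abstract `__call__` and `lookup`, with `phonemize` as a convenience that joins the tokens) can be sketched without the library. `Token`, `BaseConverter`, and `DictConverter` below are hypothetical stand-ins for GToken, G2PBase, and a concrete subclass; the joining rule (phonemes followed by each token's trailing whitespace) is an assumption.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    text: str
    whitespace: str = " "
    phonemes: Optional[str] = None


class BaseConverter(ABC):
    @abstractmethod
    def __call__(self, text):
        """Convert text to a list of Token objects."""

    @abstractmethod
    def lookup(self, word):
        """Return the phoneme string for a word, or None if unknown."""

    def phonemize(self, text):
        # Convenience wrapper: run __call__ and join the per-token phonemes.
        tokens = self(text)
        joined = "".join((t.phonemes or "") + t.whitespace for t in tokens)
        return joined.strip()


class DictConverter(BaseConverter):
    def __init__(self, entries):
        self.entries = entries

    def lookup(self, word):
        return self.entries.get(word.lower())

    def __call__(self, text):
        return [Token(w, phonemes=self.lookup(w)) for w in text.split()]


g2p = DictConverter({"hello": "həlˈO", "world": "wˈɜɹld"})
print(g2p.phonemize("Hello world"))
```

Keeping `phonemize` in the base class means a subclass only has to produce tokens; string assembly stays in one place.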

GToken

class kokorog2p.GToken(text: str, tag: str = '', whitespace: str = ' ', phonemes: str | None = None, start_ts: float | None = None, end_ts: float | None = None, rating: str | None = None, _: dict[str, ~typing.Any]=<factory>)[source]

Bases: object

A token representing a word or text unit with optional phoneme information.

Attributes:

text: The original text of the token.
tag: Part-of-speech tag (e.g., "NN", "VB", "JJ"), if available.
whitespace: Trailing whitespace after this token.
phonemes: The phonemic (IPA) transcription of the token.
start_ts: Start timestamp for audio alignment.
end_ts: End timestamp for audio alignment.
rating: Quality rating of the phoneme transcription.
_: Extension dictionary for custom attributes.

text: str

tag: str = ''

whitespace: str = ' '

phonemes: str | None = None

start_ts: float | None = None

end_ts: float | None = None

rating: str | None = None

property has_phonemes: bool: Check if this token has phonemes assigned.

property is_punctuation: bool: Check if this token is punctuation.

property is_word: bool: Check if this token is a word (not punctuation or whitespace).

get(key: str, default: Any = None) → Any[source]: Get a custom attribute from the extension dict.

set(key: str, value: Any) → None[source]: Set a custom attribute in the extension dict.

copy() → GToken[source]: Create a shallow copy of this token.

__repr__() → str[source]: Return a string representation of the token.

Phoneme Utilities

Vocabulary

kokorog2p.get_vocab(british: bool = False) → frozenset[str][source]

Get the phoneme vocabulary for a dialect.

Args:

british: Whether to get British or US vocabulary.

Returns:

Frozen set of valid phonemes.

kokorog2p.validate_phonemes(phonemes: str, british: bool = False) → bool[source]

Check if all phonemes in a string are valid Kokoro phonemes.

Args:

phonemes: Phoneme string to validate.
british: Whether to validate against British or US vocabulary.

Returns:

True if all phonemes are valid.

kokorog2p.US_VOCAB: Build an immutable unordered collection of unique elements.

kokorog2p.GB_VOCAB: Build an immutable unordered collection of unique elements.

kokorog2p.VOWELS: Build an immutable unordered collection of unique elements.

kokorog2p.CONSONANTS: Build an immutable unordered collection of unique elements.

Conversion

kokorog2p.from_espeak(phonemes: str, british: bool = False) → str[source]

Convert espeak IPA output to Kokoro phonemes.

Args:

phonemes: The espeak phoneme string (with tie character ^ or ͡).
british: Whether to use British English mappings.

Returns:

Kokoro-compatible phoneme string.

Example:

>>> from_espeak("mˈɜːt͡ʃənt͡ʃˌɪp", british=False)
'mˈɜɹʧəntʃˌɪp'

kokorog2p.from_goruut(phonemes: str, british: bool = False) → str[source]

Convert goruut/pygoruut IPA output to Kokoro phonemes.

Goruut outputs standard IPA without tie characters for diphthongs and affricates, which requires different handling than espeak.

Args:

phonemes: The goruut phoneme string (standard IPA).
british: Whether to use British English mappings.

Returns:

Kokoro-compatible phoneme string.

Example:

>>> from_goruut("həlˈoʊ wˈɜɹld", british=False)
'həlˈO wˈɜɹld'
>>> from_goruut("sˈeɪ", british=False)
'sˈA'

kokorog2p.to_espeak(phonemes: str) → str[source]

Convert Kokoro phonemes to standard IPA (espeak-compatible).

Args:

phonemes: Kokoro phoneme string.

Returns:

Standard IPA phoneme string.

Example:

>>> to_espeak("hˈA")
'hˈeɪ'
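
The examples above show two diphthongs being collapsed to single Kokoro symbols ("oʊ" to "O", "eɪ" to "A") and expanded again on the way back. The sketch below reproduces only those two pairs to illustrate the idea; `IPA_TO_KOKORO`, `ipa_to_kokoro`, and `kokoro_to_ipa` are hypothetical names, and the library's real tables cover far more symbols plus dialect-specific rules.

```python
# Only the two pairs visible in the examples above; the real tables are larger.
IPA_TO_KOKORO = {"oʊ": "O", "eɪ": "A"}
KOKORO_TO_IPA = {sym: ipa for ipa, sym in IPA_TO_KOKORO.items()}


def ipa_to_kokoro(phonemes):
    # Collapse each two-character diphthong into its single-letter symbol.
    for ipa, sym in IPA_TO_KOKORO.items():
        phonemes = phonemes.replace(ipa, sym)
    return phonemes


def kokoro_to_ipa(phonemes):
    # Expand single-letter symbols back to IPA; other characters pass through.
    return "".join(KOKORO_TO_IPA.get(ch, ch) for ch in phonemes)


print(ipa_to_kokoro("həlˈoʊ wˈɜɹld"))
print(ipa_to_kokoro("sˈeɪ"))
print(kokoro_to_ipa("hˈA"))
```

Mapping each diphthong to one character keeps a one-symbol-per-token relationship, which is what makes the character-level encoding in the next section straightforward.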

Kokoro Vocabulary

Encoding/Decoding

kokorog2p.encode(text: str, add_spaces: bool = True, model: str = '1.0') → list[int][source]

Convert a phoneme string to token indices.

Args:

text: Phoneme string to encode.
add_spaces: Whether to include space tokens (default True).
model: Model variant to encode for (default: "1.0").

Returns:

List of token indices.

Example:

>>> encode("hˈɛlO")
[50, 156, 86, 54, 31]

kokorog2p.decode(indices: list[int], skip_special: bool = True, model: str = '1.0') → str[source]

Convert token indices back to a phoneme string.

Args:

indices: List of token indices.
skip_special: Whether to skip padding/unknown tokens.
model: Model variant to decode from (default: "1.0").

Returns:

Phoneme string.

Example:

>>> decode([50, 156, 86, 54, 31])
'hˈɛlO'

kokorog2p.phonemes_to_ids(phonemes: str, model: str = '1.0') → list[int][source]

Convert phoneme string to model input IDs.

This is the main function used to prepare text for the Kokoro model.

Args:

phonemes: Phoneme string from G2P conversion.
model: Model variant to encode for (default: "1.0").

Returns:

List of token IDs ready for model input.

Example:

>>> phonemes_to_ids("hˈɛlO wˈɜɹld!")
[50, 156, 86, 54, 31, 16, 65, 156, 87, 123, 54, 46, 5]

kokorog2p.ids_to_phonemes(ids: list[int], model: str = '1.0') → str[source]

Convert model output IDs back to phoneme string.

Args:

ids: List of token IDs from model.
model: Model variant to decode from (default: "1.0").

Returns:

Phoneme string.

Example:

>>> ids_to_phonemes([50, 156, 86, 54, 31])
'hˈɛlO'
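
The encoding examples above imply a character-level lookup: each phoneme character maps to one integer. The sketch below rebuilds a partial table from the IDs shown in those examples and round-trips a string through it. `TOY_VOCAB`, `toy_encode`, and `toy_decode` are hypothetical, the table covers only the characters that appear above, and skipping unknown characters is an assumption about behaviour, not a documented rule.

```python
# Partial token table, read off the encode()/phonemes_to_ids() examples above.
TOY_VOCAB = {
    "h": 50, "ˈ": 156, "ɛ": 86, "l": 54, "O": 31, " ": 16,
    "w": 65, "ɜ": 87, "ɹ": 123, "d": 46, "!": 5,
}
ID_TO_TOKEN = {idx: tok for tok, idx in TOY_VOCAB.items()}


def toy_encode(phonemes):
    # One ID per character; characters missing from the table are dropped here.
    return [TOY_VOCAB[ch] for ch in phonemes if ch in TOY_VOCAB]


def toy_decode(ids):
    # Inverse lookup; unknown IDs are dropped.
    return "".join(ID_TO_TOKEN.get(i, "") for i in ids)


ids = toy_encode("hˈɛlO wˈɜɹld!")
print(ids)
print(toy_decode(ids))
```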

Validation

kokorog2p.validate_for_kokoro(text: str, model: str = '1.0') → tuple[bool, list[str]][source]

Validate that all characters in text are in Kokoro vocabulary.

Args:

text: Phoneme string to validate.
model: Model variant to validate against:

"1.0": Base multilingual model (default)
"1.1": Chinese-specific model with Zhuyin

Returns:

Tuple of (is_valid, list_of_invalid_chars).

Example:

>>> validate_for_kokoro("hˈɛlO")
(True, [])
>>> validate_for_kokoro("hˈɛlO§")
(False, ['§'])
>>> validate_for_kokoro("ㄋㄧ2ㄏㄠ3", model="1.1")
(True, [])

kokorog2p.filter_for_kokoro(text: str, replacement: str = '', model: str = '1.0') → str[source]

Remove characters not in Kokoro vocabulary.

Args:

text: Phoneme string to filter.
replacement: String to replace invalid characters with.
model: Model variant to filter for (same options as validate_for_kokoro).

Returns:

Filtered phoneme string.

Example:

>>> filter_for_kokoro("hˈɛlO§")
'hˈɛlO'
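
Validation and filtering as documented above both reduce to a membership test against the vocabulary. A minimal sketch, assuming a toy symbol set: `TOY_SYMBOLS`, `toy_validate`, and `toy_filter` are hypothetical, and whether the real function deduplicates or orders the invalid characters is not stated in the source.

```python
# Toy symbol set; the real vocabulary depends on the selected model.
TOY_SYMBOLS = frozenset("hˈɛlO wɜɹd!")


def toy_validate(text, symbols=TOY_SYMBOLS):
    # Collect every character that is not in the symbol set.
    invalid = [ch for ch in text if ch not in symbols]
    return (not invalid, invalid)


def toy_filter(text, replacement="", symbols=TOY_SYMBOLS):
    # Keep known characters, substitute the replacement for the rest.
    return "".join(ch if ch in symbols else replacement for ch in text)


print(toy_validate("hˈɛlO"))
print(toy_validate("hˈɛlO§"))
print(toy_filter("hˈɛlO§"))
```

Running the validator first and filtering only on failure keeps a record of what was dropped, which is useful when diagnosing a backend that emits unexpected symbols.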

Configuration

kokorog2p.get_kokoro_vocab(model: str = '1.0') → dict[str, int]

Get the Kokoro vocabulary mapping (token -> index).

Args:

model: Model variant to load vocab for:

“1.0”: Base multilingual Kokoro model (default)
“1.1”: Chinese-specific Kokoro v1.1 model with Zhuyin

Returns:

Dictionary mapping tokens to their indices.

kokorog2p.get_kokoro_config() → dict

Get the full Kokoro model configuration.

Returns:

Dictionary containing the full model config.

kokorog2p.N_TOKENS: An int constant.

kokorog2p.PAD_IDX: An int constant.

Punctuation

class kokorog2p.Punctuation(marks: str | Pattern = ';:,.!?—…"()“”')[source]

Bases: object

Preserve, remove, or normalize punctuation during phonemization.

This class provides methods to:

1. Normalize Unicode punctuation to Kokoro-compatible marks
2. Remove configured marks
3. Preserve punctuation positions for later restoration

Examples:

>>> punct = Punctuation()

>>> # Normalize Unicode punctuation
>>> punct.normalize("Hello… world！")
'Hello… world!'

>>> # Remove all punctuation
>>> punct.remove("Hello, world!")
'Hello world'

>>> # Preserve and restore
>>> text, marks = punct.preserve("Hello, world!")
>>> text
['Hello', 'world']
>>> # After phonemization...
>>> punct.restore(['həˈloʊ', 'wˈɜːld'], marks)
['həˈloʊ, wˈɜːld!']

__init__(marks: str | Pattern = ';:,.!?—…"()“”')[source]

Initialize punctuation handler.

Args:

marks: Punctuation marks to consider. Either a string of single-character marks or a compiled regex pattern.

static default_marks() → str[source]: Return the default punctuation marks.

static kokoro_marks() → frozenset[str][source]: Return all punctuation marks in Kokoro’s vocabulary.

property marks: str: The punctuation marks as a string.

normalize(text: str) → str[source]

Normalize Unicode punctuation to Kokoro-compatible equivalents.

Args:

text: Input text with various Unicode punctuation.

Returns:

Text with normalized punctuation.

Examples:

>>> punct = Punctuation()
>>> punct.normalize("Hello… world！")
'Hello… world!'
>>> punct.normalize('"Hello," she said.')
'"Hello," she said.'
>>> punct.normalize("Wait...what?!")
'Wait…what?!'
>>> punct.normalize("don't worry")
"don't worry"
>>> punct.normalize("Wait - now")
'Wait — now'
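
The normalize() examples above show three kinds of rewrite: three ASCII dots become an ellipsis, a fullwidth exclamation mark becomes ASCII, and a spaced hyphen becomes a dash. A minimal sketch covering just those three rules follows; `toy_normalize` is hypothetical, and the library's actual rule set is broader than what is shown here.

```python
import re


def toy_normalize(text):
    text = text.replace("...", "…")  # three ASCII dots -> ellipsis character
    text = text.replace("！", "!")   # fullwidth exclamation mark -> ASCII
    # A hyphen with whitespace on both sides is treated as a dash;
    # hyphens inside words are left alone.
    text = re.sub(r"(?<=\s)-(?=\s)", "—", text)
    return text


print(toy_normalize("Wait...what?!"))
print(toy_normalize("Wait - now"))
print(toy_normalize("don't worry"))
```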

remove(text: str | list[str]) → str | list[str][source]

Remove all punctuation marks, replacing with spaces.

Args:

text: Input text or list of texts.

Returns:

Text(s) with punctuation replaced by spaces.

Examples:

>>> punct = Punctuation()
>>> punct.remove("Hello, world!")
'Hello world'
>>> punct.remove(["Hello!", "How are you?"])
['Hello', 'How are you']

preserve(text: str | list[str]) → tuple[list[str], list[MarkIndex]][source]

Extract punctuation from text, preserving positions for restoration.

This splits the text into chunks without punctuation, while recording where each punctuation mark was located.

Args:

text: Input text or list of texts.

Returns:

Tuple of (text_chunks, mark_indices), where:

text_chunks: List of text segments without punctuation
mark_indices: List of MarkIndex objects for restoration

Examples:

>>> punct = Punctuation()
>>> text, marks = punct.preserve('Hello, world!')
>>> text
['Hello', 'world']
>>> [(m.mark, m.position.value) for m in marks]
[(', ', 'I'), ('!', 'E')]

classmethod restore(text: str | list[str], marks: list[MarkIndex], word_sep: str = ' ', strip: bool = True) → list[str][source]

Restore punctuation to phonemized text.

This is the reverse of preserve(). It takes phonemized text chunks and reinserts the punctuation marks at their original positions.

Args:

text: Phonemized text chunks.
marks: Mark indices from preserve().
word_sep: Word separator used in phonemized output.
strip: Whether to strip trailing separators.

Returns:

List of phonemized text with punctuation restored.

Examples:

>>> punct = Punctuation()
>>> text, marks = punct.preserve('Hello, world!')
>>> punct.restore(['həˈloʊ', 'wˈɜːld'], marks)
['həˈloʊ, wˈɜːld!']
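
The preserve/restore round trip can be approximated in a few lines to show why the mark positions must be recorded. This is a simplified sketch, not the class's algorithm: it records a flat layout list instead of MarkIndex objects, returns a single string rather than a list, and `toy_preserve`, `toy_restore`, and the mark pattern are all hypothetical.

```python
import re

# One or more marks, together with any whitespace around them.
_MARKS = re.compile(r"(\s*[;:,.!?—…]+\s*)")


def toy_preserve(text):
    """Split text into punctuation-free chunks plus a layout for restoring."""
    chunks, layout = [], []
    for i, part in enumerate(_MARKS.split(text)):
        if not part:
            continue
        if i % 2 == 0:
            # Even positions are the text between marks.
            chunks.append(part)
            layout.append(None)  # placeholder for a (phonemized) chunk
        else:
            # Odd positions are the captured marks themselves.
            layout.append(part)
    return chunks, layout


def toy_restore(phonemized, layout):
    """Re-insert the recorded marks around the phonemized chunks."""
    chunks = iter(phonemized)
    return "".join(next(chunks) if slot is None else slot for slot in layout)


chunks, layout = toy_preserve("Hello, world!")
print(chunks)
print(toy_restore(["həˈloʊ", "wˈɜːld"], layout))
```

The sketch assumes the phonemizer returns exactly one output per chunk; a mismatch in count is the kind of problem the word mismatch utilities later in this reference are meant to surface.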

kokorog2p.normalize_punctuation(text: str) → str[source]

Normalize Unicode punctuation to Kokoro-compatible equivalents.

This is a convenience function that creates a Punctuation instance and calls normalize().

Args:

text: Input text with various Unicode punctuation.

Returns:

Text with normalized punctuation.

Examples:

>>> normalize_punctuation("Hello… world！")
'Hello… world!'

kokorog2p.filter_punctuation(text: str) → str[source]

Keep only Kokoro-supported punctuation, remove everything else.

Args:

text: Input text.

Returns:

Text with only Kokoro-supported punctuation.

Examples:

>>> filter_punctuation("Hello~world!")
'Hello world!'

kokorog2p.is_kokoro_punctuation(char: str) → bool[source]

Check if a character is a Kokoro-supported punctuation mark.

Args:

char: Single character to check.

Returns:

True if the character is in Kokoro’s punctuation vocabulary.

kokorog2p.KOKORO_PUNCTUATION: Build an immutable unordered collection of unique elements.

Word Mismatch Detection

class kokorog2p.MismatchMode(*values)[source]

Bases: Enum

How to handle word count mismatches.

IGNORE = 'ignore'

WARN = 'warn'

REMOVE = 'remove'

class kokorog2p.MismatchInfo(line_num: int, expected: int, actual: int, input_text: str = '', output_text: str = '')[source]

Bases: object

Information about a word count mismatch.

line_num: int

expected: int

actual: int

input_text: str = ''

output_text: str = ''

class kokorog2p.MismatchStats(total_lines: int, mismatched_lines: int, mismatches: list[MismatchInfo])[source]

Bases: object

Statistics about word count mismatches.

total_lines: int

mismatched_lines: int

mismatches: list[MismatchInfo]

property mismatch_rate: float: Percentage of lines with mismatches.

kokorog2p.detect_mismatches(input_texts: list[str], output_texts: list[str], input_separator: str | Pattern[str] = re.compile('\\s+'), output_separator: str | Pattern[str] = re.compile('\\s+'), store_texts: bool = False) → MismatchStats[source]

Detect word count mismatches between input and output.

Args:

input_texts: Original input texts.
output_texts: Phonemized output texts.
input_separator: Word separator for input.
output_separator: Word separator for output.
store_texts: Whether to store input/output in MismatchInfo.

Returns:

MismatchStats with details about any mismatches.

Raises:

ValueError: If input and output have different lengths.

Examples:

>>> inputs = ["hello world", "one two three"]
>>> outputs = ["həˈloʊ wˈɜːld", "wˈʌn tuː θɹiː fɔːɹ"]  # Extra word!
>>> stats = detect_mismatches(inputs, outputs)
>>> stats.mismatched_lines
1
>>> stats.mismatches[0].line_num
1

kokorog2p.check_word_alignment(input_texts: list[str], output_texts: list[str], mode: MismatchMode | str = MismatchMode.WARN, input_separator: str | Pattern[str] = re.compile('\\s+'), output_separator: str | Pattern[str] = re.compile('\\s+'), logger: Logger | None = None) → tuple[list[str], MismatchStats][source]

Check word alignment between input and output, optionally fixing issues.

This is a convenience function that combines detection and processing.

Args:

input_texts: Original input texts.
output_texts: Phonemized output texts.
mode: How to handle mismatches (ignore, warn, remove).
input_separator: Word separator for input texts.
output_separator: Word separator for output texts.
logger: Logger instance.

Returns:

Tuple of (processed_outputs, statistics).

Examples:

>>> inputs = ["hello world", "good morning"]
>>> outputs = ["həˈloʊ wˈɜːld", "gʊd ˈmɔːnɪŋ ɛkstɹə"]
>>> result, stats = check_word_alignment(inputs, outputs, mode="warn")
>>> stats.mismatched_lines
1

kokorog2p.count_words(text: str, separator: str | Pattern[str] = re.compile('\\s+')) → int[source]

Count the number of words in text.

Args:

text: Text to count words in.
separator: Word separator (string or regex pattern).

Returns:

Number of words.

Examples:

>>> count_words("hello world")
2
>>> count_words("hello  world")  # Multiple spaces
2
>>> count_words("")
0
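
Taken together, the mismatch utilities above amount to counting words on each side and comparing line by line. The sketch below shows that logic under stated assumptions: `toy_count_words` and `toy_detect` are hypothetical, line numbers are zero-based (consistent with the detect_mismatches example), and the rate is computed as a percentage of all lines.

```python
import re

WHITESPACE = re.compile(r"\s+")


def toy_count_words(text, separator=WHITESPACE):
    text = text.strip()
    # Accept either a plain string separator or a compiled pattern.
    if isinstance(separator, str):
        parts = text.split(separator)
    else:
        parts = separator.split(text)
    return len([p for p in parts if p])


def toy_detect(inputs, outputs):
    if len(inputs) != len(outputs):
        raise ValueError("input and output must have the same number of lines")
    mismatches = []
    for line_num, (src, out) in enumerate(zip(inputs, outputs)):
        expected, actual = toy_count_words(src), toy_count_words(out)
        if expected != actual:
            mismatches.append((line_num, expected, actual))
    rate = 100.0 * len(mismatches) / len(inputs) if inputs else 0.0
    return mismatches, rate


inputs = ["hello world", "one two three"]
outputs = ["həˈloʊ wˈɜːld", "wˈʌn tuː θɹiː fɔːɹ"]  # extra word on line 1
print(toy_detect(inputs, outputs))
```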