Core API
This module contains the core functionality of kokorog2p.
Main Functions
- kokorog2p.phonemize(text: str, language: str = 'en-us', *, overrides: list[OverrideSpan] | None = None, return_ids: bool = True, return_phonemes: bool = True, alignment: Literal['span', 'legacy'] = 'span', overlap: Literal['snap', 'strict'] = 'snap', use_normalizer_rules: bool = True, use_espeak_fallback: bool = True, use_goruut_fallback: bool = False, use_cli: bool = False, use_spacy: bool = True, spacy_model: str | None = None, backend: Literal['kokorog2p', 'espeak', 'goruut'] = 'kokorog2p', g2p: G2PBase | None = None) -> PhonemizeResult
Phonemize text using the unified kokorog2p pipeline.
This is the primary public entry point for turning text into phonemes (and optionally tokens or token IDs) in a consistent way.
Internally, this function delegates to the same implementation used by span-based override phonemization (the former phonemize_to_result path), ensuring that:
- The phoneme string returned here is identical to the one in the returned PhonemizeResult.
- Tokenization and character offsets are deterministic and match the phoneme output.
- Kokoro-model vocabulary validation/filtering is applied when producing token IDs (and when necessary to make the phoneme string ID-safe).
- Args:
- text:
Input text to phonemize. This should be plain text (no markup). Punctuation may be normalized (e.g. ... → …, - → —) to match Kokoro-compatible forms.
- language:
Language code (e.g. "en-us", "en-gb", "de", "fr"). Used both for tokenization/alignment and for constructing a default G2P instance when g2p is not provided.
- overrides:
Optional span-based overrides applied by character offsets. Overrides can inject phonemes ({"ph": "…"}) and/or change the language of a span ({"lang": "de"}) for that region.
- return_ids:
Whether to include token IDs in the returned result.
- return_phonemes:
Whether to include the phoneme string in the returned result.
- alignment:
Alignment mode for applying overrides and token offsets:
"span" (default): deterministic offset-based alignment using tokenize().
"legacy": backward-compatible alignment based on the backend's own tokenization. This may differ slightly across backends and languages.
- overlap:
How to handle overrides that partially overlap a token boundary:
"snap" (default): apply to intersecting tokens and emit a warning when boundaries only partially overlap.
"strict": skip partial overlaps and emit a warning.
- use_normalizer_rules:
Whether to apply language normalizer rules when building the internal alignment text used for span mapping.
- use_espeak_fallback:
When constructing a G2P instance for the dictionary-based "kokorog2p" backend, fall back to eSpeak for out-of-vocabulary words. Ignored if g2p is provided.
- use_goruut_fallback:
When constructing a G2P instance for the dictionary-based "kokorog2p" backend, fall back to goruut for out-of-vocabulary words. Ignored if g2p is provided.
- use_spacy:
When constructing a G2P instance, whether to use spaCy for tokenization/POS tagging (English/French and optionally German). For Chinese/Japanese/Korean, this flag is accepted for API consistency but currently does not alter backend behavior. Ignored if g2p is provided.
- spacy_model:
Optional spaCy model package to use when constructing a G2P instance (e.g. "en_core_web_sm", "fr_core_news_md", "de_core_news_md"). If omitted, language defaults are used. For Chinese/Japanese/Korean, accepted for API consistency but not currently used by native backends.
- backend:
When constructing a G2P instance, select the backend: "kokorog2p", "espeak", or "goruut". Ignored if g2p is provided.
- g2p:
Optional pre-created G2P instance to reuse across calls (useful for caching/performance). If provided, this function will use it directly and will NOT call get_g2p() (so backend and the fallback/spaCy construction flags are ignored for this call).
- Returns:
A PhonemizeResult containing tokens, phonemes, token_ids, and warnings (depending on return_* flags).
- Examples:
Basic phonemization:
>>> phonemize("Hello world!", language="en-us").phonemes
'h…'
Token IDs (model-ready):
>>> phonemize("Hello world!").token_ids
[ ... ]
Reusing a cached G2P instance:
>>> g2p = get_g2p(language="en-us")
>>> phonemize("Hello world!", g2p=g2p).phonemes
'h…'
Full traceable result (tokens + warnings):
>>> span = [OverrideSpan(6, 10, {"lang": "de"})]
>>> r = phonemize("Hello Welt!", overrides=span)
>>> r.tokens[1].lang
'de'
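The two overlap modes can be illustrated with a small self-contained sketch. It does not call kokorog2p: the Tok class and select_tokens helper are hypothetical names introduced here, half-open character spans [start, end) are assumed, and the real pipeline may resolve overlaps differently.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    """Hypothetical stand-in for a token with character offsets."""
    text: str
    char_start: int
    char_end: int


def select_tokens(tokens, start, end, overlap="snap"):
    """Return (selected tokens, warnings) for an override span [start, end)."""
    selected, warnings = [], []
    for t in tokens:
        if t.char_end <= start or t.char_start >= end:
            continue  # token does not intersect the span at all
        fully_inside = start <= t.char_start and t.char_end <= end
        if fully_inside:
            selected.append(t)
            continue
        # Partial overlap: both modes warn, only "snap" still applies the override.
        warnings.append(f"span [{start}:{end}) partially overlaps {t.text!r}")
        if overlap == "snap":
            selected.append(t)
    return selected, warnings


tokens = [Tok("Hello", 0, 5), Tok("Welt", 6, 10), Tok("!", 10, 11)]
snap_sel, snap_warn = select_tokens(tokens, 3, 10, overlap="snap")
strict_sel, strict_warn = select_tokens(tokens, 3, 10, overlap="strict")
print([t.text for t in snap_sel], [t.text for t in strict_sel])
```

In this sketch a span starting in the middle of "Hello" is applied to the whole word under "snap" and skipped for that word under "strict"; both modes record one warning.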
- kokorog2p.tokenize(text: str, language: str = 'en-us', *, keep_punct: bool = True) -> list[TokenSpan]
Convert text to a list of tokens with phonemes.
- Args:
text: Input text to convert.
language: Language code (e.g., 'en-us', 'en-gb').
keep_punct: Whether to include punctuation tokens.
- Returns:
List of TokenSpan objects with char offsets.
- Example:
>>> tokens = tokenize("Hello world!", language="en-us")
>>> for t in tokens:
...     print(f"{t.text} [{t.char_start}:{t.char_end}]")
Hello [0:5]
world [6:11]
! [11:12]
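Offset-based tokenization of this kind can be sketched with a regular expression. This is an illustration only, not the library's tokenizer: Span, toy_tokenize, and the pattern are assumptions chosen to reproduce the offsets in the example above.

```python
import re
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for TokenSpan."""
    text: str
    char_start: int
    char_end: int


# A word is a run of word characters; any other non-space character is one token.
_TOKEN_RE = re.compile(r"(?P<word>\w+)|(?P<punct>[^\w\s])")


def toy_tokenize(text: str, keep_punct: bool = True) -> list[Span]:
    spans = []
    for m in _TOKEN_RE.finditer(text):
        if m.lastgroup == "punct" and not keep_punct:
            continue  # drop punctuation tokens on request
        spans.append(Span(m.group(), m.start(), m.end()))
    return spans


offsets = [(s.text, s.char_start, s.char_end) for s in toy_tokenize("Hello world!")]
print(offsets)
```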
- kokorog2p.get_g2p(language: str = 'en-us', use_espeak_fallback: bool = True, use_goruut_fallback: bool = False, use_cli: bool = False, use_spacy: bool = True, backend: Literal['kokorog2p', 'espeak', 'goruut'] = 'kokorog2p', load_silver: bool = True, load_gold: bool = True, version: str = '1.0', phoneme_quotes: str = 'curly', strict: bool = True, spacy_model: str | None = None, **kwargs: Any) -> G2PBase
Get a G2P instance for the specified language.
This factory function returns an appropriate G2P instance based on the language code. Results are cached for efficiency. For mixed-language text, use preprocess_multilang to generate OverrideSpan objects to pass as the overrides argument of phonemize (the former phonemize_to_result path).
- Args:
- language: Language code (e.g., 'en-us', 'en-gb', 'zh', 'ja', 'fr', etc.).
- use_espeak_fallback: Whether to use espeak for out-of-vocabulary words when using the dictionary-based "kokorog2p" backend. Ignored when backend is set to "espeak" (espeak is the primary backend).
- use_goruut_fallback: Whether to use goruut for out-of-vocabulary words when using the dictionary-based "kokorog2p" backend. Ignored when backend is set to "goruut" (goruut is the primary backend).
- use_spacy: Whether to use spaCy for tokenization and POS tagging (applies to English/French and optionally German). Used by the "kokorog2p" backend. For Chinese/Japanese/Korean, this flag is currently accepted for API consistency but not used by their native pipelines.
- spacy_model: Optional spaCy model package name to override the language default (e.g., "en_core_web_sm", "en_core_web_md", "fr_core_news_md", "de_core_news_md"). If None, each language G2P class uses its built-in default. For Chinese/Japanese/Korean, this parameter is accepted for API consistency but currently does not alter backend behavior.
- use_cli: If True, force use of the CLI espeak phonemizer instead of library bindings. Only applies when backend="espeak".
- backend: Phonemization backend to use: "kokorog2p", "espeak", or "goruut". The goruut backend requires pygoruut to be installed.
- load_silver: If True, load silver tier dictionary (~100k extra entries). Defaults to True for backward compatibility and maximum coverage. Set to False to save memory (~22-31 MB) and initialization time. Only applies to English (en-us, en-gb). Other languages reserve this parameter for future use.
- load_gold: If True, load gold tier dictionary (~170k common words). Defaults to True for maximum quality and coverage. Set to False when only the silver tier or no dictionaries are needed. Only applies to languages with dictionaries (English, French, German).
- version: Model version to use. Default: "1.0" (base model).
"1.0": Base model
"1.1": Chinese/English model
Behavior may differ by language. Chinese: "1.0" = IPA output, "1.1" = Zhuyin output.
- phoneme_quotes: Quote character style in phoneme output. Only applies to English currently. Options:
"curly": Use curly quotes (“, ”); default, backward compatible
"ascii": Use ASCII double quotes (")
"none": Remove quote characters from phoneme output
- strict: If True (default), raise exceptions when backend initialization or phonemization fails. If False, log errors and return empty results for backward compatibility with older versions that silently failed. Recommended: True for production use to catch configuration issues.
- **kwargs: Additional arguments passed to the G2P constructor.
- Returns:
A G2PBase instance for the specified language.
- Raises:
ValueError: If the language is not supported and no fallback is available, or if version is not "1.0" or "1.1".
ImportError: If backend="goruut" but pygoruut is not installed.
- Example:
>>> g2p = get_g2p("en-us")
>>> tokens = g2p("Hello world!")
>>> # Disable silver for better performance
>>> g2p_fast = get_g2p("en-us", load_silver=False)
>>> # Ultra-fast initialization with no dictionaries
>>> g2p_minimal = get_g2p("en-us", load_silver=False, load_gold=False)
>>> # Chinese
>>> g2p_zh = get_g2p("zh")
>>> # Japanese
>>> g2p_ja = get_g2p("ja")
>>> # French (uses espeak fallback)
>>> g2p_fr = get_g2p("fr")
>>> # Using goruut backend
>>> g2p_goruut = get_g2p("en-us", backend="goruut")
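The documentation says results are cached but does not say how. One common way to get that behavior is sketched below, assuming functools.lru_cache keyed on the call arguments; ToyG2P and toy_get_g2p are hypothetical stand-ins, not kokorog2p code.

```python
from functools import lru_cache


class ToyG2P:
    """Hypothetical stand-in for an expensive-to-build G2P object."""

    def __init__(self, language: str, load_silver: bool = True) -> None:
        self.language = language
        self.load_silver = load_silver


@lru_cache(maxsize=None)
def toy_get_g2p(language: str = "en-us", load_silver: bool = True) -> ToyG2P:
    # Built once per distinct argument combination, then reused.
    return ToyG2P(language, load_silver)


a = toy_get_g2p("en-us")
b = toy_get_g2p("en-us")
c = toy_get_g2p("en-us", False)
print(a is b, a is c)
```

With a cache like this, repeated calls with the same arguments return the same object, while a different option set (here load_silver=False) builds a separate instance.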
Base Classes
G2PBase
- class kokorog2p.G2PBase(language: str = 'en-us', use_espeak_fallback: bool = True, use_goruut_fallback: bool = False, use_cli: bool = False, strict: bool = True)
Bases: ABC
Abstract base class for grapheme-to-phoneme converters.
Subclasses must implement the abstract __call__ method, which converts text to phonemes, and the abstract lookup method.
- __init__(language: str = 'en-us', use_espeak_fallback: bool = True, use_goruut_fallback: bool = False, use_cli: bool = False, strict: bool = True) -> None
Initialize the G2P converter.
- Args:
language: Language code (e.g., 'en-us', 'en-gb').
use_espeak_fallback: Whether to use espeak for OOV words.
use_goruut_fallback: Whether to use goruut for OOV words.
use_cli: If True, use CLI phonemizer instead of library bindings.
strict: If True, raise exceptions on errors. If False, log warnings and return empty results (backward compatible mode).
- abstractmethod __call__(text: str) -> list[GToken]
Convert text to a list of tokens with phonemes.
- Args:
text: Input text to convert.
- Returns:
List of GToken objects with phonemes assigned.
- phonemize(text: str) -> str
Convert text to a phoneme string.
This is a convenience method that calls __call__ and joins the results.
- Args:
text: Input text to convert.
- Returns:
Phoneme string with word boundaries.
- word_to_phonemes(word: str, tag: str | None = None) -> str | None
Convert a single word to phonemes.
- Args:
word: The word to convert.
tag: Optional POS tag for disambiguation.
- Returns:
Phoneme string or None if conversion failed.
- abstractmethod lookup(word: str, tag: str | None = None) -> str | None
Look up a word in the dictionary.
- Args:
word: The word to look up.
tag: Optional POS tag for disambiguation.
- Returns:
Phoneme string or None if not found.
- add_abbreviation(abbreviation: str, expansion: str | dict[str, str], description: str = '', case_sensitive: bool = False) -> None
Add or update a custom abbreviation (if supported).
- remove_abbreviation(abbreviation: str, case_sensitive: bool = False) -> bool
Remove an abbreviation (if supported).
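The relationship between the abstract methods and the phonemize convenience method can be sketched as follows. The classes are local stand-ins so the snippet runs without kokorog2p installed; Tok, ToyBase, and DictG2P are hypothetical names, and the joining rule (phonemes plus trailing whitespace) is an assumption about how the results are joined.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tok:
    """Hypothetical stand-in for GToken (subset of fields)."""
    text: str
    whitespace: str = " "
    phonemes: Optional[str] = None


class ToyBase(ABC):
    """Stand-in mirroring the abstract interface described above."""

    @abstractmethod
    def __call__(self, text: str) -> list[Tok]: ...

    @abstractmethod
    def lookup(self, word: str, tag: Optional[str] = None) -> Optional[str]: ...

    def phonemize(self, text: str) -> str:
        # Join per-token phonemes, keeping each token's trailing whitespace.
        parts = [(t.phonemes or "") + t.whitespace for t in self(text)]
        return "".join(parts).strip()


class DictG2P(ToyBase):
    # Two-word lexicon; phoneme strings taken from the examples in this page.
    LEXICON = {"hello": "həlˈO", "world": "wˈɜɹld"}

    def lookup(self, word, tag=None):
        return self.LEXICON.get(word.lower())

    def __call__(self, text):
        return [Tok(w, " ", self.lookup(w)) for w in text.split()]


result = DictG2P().phonemize("Hello world")
print(result)
```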
GToken
- class kokorog2p.GToken(text: str, tag: str = '', whitespace: str = ' ', phonemes: str | None = None, start_ts: float | None = None, end_ts: float | None = None, rating: str | None = None, _: dict[str, Any] = <factory>)
Bases: object
A token representing a word or text unit with optional phoneme information.
- Attributes:
text: The original text of the token.
tag: Part-of-speech tag (e.g., 'NN', 'VB', 'JJ').
whitespace: Trailing whitespace after this token.
phonemes: The phonemic transcription of the token.
start_ts: Start timestamp for audio alignment.
end_ts: End timestamp for audio alignment.
rating: Quality rating of the phoneme transcription.
_: Extension dictionary for custom attributes.
- text
The original text of this token.
- phonemes
The IPA phoneme string for this token.
- tag
Part-of-speech tag (if available).
- whitespace
Whitespace following this token.
Phoneme Utilities
Vocabulary
- kokorog2p.get_vocab(british: bool = False) -> frozenset[str]
Get the phoneme vocabulary for a dialect.
- Args:
british: Whether to get British or US vocabulary.
- Returns:
Frozen set of valid phonemes.
- kokorog2p.validate_phonemes(phonemes: str, british: bool = False) -> bool
Check if all phonemes in a string are valid Kokoro phonemes.
- Args:
phonemes: Phoneme string to validate.
british: Whether to validate against British or US vocabulary.
- Returns:
True if all phonemes are valid.
- kokorog2p.US_VOCAB
Type: frozenset.
- kokorog2p.GB_VOCAB
Type: frozenset.
- kokorog2p.VOWELS
Type: frozenset.
- kokorog2p.CONSONANTS
Type: frozenset.
Conversion
- kokorog2p.from_espeak(phonemes: str, british: bool = False) -> str
Convert espeak IPA output to Kokoro phonemes.
- Args:
phonemes: The espeak phoneme string (with tie character ^ or ͡).
british: Whether to use British English mappings.
- Returns:
Kokoro-compatible phoneme string.
- Example:
>>> from_espeak("mˈɜːt͡ʃənt͡ʃˌɪp", british=False)
'mˈɜɹʧəntʃˌɪp'
- kokorog2p.from_goruut(phonemes: str, british: bool = False) -> str
Convert goruut/pygoruut IPA output to Kokoro phonemes.
Goruut outputs standard IPA without tie characters for diphthongs and affricates, which requires different handling than espeak.
- Args:
phonemes: The goruut phoneme string (standard IPA).
british: Whether to use British English mappings.
- Returns:
Kokoro-compatible phoneme string.
- Example:
>>> from_goruut("həlˈoʊ wˈɜɹld", british=False)
'həlˈO wˈɜɹld'
>>> from_goruut("sˈeɪ", british=False)
'sˈA'
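The kind of rewriting shown in the from_goruut examples can be sketched as a plain string substitution. The mapping below contains only the two diphthongs that appear in those examples; the real conversion tables are larger and are not reproduced here, and toy_from_goruut is a hypothetical name.

```python
# Mapping limited to the two diphthongs shown in the examples above (assumption:
# each two-character IPA diphthong is replaced by one Kokoro symbol).
DIPHTHONGS = {"oʊ": "O", "eɪ": "A"}


def toy_from_goruut(phonemes: str) -> str:
    for ipa, kokoro in DIPHTHONGS.items():
        phonemes = phonemes.replace(ipa, kokoro)
    return phonemes


converted = toy_from_goruut("həlˈoʊ wˈɜɹld")
print(converted)
```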
Kokoro Vocabulary
Encoding/Decoding
- kokorog2p.encode(text: str, add_spaces: bool = True, model: str = '1.0') -> list[int]
Convert a phoneme string to token indices.
- Args:
text: Phoneme string to encode.
add_spaces: Whether to include space tokens (default True).
model: Model variant to encode for (default: "1.0").
- Returns:
List of token indices.
- Example:
>>> encode("hˈɛlO")
[50, 156, 86, 54, 31]
- kokorog2p.decode(indices: list[int], skip_special: bool = True, model: str = '1.0') -> str
Convert token indices back to a phoneme string.
- Args:
indices: List of token indices.
skip_special: Whether to skip padding/unknown tokens.
model: Model variant to decode from (default: "1.0").
- Returns:
Phoneme string.
- Example:
>>> decode([50, 156, 86, 54, 31])
'hˈɛlO'
- kokorog2p.phonemes_to_ids(phonemes: str, model: str = '1.0') -> list[int]
Convert phoneme string to model input IDs.
This is the main function used to prepare text for the Kokoro model.
- Args:
phonemes: Phoneme string from G2P conversion.
model: Model variant to encode for (default: "1.0").
- Returns:
List of token IDs ready for model input.
- Example:
>>> phonemes_to_ids("hˈɛlO wˈɜɹld!")
[50, 156, 86, 54, 31, 16, 65, 156, 87, 123, 54, 46, 5]
- kokorog2p.ids_to_phonemes(ids: list[int], model: str = '1.0') -> str
Convert model output IDs back to phoneme string.
- Args:
ids: List of token IDs from model.
model: Model variant to decode from (default: "1.0").
- Returns:
Phoneme string.
- Example:
>>> ids_to_phonemes([50, 156, 86, 54, 31])
'hˈɛlO'
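The encode/decode round trip can be sketched with a per-character lookup table. The table below is only an excerpt reconstructed from the example outputs above, not the full Kokoro vocabulary; toy_encode and toy_decode are hypothetical names, and silently skipping unknown characters is an assumption.

```python
# Excerpt reconstructed from the documented example outputs; not the full vocab.
TOY_VOCAB = {
    "!": 5, " ": 16, "O": 31, "d": 46, "h": 50, "l": 54,
    "w": 65, "ɛ": 86, "ɜ": 87, "ɹ": 123, "ˈ": 156,
}
ID_TO_TOKEN = {index: token for token, index in TOY_VOCAB.items()}


def toy_encode(text: str) -> list[int]:
    # One ID per character; characters outside the table are skipped (assumption).
    return [TOY_VOCAB[ch] for ch in text if ch in TOY_VOCAB]


def toy_decode(ids: list[int]) -> str:
    return "".join(ID_TO_TOKEN.get(i, "") for i in ids)


ids = toy_encode("hˈɛlO wˈɜɹld!")
print(ids)
print(toy_decode(ids))
```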
Validation
- kokorog2p.validate_for_kokoro(text: str, model: str = '1.0') -> tuple[bool, list[str]]
Validate that all characters in text are in Kokoro vocabulary.
- Args:
text: Phoneme string to validate.
model: Model variant to validate against:
"1.0": Base multilingual model (default)
"1.1": Chinese-specific model with Zhuyin
- Returns:
Tuple of (is_valid, list_of_invalid_chars).
- Example:
>>> validate_for_kokoro("hˈɛlO")
(True, [])
>>> validate_for_kokoro("hˈɛlO§")
(False, ['§'])
>>> validate_for_kokoro("ㄋㄧ2ㄏㄠ3", model="1.1")
(True, [])
- kokorog2p.filter_for_kokoro(text: str, replacement: str = '', model: str = '1.0') -> str
Remove characters not in Kokoro vocabulary.
- Args:
text: Phoneme string to filter.
replacement: String to replace invalid characters with.
model: Model variant to filter for (same options as validate_for_kokoro).
- Returns:
Filtered phoneme string.
- Example:
>>> filter_for_kokoro("hˈɛlO§")
'hˈɛlO'
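Validation and filtering both reduce to a membership test against the vocabulary. The sketch below uses a tiny illustrative character set rather than the real Kokoro vocabulary; toy_validate and toy_filter are hypothetical names, and whether the library reports a repeated invalid character once or several times is not stated in the source.

```python
# Tiny illustrative character set; the real vocabulary is much larger.
TOY_VOCAB = set("hˈɛlO wɜɹd!")


def toy_validate(text: str) -> tuple[bool, list[str]]:
    invalid = [ch for ch in text if ch not in TOY_VOCAB]
    return (not invalid, invalid)


def toy_filter(text: str, replacement: str = "") -> str:
    # Keep known characters, substitute the replacement for everything else.
    return "".join(ch if ch in TOY_VOCAB else replacement for ch in text)


print(toy_validate("hˈɛlO§"))
print(toy_filter("hˈɛlO§"))
```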
Configuration
- kokorog2p.get_kokoro_vocab(model: str = '1.0') -> dict[str, int]
Get the Kokoro vocabulary mapping (token -> index).
- Args:
- model: Model variant to load vocab for:
"1.0": Base multilingual Kokoro model (default)
"1.1": Chinese-specific Kokoro v1.1 model with Zhuyin
- Returns:
Dictionary mapping tokens to their indices.
- kokorog2p.get_kokoro_config() -> dict
Get the full Kokoro model configuration.
- Returns:
Dictionary containing the full model config.
- kokorog2p.N_TOKENS
Type: int.
- kokorog2p.PAD_IDX
Type: int.
Punctuation
- class kokorog2p.Punctuation(marks: str | Pattern = ';:,.!?—…"()“”')
Bases: object
Preserve, remove, or normalize punctuation during phonemization.
This class provides methods to:
1. Normalize Unicode punctuation to Kokoro-compatible marks
2. Remove configured marks
3. Preserve punctuation positions for later restoration
- Examples:
>>> punct = Punctuation()
>>> # Normalize Unicode punctuation
>>> punct.normalize("Hello… world!")
'Hello… world!'
>>> # Remove all punctuation
>>> punct.remove("Hello, world!")
'Hello world'
>>> # Preserve and restore
>>> text, marks = punct.preserve("Hello, world!")
>>> text
['Hello', 'world']
>>> # After phonemization…
>>> punct.restore(['həˈloʊ', 'wˈɜːld'], marks)
['həˈloʊ, wˈɜːld!']
- __init__(marks: str | Pattern = ';:,.!?—…"()“”')
Initialize punctuation handler.
- Args:
marks: Punctuation marks to consider. Either a string of single-character marks or a compiled regex pattern.
- normalize(text: str) -> str
Normalize Unicode punctuation to Kokoro-compatible equivalents.
- Args:
text: Input text with various Unicode punctuation.
- Returns:
Text with normalized punctuation.
- Examples:
>>> punct = Punctuation()
>>> punct.normalize("Hello… world!")
'Hello… world!'
>>> punct.normalize('"Hello," she said.')
'"Hello," she said.'
>>> punct.normalize("Wait...what?!")
'Wait…what?!'
>>> punct.normalize("don't worry")
"don't worry"
>>> punct.normalize("Wait - now")
'Wait — now'
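Two of the rewrites visible in these examples (three ASCII dots to an ellipsis, and a spaced hyphen to a dash) can be sketched with plain string replacement. This covers only those two cases and is not the library's normalization table; toy_normalize is a hypothetical name.

```python
ELLIPSIS = "\u2026"  # single-character ellipsis
EM_DASH = "\u2014"   # dash used in place of a spaced hyphen


def toy_normalize(text: str) -> str:
    # Only the two rewrites shown in the examples above are handled here.
    text = text.replace("...", ELLIPSIS)
    text = text.replace(" - ", f" {EM_DASH} ")
    return text


print(toy_normalize("Wait...what?!"))
print(toy_normalize("Wait - now"))
```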
- remove(text: str | list[str]) -> str | list[str]
Remove all punctuation marks, replacing with spaces.
- Args:
text: Input text or list of texts.
- Returns:
Text(s) with punctuation replaced by spaces.
- Examples:
>>> punct = Punctuation()
>>> punct.remove("Hello, world!")
'Hello world'
>>> punct.remove(["Hello!", "How are you?"])
['Hello', 'How are you']
- preserve(text: str | list[str]) -> tuple[list[str], list[MarkIndex]]
Extract punctuation from text, preserving positions for restoration.
This splits the text into chunks without punctuation, while recording where each punctuation mark was located.
- Args:
text: Input text or list of texts.
- Returns:
Tuple of (text_chunks, mark_indices) where:
- text_chunks: List of text segments without punctuation
- mark_indices: List of MarkIndex objects for restoration
- Examples:
>>> punct = Punctuation()
>>> text, marks = punct.preserve('Hello, world!')
>>> text
['Hello', 'world']
>>> [(m.mark, m.position.value) for m in marks]
[(', ', 'I'), ('!', 'E')]
- classmethod restore(text: str | list[str], marks: list[MarkIndex], word_sep: str = ' ', strip: bool = True) -> list[str]
Restore punctuation to phonemized text.
This is the reverse of preserve(). It takes phonemized text chunks and reinserts the punctuation marks at their original positions.
- Args:
text: Phonemized text chunks.
marks: Mark indices from preserve().
word_sep: Word separator used in phonemized output.
strip: Whether to strip trailing separators.
- Returns:
List of phonemized text with punctuation restored.
- Examples:
>>> punct = Punctuation()
>>> text, marks = punct.preserve('Hello, world!')
>>> punct.restore(['həˈloʊ', 'wˈɜːld'], marks)
['həˈloʊ, wˈɜːld!']
- kokorog2p.normalize_punctuation(text: str) -> str
Normalize Unicode punctuation to Kokoro-compatible equivalents.
This is a convenience function that creates a Punctuation instance and calls normalize().
- Args:
text: Input text with various Unicode punctuation.
- Returns:
Text with normalized punctuation.
- Examples:
>>> normalize_punctuation("Hello… world!")
'Hello… world!'
- kokorog2p.filter_punctuation(text: str) -> str
Keep only Kokoro-supported punctuation, remove everything else.
- Args:
text: Input text.
- Returns:
Text with only Kokoro-supported punctuation.
- Examples:
>>> filter_punctuation("Hello~world!")
'Hello world!'
- kokorog2p.is_kokoro_punctuation(char: str) -> bool
Check if a character is a Kokoro-supported punctuation mark.
- Args:
char: Single character to check.
- Returns:
True if the character is in Kokoro’s punctuation vocabulary.
- kokorog2p.KOKORO_PUNCTUATION
Type: frozenset.
Word Mismatch Detection
- class kokorog2p.MismatchMode(*values)
Bases: Enum
How to handle word count mismatches.
- IGNORE = 'ignore'
- WARN = 'warn'
- REMOVE = 'remove'
- class kokorog2p.MismatchInfo(line_num: int, expected: int, actual: int, input_text: str = '', output_text: str = '')
Bases: object
Information about a word count mismatch.
- class kokorog2p.MismatchStats(total_lines: int, mismatched_lines: int, mismatches: list[MismatchInfo])
Bases: object
Statistics about word count mismatches.
- mismatches: list[MismatchInfo]
- kokorog2p.detect_mismatches(input_texts: list[str], output_texts: list[str], input_separator: str | Pattern[str] = re.compile('\\s+'), output_separator: str | Pattern[str] = re.compile('\\s+'), store_texts: bool = False) -> MismatchStats
Detect word count mismatches between input and output.
- Args:
input_texts: Original input texts.
output_texts: Phonemized output texts.
input_separator: Word separator for input.
output_separator: Word separator for output.
store_texts: Whether to store input/output in MismatchInfo.
- Returns:
MismatchStats with details about any mismatches.
- Raises:
ValueError: If input and output have different lengths.
- Examples:
>>> inputs = ["hello world", "one two three"]
>>> outputs = ["həˈloʊ wˈɜːld", "wˈʌn tuː θɹiː fɔːɹ"]  # Extra word!
>>> stats = detect_mismatches(inputs, outputs)
>>> stats.mismatched_lines
1
>>> stats.mismatches[0].line_num
1
- kokorog2p.check_word_alignment(input_texts: list[str], output_texts: list[str], mode: MismatchMode | str = MismatchMode.WARN, input_separator: str | Pattern[str] = re.compile('\\s+'), output_separator: str | Pattern[str] = re.compile('\\s+'), logger: Logger | None = None) -> tuple[list[str], MismatchStats]
Check word alignment between input and output, optionally fixing issues.
This is a convenience function that combines detection and processing.
- Args:
input_texts: Original input texts.
output_texts: Phonemized output texts.
mode: How to handle mismatches (ignore, warn, remove).
input_separator: Word separator for input texts.
output_separator: Word separator for output texts.
logger: Logger instance.
- Returns:
Tuple of (processed_outputs, statistics).
- Examples:
>>> inputs = ["hello world", "good morning"]
>>> outputs = ["həˈloʊ wˈɜːld", "gʊd ˈmɔːnɪŋ ɛkstɹə"]
>>> result, stats = check_word_alignment(inputs, outputs, mode="warn")
>>> stats.mismatched_lines
1
- kokorog2p.count_words(text: str, separator: str | Pattern[str] = re.compile('\\s+')) -> int
Count the number of words in text.
- Args:
text: Text to count words in.
separator: Word separator (string or regex pattern).
- Returns:
Number of words.
- Examples:
>>> count_words("hello world")
2
>>> count_words("hello   world")  # Multiple spaces
2
>>> count_words("")
0
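Word counting and mismatch detection as described in this section can be sketched with re.split. This is an illustration under assumptions, not the library's implementation: toy_count_words and toy_detect are hypothetical names, line numbers are taken to be 0-based (consistent with the detect_mismatches example above), and a string separator is treated as a regular expression.

```python
import re

WS = re.compile(r"\s+")


def toy_count_words(text, separator=WS):
    # Split on the separator and ignore empty pieces (e.g. for an empty string).
    return len([w for w in re.split(separator, text.strip()) if w])


def toy_detect(inputs, outputs):
    """Return (line_num, expected, actual) for every line whose counts differ."""
    if len(inputs) != len(outputs):
        raise ValueError("input and output must have the same number of lines")
    mismatches = []
    for line_num, (src, out) in enumerate(zip(inputs, outputs)):
        expected, actual = toy_count_words(src), toy_count_words(out)
        if expected != actual:
            mismatches.append((line_num, expected, actual))
    return mismatches


inputs = ["hello world", "one two three"]
outputs = ["həˈloʊ wˈɜːld", "wˈʌn tuː θɹiː fɔːɹ"]
mismatches = toy_detect(inputs, outputs)
print(mismatches)
```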