English API

English G2P provides high-quality phoneme conversion for US and British English.

Main Class

class kokorog2p.en.EnglishG2P(language: str = 'en-us', use_espeak_fallback: bool = True, use_goruut_fallback: bool = False, use_cli: bool = False, use_spacy: bool = True, spacy_model: str = 'en_core_web_md', expand_abbreviations: bool = True, enable_context_detection: bool = True, phoneme_quotes: str = 'curly', unk: str = '❓', load_silver: bool = True, load_gold: bool = True, strict: bool = True, version: str = '1.0', **kwargs)[source]

Bases: G2PBase

English G2P converter using dictionary lookup with fallback options.

This class provides grapheme-to-phoneme conversion for English text, using a tiered dictionary system (gold/silver) with espeak-ng or goruut as fallback for out-of-vocabulary words.

Example:

>>> g2p = EnglishG2P(language="en-us")
>>> tokens = g2p("Hello world!")
>>> for token in tokens:
...     print(f"{token.text} -> {token.phonemes}")
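The tiered lookup described above can be pictured with a small standalone sketch. It does not import kokorog2p: the function name `tiered_lookup`, the toy `GOLD` and `SILVER` dictionaries, and the lowercase-key convention are assumptions made for illustration, and the silver-tier phoneme string is a placeholder rather than real dictionary data.

```python
from typing import Callable, Optional

# Toy stand-ins for the gold and silver tiers (hypothetical data).
GOLD = {"hello": "həlˈO"}
SILVER = {"kokoro": "<silver-phonemes>"}


def tiered_lookup(
    word: str,
    gold: dict[str, str],
    silver: dict[str, str],
    fallback: Optional[Callable[[str], str]] = None,
    unk: str = "❓",
) -> tuple[str, str]:
    """Return (phonemes, source), trying gold, then silver, then the fallback."""
    key = word.lower()  # assumption: dictionary keys are lowercase
    if key in gold:
        return gold[key], "gold"
    if key in silver:
        return silver[key], "silver"
    if fallback is not None:
        # Stand-in for an espeak-ng or goruut call on an out-of-vocabulary word.
        return fallback(word), "fallback"
    # No fallback configured: emit the unknown marker.
    return unk, "unk"


print(tiered_lookup("Hello", GOLD, SILVER))
print(tiered_lookup("zzyzx", GOLD, SILVER))
```

Returning the source alongside the phonemes mirrors the kind of provenance that process_with_debug reports, though the real data structures may differ.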

__init__(language: str = 'en-us', use_espeak_fallback: bool = True, use_goruut_fallback: bool = False, use_cli: bool = False, use_spacy: bool = True, spacy_model: str = 'en_core_web_md', expand_abbreviations: bool = True, enable_context_detection: bool = True, phoneme_quotes: str = 'curly', unk: str = '❓', load_silver: bool = True, load_gold: bool = True, strict: bool = True, version: str = '1.0', **kwargs) → None[source]

Initialize the English G2P converter.

Args:

language: Language code (‘en-us’ or ‘en-gb’).
use_espeak_fallback: Whether to use espeak for OOV words.
use_goruut_fallback: Whether to use goruut for OOV words.
use_cli: Whether to use the espeak CLI instead of the library.
use_spacy: Whether to use spaCy for tokenization and POS tagging.
spacy_model: spaCy English model package to load when use_spacy=True (e.g., “en_core_web_sm”, “en_core_web_md”, “en_core_web_lg”).
expand_abbreviations: Whether to expand common abbreviations.
enable_context_detection: Whether to use context-aware abbreviation expansion.
phoneme_quotes: Quote style in phoneme output:
- “curly”: Use directional quotes “ and ” (default)
- “ascii”: Use ASCII double quote "
- “none”: Strip quotes from phoneme output
unk: Character to use for unknown words when fallback is disabled.
load_silver: If True, load the silver tier dictionary (~100k extra entries). Defaults to True for backward compatibility and maximum coverage. Set to False to save memory (~22-31 MB) and initialization time.
load_gold: If True, load the gold tier dictionary (~170k common words). Defaults to True for maximum quality and coverage. Set to False when only the silver tier or no dictionaries are needed.
strict: If True (default), raise exceptions when backend initialization or phonemization fails. If False, log errors and return empty results. This only affects fallback backends (espeak/goruut), not the primary dictionary lookups.
version: Model version (“1.0” for the multilingual model, “1.1” for the Chinese model). Defaults to “1.0”.
**kwargs: Additional arguments for future compatibility.

Raises:

ValueError: If both use_espeak_fallback and use_goruut_fallback are True.
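The mutual-exclusion rule can be sketched as a small standalone check. The helper name `select_fallback` and its return values are hypothetical; only the two flag names, their defaults, and the ValueError condition come from the signature documented here.

```python
from typing import Optional


def select_fallback(
    use_espeak_fallback: bool = True,
    use_goruut_fallback: bool = False,
) -> Optional[str]:
    """Return the name of the single enabled fallback, or None if both are off."""
    if use_espeak_fallback and use_goruut_fallback:
        # Same condition the constructor is documented to reject.
        raise ValueError(
            "Enable at most one of use_espeak_fallback and use_goruut_fallback."
        )
    if use_espeak_fallback:
        return "espeak"
    if use_goruut_fallback:
        return "goruut"
    return None


print(select_fallback())
print(select_fallback(use_espeak_fallback=False, use_goruut_fallback=True))
```

Because use_espeak_fallback defaults to True, selecting goruut means passing use_espeak_fallback=False together with use_goruut_fallback=True.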

__call__(text: str) → list[GToken][source]

Convert text to a list of tokens with phonemes.

Args:

text: Input text to convert.

Returns:

List of GToken objects with phonemes assigned.

phonemize(text: str) → str

Convert text to a phoneme string.

This is a convenience method that calls __call__ and joins the results.

Args:

text: Input text to convert.

Returns:

Phoneme string with word boundaries.

lookup(word: str, tag: str | None = None) → str | None[source]

Look up a word in the dictionary.

Args:

word: The word to look up.
tag: Optional POS tag for disambiguation.

Returns:

Phoneme string or None if not found.
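One way to picture tag-based disambiguation is a dictionary whose entries are either a plain phoneme string or a mapping from POS tag to phonemes. This is a standalone sketch, not the library's storage format: the `ENTRIES` table, the "DEFAULT" key, and the phoneme strings for "read" are illustrative assumptions.

```python
from typing import Optional, Union

# Hypothetical entries: a single pronunciation, or one pronunciation per POS tag.
ENTRIES: dict[str, Union[str, dict[str, str]]] = {
    "hello": "həlˈO",
    "read": {"DEFAULT": "ɹˈid", "VBD": "ɹˈɛd"},
}


def lookup(word: str, tag: Optional[str] = None) -> Optional[str]:
    """Return phonemes for word, using tag to pick among variants when present."""
    entry = ENTRIES.get(word.lower())
    if entry is None:
        return None  # not in the dictionary
    if isinstance(entry, str):
        return entry  # only one pronunciation, tag is irrelevant
    # Tag-specific variant if one exists, otherwise the default pronunciation.
    return entry.get(tag, entry["DEFAULT"])


print(lookup("read", tag="VB"))
print(lookup("read", tag="VBD"))
```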

property fallback: EspeakFallback | GoruutFallback | None: Lazily initialize the appropriate fallback.

property nlp: object: Lazily initialize spaCy with custom tokenizer rules for contractions.

property normalizer: EnglishNormalizer: Lazily initialize the English text normalizer.

property regex_tokenizer: RegexTokenizer: Lazily initialize the regex tokenizer.

property spacy_tokenizer: SpacyTokenizer: Lazily initialize the spaCy tokenizer.
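The properties above are described as lazily initialized. A minimal standalone sketch of that pattern, using `functools.cached_property` from the standard library, is shown here. The class `LazyComponents`, the `load_count` counter, and the dict stand-in are hypothetical, and the library may implement laziness differently (for example with a private attribute and a None check).

```python
from functools import cached_property


class LazyComponents:
    """Build an expensive helper on first access and reuse it afterwards."""

    def __init__(self) -> None:
        self.load_count = 0  # counts how often the expensive build actually runs

    @cached_property
    def normalizer(self) -> dict:
        # Stand-in for constructing a heavy object such as a text normalizer.
        self.load_count += 1
        return {"name": "normalizer"}


components = LazyComponents()
print(components.load_count)   # nothing built yet
first = components.normalizer  # built here
second = components.normalizer # reused, not rebuilt
print(components.load_count, first is second)
```

The practical effect is that constructing the converter stays cheap, and components such as the spaCy pipeline are only loaded if a call actually needs them.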

process_with_debug(text: str) → ProcessedText[source]

Process text with full debugging information.

This method provides detailed provenance tracking showing:
- All normalization steps applied
- Token positions in original text
- Phoneme source (gold/silver/espeak/etc.) for each token
- Quote nesting depths

Args:

text: Input text to process

Returns:

ProcessedText object with full debugging information

Example:

>>> g2p = EnglishG2P()
>>> result = g2p.process_with_debug("I'm here")
>>> print(result.render_debug())

add_abbreviation(abbreviation: str, expansion: str | dict[str, str], description: str = '', case_sensitive: bool = False) → None[source]

Add or update a custom abbreviation.

This method allows users to add custom abbreviations or override existing ones. Changes persist across all uses of this G2P instance and affect the singleton abbreviation expander (shared across all instances).

Args:

abbreviation: The abbreviation string (e.g., “Dr.”, “Tech.”).
expansion: Either a simple string expansion or a dict mapping context names to expansions. For a dict, use context names like “default”, “title”, “place”, “time”, “academic”, “religious”.
description: Optional description of the abbreviation.
case_sensitive: Whether matching should be case-sensitive.

Examples:

>>> g2p = get_g2p("en-us")
>>> # Simple expansion
>>> g2p.add_abbreviation("Tech.", "Technology")
>>> # Context-aware expansion
>>> g2p.add_abbreviation(
...     "Dr.",
...     {"default": "Drive", "title": "Doctor"},
...     "Doctor or Drive (context-dependent)"
... )
>>> g2p.phonemize("I live on Main Dr.")
'aɪ lˈɪv ɒn mˈeɪn dɹˈaɪv.'
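The two expansion forms accepted by add_abbreviation (a plain string, or a dict keyed by context name) can be sketched without the library. The `REGISTRY` dict and the `expand` helper are hypothetical, and the context is passed in explicitly here, whereas the library is described as detecting it from the surrounding text.

```python
from typing import Union

# Hypothetical registry mirroring the two forms shown in the examples above.
REGISTRY: dict[str, Union[str, dict[str, str]]] = {
    "Tech.": "Technology",
    "Dr.": {"default": "Drive", "title": "Doctor"},
}


def expand(abbreviation: str, context: str = "default") -> str:
    """Expand an abbreviation, preferring the context-specific form if present."""
    entry = REGISTRY.get(abbreviation)
    if entry is None:
        return abbreviation  # unknown abbreviation: leave the text unchanged
    if isinstance(entry, str):
        return entry  # simple expansion ignores the context
    # Context-aware expansion, falling back to the "default" entry.
    return entry.get(context, entry.get("default", abbreviation))


print(expand("Dr.", context="title"))
print(expand("Dr."))
print(expand("Tech.", context="place"))
```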

remove_abbreviation(abbreviation: str, case_sensitive: bool = False) → bool[source]

Remove an abbreviation.

Args:

abbreviation: The abbreviation to remove (e.g., “Dr.”).
case_sensitive: Whether to match case-sensitively.

Returns:

True if the abbreviation was found and removed, False otherwise

Example:

>>> g2p = get_g2p("en-us")
>>> g2p.remove_abbreviation("Dr.")
True
>>> # Now "Dr." won't be expanded
>>> g2p.phonemize("Dr. Smith")
'd r. smˈɪθ'

has_abbreviation(abbreviation: str, case_sensitive: bool = False) → bool[source]

Check if an abbreviation exists.

Args:

abbreviation: The abbreviation to check (e.g., “Dr.”).
case_sensitive: Whether to match case-sensitively.

Returns:

True if the abbreviation exists, False otherwise

Example:

>>> g2p = get_g2p("en-us")
>>> g2p.has_abbreviation("Dr.")
True

list_abbreviations() → list[str][source]

Get a list of all registered abbreviations.

Returns:

List of abbreviation strings

Example:

>>> g2p = get_g2p("en-us")
>>> abbrevs = g2p.list_abbreviations()
>>> "Dr." in abbrevs
True

get_target_model() → str[source]

Get the target Kokoro model variant for this G2P instance.

Returns:

Model identifier: version string (“1.1” or “1.0”).

Lexicon

kokorog2p.en.EnglishLexicon: alias of Lexicon

Number Conversion

Converter Class

Bases: object

Convert numbers to their word representations.

This class handles various number formats including:
- Cardinal numbers (1, 2, 3 -> one, two, three)
- Ordinal numbers (1st, 2nd -> first, second)
- Years (1984 -> nineteen eighty-four)
- Decimals (3.14 -> three point one four)
- Currency ($12.50 -> twelve dollars and fifty cents)

Initialize the number converter.

Args:

lookup_fn: Function to look up words in the lexicon.
stem_s_fn: Function to add -s suffix to words.

convert(word: str, currency: str | None = None, is_head: bool = True, num_flags: set | None = None) → tuple[str | None, int | None][source]

Convert a number to its word representation.

Args:

word: The number string to convert.
currency: Optional currency symbol (e.g., ‘$’, ‘£’).
is_head: Whether this is the first word in a phrase.
num_flags: Optional flags for number formatting.

Returns:

Tuple of (phonemes, rating) or (None, None) if conversion failed.
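The year and decimal readings listed for this class can be illustrated with a small pure-Python sketch that produces words rather than phonemes. It is not the converter's implementation (which is documented as lazily importing num2words): the helper names are hypothetical, and the sketch only covers years from 1000 to 1999 and whole parts below 100.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def two_digits(n: int) -> str:
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def year_to_words(year: int) -> str:
    """Read a year from 1000..1999 as two pairs of digits."""
    high, low = divmod(year, 100)
    if low == 0:
        return f"{two_digits(high)} hundred"
    if low < 10:
        return f"{two_digits(high)} oh {ONES[low]}"
    return f"{two_digits(high)} {two_digits(low)}"


def decimal_to_words(text: str) -> str:
    """Read the whole part as a number and the fraction digit by digit."""
    whole, frac = text.split(".")
    digits = " ".join(ONES[int(d)] for d in frac)
    return f"{two_digits(int(whole))} point {digits}"


print(year_to_words(1984))       # nineteen eighty-four
print(decimal_to_words("3.14"))  # three point one four
```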

property num2words: Callable: Lazily import num2words.

append_currency(phonemes: str, currency: str | None) → str[source]

Append currency word to phonemes.

Args:

phonemes: The phoneme string.
currency: Currency symbol.

Returns:

Phonemes with currency word appended.

Helper Functions

kokorog2p.en.numbers.is_digit(text: str) → bool[source]: Check if text consists only of digits.

kokorog2p.en.numbers.is_currency_amount(word: str) → bool[source]

Check if word looks like a currency amount (e.g., ‘12.99’, ‘30,000.10’).

Rules:
- Optional thousands separators, but only valid grouping (1,234,567)
- Optional decimal part (.<digits>)
- Reject invalid grouping like ‘12,34’ or ‘1,23,456’
- Allow leading-decimal amounts like ‘.50’
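The rules above map naturally onto a regular expression. The following is a standalone sketch under that assumption, not the library's source: the name `looks_like_currency_amount` is hypothetical, and edge cases the rules do not mention (such as a trailing dot in ‘12.’) are rejected here by choice.

```python
import re

# Integer part: either correctly grouped thousands (1,234,567) or plain digits.
# Both the integer part and the decimal part are optional on their own.
_AMOUNT_RE = re.compile(r"(?:\d{1,3}(?:,\d{3})+|\d+)?(?:\.\d+)?")


def looks_like_currency_amount(word: str) -> bool:
    """Return True if word matches the grouping and decimal rules listed above."""
    # The empty string would match two optional groups, so reject it explicitly.
    return bool(word) and _AMOUNT_RE.fullmatch(word) is not None


for sample in ["12.99", "30,000.10", ".50", "12,34", "1,23,456"]:
    print(sample, looks_like_currency_amount(sample))
```

Using `fullmatch` rather than `match` is what rejects ‘12,34’: the plain-digits branch matches ‘12’, but the leftover ‘,34’ makes the full match fail.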

Constants

kokorog2p.en.numbers.ORDINALS: frozenset constant (an immutable set of unique elements).

kokorog2p.en.numbers.CURRENCIES: dict constant.

Examples

Basic Usage

from kokorog2p.en import EnglishG2P

# US English
g2p = EnglishG2P(language="en-us")
tokens = g2p("Hello world!")

for token in tokens:
    print(f"{token.text} -> {token.phonemes}")

# British English
g2p_gb = EnglishG2P(language="en-gb")
tokens = g2p_gb("Hello world!")

spaCy Model Selection

English G2P uses spaCy for POS tagging when use_spacy=True. You can choose the spaCy English model with spacy_model:

from kokorog2p.en import EnglishG2P

# Default model (recommended balance)
g2p_md = EnglishG2P(use_spacy=True, spacy_model="en_core_web_md")

# Smaller model
g2p_sm = EnglishG2P(use_spacy=True, spacy_model="en_core_web_sm")

# Larger model
g2p_lg = EnglishG2P(use_spacy=True, spacy_model="en_core_web_lg")

Dictionary Lookup

from kokorog2p.en import EnglishLexicon

lexicon = EnglishLexicon(language="en-us")

# Simple lookup
phonemes = lexicon.lookup("hello")
print(phonemes)  # həlˈO

# POS-aware lookup
read_present = lexicon.lookup("read", tag="VB")
read_past = lexicon.lookup("read", tag="VBD")

Number Expansion

from kokorog2p.en import EnglishG2P

# Numbers are automatically expanded during G2P processing
g2p = EnglishG2P(language="en-us")
tokens = g2p("I have $42.50 and 3 cats.")

for token in tokens:
    print(f"{token.text} -> {token.phonemes}")
# → I -> aɪ
# → have -> hæv
# → forty-two dollars and fifty cents -> ...
# → and -> ænd
# → three -> θɹi
# → cats -> kæts

Punctuation Normalization

English G2P automatically normalizes punctuation variants:

from kokorog2p.en import EnglishG2P

g2p = EnglishG2P(language="en-us")

# Apostrophe variants (all normalize to ')
g2p("don’t")    # Right single quote (’)
g2p("don't")    # Apostrophe (')
g2p("don`t")    # Grave accent (`)
g2p("don´t")    # Acute accent (´)

# Ellipsis variants (all normalize to …)
g2p("Wait...")       # Three dots
g2p("Wait. . .")     # Spaced dots
g2p("Wait…")         # Ellipsis character

# Dash variants (all normalize to — when spaced)
g2p("Wait - now")    # Hyphen with spaces
g2p("Wait -- now")   # Double hyphen
g2p("Wait – now")    # En dash
g2p("Wait — now")    # Em dash
g2p("Wait ― now")    # Horizontal bar
g2p("Wait ‒ now")    # Figure dash
g2p("Wait − now")    # Minus sign

# Compound words keep hyphens (then removed in output)
g2p("well-known")         # Hyphen joins words
g2p("state-of-the-art")   # Multiple hyphens

Normalized Characters:

Apostrophes: ' ' ' ` ´ ʹ ′ ＇ → '
Ellipsis: ... . . . .. .... → …
Dashes (when spaced): - -- – ― ‒ − → —
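The three normalization families listed above can be approximated with a few regular expressions. This is a standalone sketch, not the library's normalizer: the function name `normalize_punctuation` and the exact patterns are assumptions, and the apostrophe set only covers the variants that are unambiguous in the list above.

```python
import re

# Apostrophe look-alikes: right single quote, grave, acute, modifier prime,
# prime, fullwidth apostrophe.
_APOSTROPHES = "\u2019\u0060\u00b4\u02b9\u2032\uff07"
_APOSTROPHE_RE = re.compile(f"[{_APOSTROPHES}]")

# Two or more dots, optionally separated by single spaces ("...", ". . .").
_ELLIPSIS_RE = re.compile(r"\.(?: ?\.)+")

# Hyphen, double hyphen, en dash, horizontal bar, figure dash, or minus sign,
# but only when surrounded by whitespace so compounds keep their hyphens.
_SPACED_DASH_RE = re.compile(r"(?<=\s)(?:--|[-\u2013\u2015\u2012\u2212])(?=\s)")


def normalize_punctuation(text: str) -> str:
    """Map punctuation variants onto one canonical character per family."""
    text = _APOSTROPHE_RE.sub("'", text)
    text = _ELLIPSIS_RE.sub("\u2026", text)     # horizontal ellipsis
    text = _SPACED_DASH_RE.sub("\u2014", text)  # em dash
    return text


print(normalize_punctuation("don\u2019t"))
print(normalize_punctuation("Wait. . ."))
print(normalize_punctuation("Wait -- now"))
print(normalize_punctuation("well-known"))
```

The lookbehind and lookahead on whitespace are what keep "well-known" intact while "Wait - now" is rewritten.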