English API
===========

English G2P provides high-quality phoneme conversion for US and British English.

Main Class
----------

.. autoclass:: kokorog2p.en.EnglishG2P
   :members:
   :undoc-members:
   :show-inheritance:

   .. automethod:: __init__
   .. automethod:: __call__
   .. automethod:: phonemize
   .. automethod:: lookup

Lexicon
-------

.. autoclass:: kokorog2p.en.EnglishLexicon
   :members:
   :undoc-members:
   :show-inheritance:

   .. automethod:: __init__
   .. automethod:: lookup
   .. automethod:: is_known
   .. automethod:: __len__

Number Conversion
-----------------

Converter Class
~~~~~~~~~~~~~~~

.. autoclass:: kokorog2p.en.numbers.NumberConverter
   :members:
   :undoc-members:

   .. automethod:: __init__
   .. automethod:: convert

Helper Functions
~~~~~~~~~~~~~~~~

.. autofunction:: kokorog2p.en.numbers.is_digit

.. autofunction:: kokorog2p.en.numbers.is_currency_amount

Constants
~~~~~~~~~

.. autodata:: kokorog2p.en.numbers.ORDINALS
   :annotation:

.. autodata:: kokorog2p.en.numbers.CURRENCIES
   :annotation:

Examples
--------

Basic Usage
~~~~~~~~~~~

.. code-block:: python

   from kokorog2p.en import EnglishG2P

   # US English
   g2p = EnglishG2P(language="en-us")
   tokens = g2p("Hello world!")

   for token in tokens:
       print(f"{token.text} -> {token.phonemes}")

   # British English
   g2p_gb = EnglishG2P(language="en-gb")
   tokens = g2p_gb("Hello world!")

spaCy Model Selection
~~~~~~~~~~~~~~~~~~~~~

English G2P uses spaCy for POS tagging when ``use_spacy=True``. You can choose
the spaCy English model with ``spacy_model``:

.. code-block:: python

   from kokorog2p.en import EnglishG2P

   # Default model (recommended balance)
   g2p_md = EnglishG2P(use_spacy=True, spacy_model="en_core_web_md")

   # Smaller model
   g2p_sm = EnglishG2P(use_spacy=True, spacy_model="en_core_web_sm")

   # Larger model
   g2p_lg = EnglishG2P(use_spacy=True, spacy_model="en_core_web_lg")

Dictionary Lookup
~~~~~~~~~~~~~~~~~

.. code-block:: python

   from kokorog2p.en import EnglishLexicon

   lexicon = EnglishLexicon(language="en-us")

   # Simple lookup
   phonemes = lexicon.lookup("hello")
   print(phonemes)  # həlˈO

   # POS-aware lookup
   read_present = lexicon.lookup("read", tag="VB")
   read_past = lexicon.lookup("read", tag="VBD")

Number Expansion
~~~~~~~~~~~~~~~~

.. code-block:: python

   from kokorog2p.en import EnglishG2P

   # Numbers are automatically expanded during G2P processing
   g2p = EnglishG2P(language="en-us")
   tokens = g2p("I have $42.50 and 3 cats.")

   for token in tokens:
       print(f"{token.text} -> {token.phonemes}")

   # → I -> aɪ
   # → have -> hæv
   # → forty-two dollars and fifty cents -> ...
   # → and -> ænd
   # → three -> θɹi
   # → cats -> kæts

Punctuation Normalization
~~~~~~~~~~~~~~~~~~~~~~~~~

English G2P automatically normalizes punctuation variants:

.. code-block:: python

   from kokorog2p.en import EnglishG2P

   g2p = EnglishG2P(language="en-us")

   # Apostrophe variants (all normalize to ')
   g2p("don’t")  # Right single quote (’)
   g2p("don't")  # Apostrophe (')
   g2p("don`t")  # Grave accent (`)
   g2p("don´t")  # Acute accent (´)

   # Ellipsis variants (all normalize to …)
   g2p("Wait...")    # Three dots
   g2p("Wait. . .")  # Spaced dots
   g2p("Wait…")      # Ellipsis character

   # Dash variants (all normalize to — when spaced)
   g2p("Wait - now")   # Hyphen with spaces
   g2p("Wait -- now")  # Double hyphen
   g2p("Wait – now")   # En dash
   g2p("Wait — now")   # Em dash
   g2p("Wait ― now")   # Horizontal bar
   g2p("Wait ‒ now")   # Figure dash
   g2p("Wait − now")   # Minus sign

   # Compound words keep hyphens (then removed in output)
   g2p("well-known")        # Hyphen joins words
   g2p("state-of-the-art")  # Multiple hyphens

**Normalized Characters:**

* Apostrophes: ``'`` ``’`` ``'`` ````` ``´`` ``ʹ`` ``′`` ``'`` → ``'``
* Ellipsis: ``...`` ``. . .`` ``..`` ``....`` → ``…``
* Dashes (when spaced): ``-`` ``--`` ``–`` ``―`` ``‒`` ``−`` → ``—``
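The normalization rules listed above can be approximated in plain Python. The following is only an illustrative sketch and not kokorog2p's actual implementation: the function name ``normalize_punctuation``, the character tables, and the regular expressions are all assumptions made for this example. It may help when pre-checking how input text is likely to be treated.

```python
import re

# Hypothetical tables for illustration only; not kokorog2p's internal data.
# Right single quote, grave accent, acute accent, modifier prime, prime.
APOSTROPHE_VARIANTS = "\u2019\u0060\u00b4\u02b9\u2032"
APOSTROPHE_TABLE = str.maketrans({ch: "'" for ch in APOSTROPHE_VARIANTS})

# Two or more dots, optionally separated by single spaces ("..", "...", ". . .").
ELLIPSIS_RE = re.compile(r"\.(?: ?\.)+")

# A dash-like token with whitespace on both sides: "--", hyphen, en dash,
# horizontal bar, figure dash, minus sign, or em dash.
SPACED_DASH_RE = re.compile(
    r"(?<=\s)(?:--|[-\u2013\u2015\u2012\u2212\u2014])(?=\s)"
)


def normalize_punctuation(text: str) -> str:
    """Map apostrophe, ellipsis, and spaced-dash variants to one form each."""
    text = text.translate(APOSTROPHE_TABLE)    # variants -> '
    text = ELLIPSIS_RE.sub("\u2026", text)     # dot runs -> ellipsis character
    text = SPACED_DASH_RE.sub("\u2014", text)  # spaced dashes -> em dash
    return text


if __name__ == "__main__":
    for sample in ["don\u2019t", "Wait. . .", "Wait -- now", "well-known"]:
        print(repr(sample), "->", repr(normalize_punctuation(sample)))
```

Because the dash pattern requires whitespace on both sides, hyphens inside compound words such as ``well-known`` are left untouched in this sketch, which mirrors the "when spaced" condition described above.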