Abbreviation Customization Guide
Overview
The kokorog2p library provides a flexible abbreviation expansion system that allows you to customize which abbreviations are expanded and how. This is particularly useful when:
You want to disable specific abbreviations
You need to add custom abbreviations for domain-specific terms
You want to change how an abbreviation expands (e.g., always expand “Dr.” to “Drive” instead of “Doctor”)
You need context-aware expansions (e.g., “St.” → “Street” vs “Saint”)
Quick Start
from kokorog2p import get_g2p
# Get a G2P instance
g2p = get_g2p("en-us")
# Remove an abbreviation
g2p.remove_abbreviation("Dr.")
# Add a custom abbreviation
g2p.add_abbreviation("Dr.", "Drive")
# Test it
print(g2p.phonemize("I live on Main Dr."))
# → 'I live on Main Drive' (phonemized)
API Reference
add_abbreviation()
- add_abbreviation(abbreviation, expansion, description='', case_sensitive=False)
Add or update an abbreviation.
- Parameters:
abbreviation (str) – The abbreviation string (e.g., “Dr.”, “Tech.”)
expansion (str or dict) – Either a simple string expansion or a dict for context-aware expansion
description (str) – Description of the abbreviation (optional)
case_sensitive (bool) – Whether matching should be case-sensitive (optional)
Examples:
# Simple expansion
g2p.add_abbreviation("Tech.", "Technology")
# Context-aware expansion
g2p.add_abbreviation(
    "Dr.",
    {
        "default": "Drive",
        "title": "Doctor"
    },
    "Doctor or Drive (context-dependent)"
)
Available contexts:
default: Default expansion when context is unknown
title: Title/honorific context (e.g., “Dr. Smith”)
place: Place name context (e.g., “123 Main Dr.”)
time: Time-related context (e.g., “3 P.M.”)
academic: Academic degree context (e.g., “Ph.D.”)
religious: Religious context (e.g., “St. Peter”)
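The fallback from a specific context to default can be pictured with a small standalone sketch. It does not call kokorog2p; the helper name resolve_expansion is hypothetical, and the rule that a missing context falls back to "default" is an assumption based on the description of the default context above.

```python
from typing import Mapping, Union

# An expansion is either a plain string or a context -> string mapping.
Expansion = Union[str, Mapping[str, str]]


def resolve_expansion(expansion: Expansion, context: str = "default") -> str:
    """Pick the expansion for a context (hypothetical helper, not library code)."""
    # A plain string expands the same way in every context.
    if isinstance(expansion, str):
        return expansion
    # Use the context-specific entry when one is defined.
    if context in expansion:
        return expansion[context]
    # Otherwise fall back to the "default" entry (assumed behavior).
    return expansion["default"]


dr = {"default": "Drive", "title": "Doctor"}
print(resolve_expansion(dr, "title"))   # Doctor
print(resolve_expansion(dr, "place"))   # Drive (no "place" entry, so default)
```

Under this reading, a dict only needs entries for the contexts that differ from the default.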
Note
The “St.” abbreviation uses an advanced multi-signal detection algorithm:
Priority 1: Saint/city name recognition (23 names: peter, paul, john, mary, patrick, francis, joseph, michael, george, luke, mark, matthew, thomas, james, anthony, andrew, louis, petersburg, augustine, helena, cloud, albans, andrews)
Priority 2: House number pattern within 30 characters (e.g., “123 Main”)
Priority 3: Defaults to “Saint” for unknown names
Examples:
# Street context (house number pattern)
g2p.phonemize("123 Main St.")
# → "123 Main Street"
# Saint context (name recognized)
g2p.phonemize("St. Patrick's Day")
# → "Saint Patrick's Day"
# City context (name recognized)
g2p.phonemize("Visit St. Louis")
# → "Visit Saint Louis"
# Distant number ignored
g2p.phonemize("Born in 1850, St. Peter was influential")
# → "Born in 1850, Saint Peter was influential"
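The three priorities can be sketched in plain Python as follows. This is an illustration of the rules as described in the note, not the library's actual implementation: the function name expand_st, the regular expression used for the house number pattern, and the choice to look only at the text before “St.” are all assumptions.

```python
import re

# The 23 saint and city names listed in the note above.
KNOWN_NAMES = {
    "peter", "paul", "john", "mary", "patrick", "francis", "joseph",
    "michael", "george", "luke", "mark", "matthew", "thomas", "james",
    "anthony", "andrew", "louis", "petersburg", "augustine", "helena",
    "cloud", "albans", "andrews",
}

# Assumption: a house number pattern is digits followed by a capitalized word.
HOUSE_NUMBER = re.compile(r"\b\d+\s+[A-Z][a-z]+")
WINDOW = 30  # characters to inspect before "St.", per the note


def expand_st(text: str, pos: int) -> str:
    """Guess the expansion for the "St." starting at index pos (sketch only)."""
    # Priority 1: the following word is a recognized saint or city name.
    following = re.match(r"\s*([A-Za-z]+)", text[pos + len("St."):])
    if following and following.group(1).lower() in KNOWN_NAMES:
        return "Saint"
    # Priority 2: a house number pattern appears within the preceding window.
    if HOUSE_NUMBER.search(text[max(0, pos - WINDOW):pos]):
        return "Street"
    # Priority 3: unknown names default to "Saint".
    return "Saint"


for sample in ["123 Main St.", "St. Patrick's Day", "Visit St. Louis"]:
    print(sample, "->", expand_st(sample, sample.index("St.")))
```

Because name recognition is checked first, a nearby number cannot override a recognized name such as “St. Peter”.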
remove_abbreviation()
- remove_abbreviation(abbreviation, case_sensitive=False)
Remove an abbreviation.
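- Parameters:
abbreviation (str) – The abbreviation to remove (e.g., “Dr.”)
case_sensitive (bool) – Whether matching should be case-sensitive (optional)
- Returns:
True if the abbreviation was found and removed, False otherwise
- Return type:
bool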
Example:
g2p.remove_abbreviation("Dr.")   # Returns True
g2p.remove_abbreviation("Xyz.")  # Returns False (doesn't exist)
has_abbreviation()
- has_abbreviation(abbreviation, case_sensitive=False)
Check if an abbreviation exists.
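- Parameters:
abbreviation (str) – The abbreviation to look up (e.g., “Dr.”)
case_sensitive (bool) – Whether matching should be case-sensitive (optional)
- Returns:
True if the abbreviation exists, False otherwise
- Return type:
bool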
Example:
if g2p.has_abbreviation("Dr."):
    print("Dr. abbreviation exists")
list_abbreviations()
Common Use Cases
1. Disable an Abbreviation
If you don’t want “Dr.” to be expanded at all:
g2p = get_g2p("en-us")
g2p.remove_abbreviation("Dr.")
# Now "Dr." will be treated as unknown text
text = "Dr. Smith"
# "Dr." won't be expanded to "Doctor"
2. Replace an Abbreviation
Replace “Dr.” so it always expands to “Drive”:
g2p = get_g2p("en-us")
# Remove the original
g2p.remove_abbreviation("Dr.")
# Add new expansion
g2p.add_abbreviation("Dr.", "Drive")
# Test
print(g2p.phonemize("I live on Main Dr."))
# "Dr." → "Drive"
3. Add Domain-Specific Abbreviations
Add abbreviations for your specific domain:
g2p = get_g2p("en-us")
# Add technical abbreviations
g2p.add_abbreviation("API", "Application Programming Interface")
g2p.add_abbreviation("ML", "Machine Learning")
g2p.add_abbreviation("GPU", "Graphics Processing Unit")
# Use them
text = "The API uses ML on the GPU."
print(g2p.phonemize(text))
4. Context-Aware Abbreviations
Create abbreviations that expand differently based on context:
g2p = get_g2p("en-us")
# "Av." can mean "Avenue" or "Average"
g2p.add_abbreviation(
    "Av.",
    {
        "default": "Average",
        "place": "Avenue"
    }
)
# In address context
print(g2p.phonemize("123 Park Av."))
# → "123 Park Avenue"
# In other context
print(g2p.phonemize("The av. is 50."))
# → "The average is 50."
5. Batch Customization
Customize multiple abbreviations at once:
g2p = get_g2p("en-us")
# Remove unwanted abbreviations
for abbr in ["Dr.", "Mr.", "Mrs.", "Ms."]:
    g2p.remove_abbreviation(abbr)
# Add custom ones
custom_abbrevs = {
    "Tech.": "Technology",
    "Corp.": "Corporation",
    "Dept.": "Department"
}
for abbr, expansion in custom_abbrevs.items():
    g2p.add_abbreviation(abbr, expansion)
Persistence
Changes to abbreviations persist across get_g2p() calls because they modify the singleton abbreviation expander:
# First instance
g2p1 = get_g2p("en-us")
g2p1.add_abbreviation("Custom.", "Customized")
# Second instance (same configuration)
g2p2 = get_g2p("en-us")
print(g2p2.has_abbreviation("Custom.")) # True
To reset, use reset_abbreviations():
from kokorog2p import reset_abbreviations
reset_abbreviations() # Reset abbreviation expanders
Note
clear_cache() only clears cached G2P instances; it does not reset
abbreviation expanders. reset_abbreviations() resets expanders and
clears cached G2P instances.
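The difference between the two calls can be pictured with a toy model. The ToyRegistry class below is hypothetical and does not reflect kokorog2p internals; it only assumes, as the note states, that cached instances and the shared expander are separate pieces of state.

```python
class ToyRegistry:
    """Toy model of a shared expander plus a cache of G2P-like instances."""

    DEFAULTS = {"Dr.": "Doctor"}

    def __init__(self):
        self.expander = dict(self.DEFAULTS)  # shared, singleton-like state
        self.instances = {}                  # cache keyed by language

    def get(self, lang):
        # Every instance for a language shares the same expander object.
        if lang not in self.instances:
            self.instances[lang] = {"lang": lang, "expander": self.expander}
        return self.instances[lang]

    def clear_cache(self):
        # Drops cached instances only; customizations survive.
        self.instances.clear()

    def reset_abbreviations(self):
        # Restores the defaults and also drops cached instances.
        self.expander = dict(self.DEFAULTS)
        self.instances.clear()


registry = ToyRegistry()
registry.get("en-us")["expander"]["Custom."] = "Customized"
registry.clear_cache()
print("Custom." in registry.get("en-us")["expander"])  # True
registry.reset_abbreviations()
print("Custom." in registry.get("en-us")["expander"])  # False
```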
Advanced: Working with the Expander Directly
You can also work directly with the abbreviation expander:
from kokorog2p.en.abbreviations import get_expander
expander = get_expander()
# Get abbreviation details
entry = expander.get_abbreviation("Dr.")
print(entry.abbreviation) # "Dr."
print(entry.expansion) # "Doctor" or "Drive"
print(entry.context_expansions) # Context-specific expansions
print(entry.description) # Description
Notes
Case Sensitivity: By default, abbreviations are case-insensitive. Use case_sensitive=True if you need exact matching.
Singleton Behavior: The abbreviation expander is a singleton, so changes affect all G2P instances using the same language.
Context Detection: Context-aware expansions require enable_context_detection=True (default) when creating the G2P instance.
Order Matters: When removing and adding the same abbreviation, make sure to remove first, then add.
Example Script
See examples/abbreviation_customization.py for a complete working example demonstrating all features.
Troubleshooting
Q: My custom abbreviation isn’t being expanded.
A: Check:
Did you enable abbreviation expansion? (expand_abbreviations=True is the default)
Is the abbreviation properly formatted with punctuation?
Use has_abbreviation() to verify it was added
Q: Changes don’t persist after restarting.
A: Abbreviation customizations are in-memory only. If you need persistent customizations, add them at startup or create a configuration system.
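One way to apply customizations at startup is to keep them in a small JSON document and replay it against the G2P instance. The sketch below is only a suggestion: the function name apply_abbreviation_config and the JSON layout with "remove" and "add" keys are hypothetical. It relies only on the remove_abbreviation() and add_abbreviation() methods documented above.

```python
import json


def apply_abbreviation_config(g2p, config_text: str) -> int:
    """Replay a JSON config onto a G2P instance; return how many were added.

    Assumed (hypothetical) layout:
    {"remove": ["Dr."], "add": {"Tech.": "Technology"}}
    """
    config = json.loads(config_text)
    # Remove first, then add, so replacements of existing entries take effect.
    for abbr in config.get("remove", []):
        g2p.remove_abbreviation(abbr)
    additions = config.get("add", {})
    for abbr, expansion in additions.items():
        g2p.add_abbreviation(abbr, expansion)
    return len(additions)
```

Calling apply_abbreviation_config(get_g2p("en-us"), config_text) once at program start would then restore the same customizations on every run.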
Q: Context-aware expansion isn’t working.
A: Make sure enable_context_detection=True when creating the G2P instance (it’s the default).
See Also
English Abbreviations Source - Default abbreviations
Abbreviation Pipeline - Base framework
Examples - Working examples