API Reference
=============

This page provides detailed documentation for the LAPA-NG Python API, a system for phonetic transcription of Dutch text using rule-based pattern matching.

Core Components
---------------

The LAPA-NG system is built around several core components that work together to perform phonetic transcription:

Translator
~~~~~~~~~~

The ``Translator`` protocol defines the interface for translating words into phonemes. It provides a single method:

.. code-block:: python

   def translate(self, word: WordOrWordList, *, emit: EmitValue | None) -> Generator[TranslationResult, None, None]

The translator can process either a single word or a list of words, and can emit results at different granularities (word, rule, or phoneme level).

Matcher
~~~~~~~

The ``Matcher`` protocol defines how to match substrings of words against rules and return corresponding phonetic transcriptions:

.. code-block:: python

   def match(self, word: Word, start: int) -> Generator[MatchResult, None, None]

Implementations
~~~~~~~~~~~~~~~

The system provides several concrete implementations:

- ``MatchingTranslator``: A translator that uses a matcher to translate words into phonemes
- ``CachedTranslator``: A translator wrapper that caches results to improve performance

Data Types
~~~~~~~~~~

The system uses several key data types:

- ``Word``: Represents a single word with optional attributes
- ``Phoneme``: Represents a single phoneme with various representations (SAMPA, IPA)
- ``MatchResult``: Contains the result of a successful rule match
- ``TranslationResult``: Contains the final translation result for a word

Rule Systems
------------

LAPA-NG supports two complementary rule systems for phonetic transcription:

Table Rules
~~~~~~~~~~~

The table rules system provides a user-friendly way to define phonetic transcription rules using tabular data (e.g., Excel spreadsheets). This system is designed to be more accessible to linguists and phoneticians who may not be familiar with regular expressions.
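
To make the idea of tabular rules concrete, the following is a minimal, self-contained sketch that models a few rule rows as plain records, orders them by priority, and flags duplicate priorities. It is not the LAPA-NG implementation: ``TabularRuleSketch``, ``sort_by_priority`` and ``duplicate_priorities`` are hypothetical names, the rows are invented, and it assumes that lower numbers sort first and that duplicates are checked per letter. The field names follow the ``TabularRule`` fields documented in this section.

.. code-block:: python

   from collections import Counter
   from dataclasses import dataclass
   from enum import Enum


   class RuleClass(Enum):
       VOWEL = "VOWEL"
       CONSONANT = "CONSONANT"
       PREFIX = "PREFIX"


   @dataclass(frozen=True)
   class TabularRuleSketch:
       # Hypothetical stand-in holding a subset of the documented fields.
       rule_id: str
       rule_class: RuleClass
       letter: str
       priority: int
       description: str


   def sort_by_priority(rules):
       """Return the rules ordered by ascending numeric priority (assumed order)."""
       return sorted(rules, key=lambda r: r.priority)


   def duplicate_priorities(rules):
       """Return the (letter, priority) pairs shared by more than one rule."""
       counts = Counter((r.letter, r.priority) for r in rules)
       return sorted(key for key, n in counts.items() if n > 1)


   # Invented example rows, not taken from any real rule table.
   rules = [
       TabularRuleSketch("A2", RuleClass.VOWEL, "a", 20, "invented example row"),
       TabularRuleSketch("B1", RuleClass.CONSONANT, "b", 10, "invented example row"),
       TabularRuleSketch("A1", RuleClass.VOWEL, "a", 20, "invented example row"),
   ]

   ordered_ids = [r.rule_id for r in sort_by_priority(rules)]
   clashes = duplicate_priorities(rules)

Because Python's ``sorted`` is stable, rules with equal priority keep their original relative order, which is one reason a duplicate-priority check is useful: without it, the effective order of such rules depends on their position in the table.
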

Key Components:

- ``TabularRule``: Represents a single rule from a tabular data source

  - ``rule_id``: Unique identifier for the rule
  - ``rule_class``: Type of rule (VOWEL, CONSONANT, or PREFIX)
  - ``letter``: Initial letter that the rule applies to
  - ``priority``: Priority value for rule ordering
  - ``description``: Human-readable description of the rule
  - ``rule``: The rule pattern or definition
  - ``replaced``: Letter sequence to be replaced
  - ``replaceby``: Replacement letter sequence

- ``RuleClass``: Enumeration of rule types

  - ``VOWEL``: Rules for vowel sounds
  - ``CONSONANT``: Rules for consonant sounds
  - ``PREFIX``: Rules for prefix patterns

The table rules system provides utilities for:

- Loading rules from tabular data sources
- Converting tabular rules to regex specifications
- Sorting rules by priority
- Checking for duplicate priorities
- Creating matchers from tabular rules

Regex Rules
~~~~~~~~~~~

The regex rules system provides a more powerful and flexible way to define phonetic transcription rules using regular expressions. This system is used internally to implement the actual matching logic.
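
As a rough illustration of regex-based matching, the sketch below tries an ordered list of rules at a given position in a word and returns the first one that matches. It is an assumption-laden sketch, not LAPA-NG's matcher: ``RegexRuleSketch`` and ``first_match`` are hypothetical names, and the two rules and their replacement strings are invented placeholders rather than real transcription rules.

.. code-block:: python

   import re
   from dataclasses import dataclass, field


   @dataclass(frozen=True)
   class RegexRuleSketch:
       # Hypothetical stand-in for a regex-based rule.
       id: str
       pattern: re.Pattern
       replacement: str
       meta: dict = field(default_factory=dict)


   def first_match(rules, word, start):
       """Try the rules in order at ``start``.

       Returns (rule id, matched text, replacement) for the first rule
       whose pattern matches at that position, or None if none does.
       """
       for rule in rules:
           m = rule.pattern.match(word, start)
           if m:
               return rule.id, m.group(0), rule.replacement
       return None


   # Invented rules: the longer pattern is listed first so it wins.
   rules = [
       RegexRuleSketch("oo-long", re.compile(r"oo"), "o:"),
       RegexRuleSketch("o-short", re.compile(r"o"), "O"),
   ]

   long_hit = first_match(rules, "voor", 1)
   short_hit = first_match(rules, "vol", 1)
   no_hit = first_match(rules, "val", 1)

Trying rules in a fixed order is what makes rule ordering and priority matter: if the single-letter rule came first, it would shadow the two-letter rule.
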

Key Components:

- ``RegexRuleSpec``: Specification for a regular expression based rule

  - ``id``: Unique identifier for the rule
  - ``pattern``: Compiled regular expression pattern
  - ``replacement``: Phonetic replacement string
  - ``meta``: Additional metadata about the rule

- ``RegexMatcher``: A matcher that uses regular expressions for pattern matching

  - Optimized for prefix rules and match group extraction
  - Supports character classes for common patterns
  - Includes caching for improved performance

- ``RegexListMatcher``: An optimized list matcher for regex-based rules

  - Uses caching and filtering based on the first letter
  - Reduces the number of rules that need to be attempted
  - Maintains rule ordering and priority

The regex rules system provides functions for:

- Loading rule specifications from YAML files
- Creating matchers from rule specifications
- Optimizing rule matching performance

Factory Pattern
---------------

LAPA-NG uses a factory pattern to create different types of matchers based on a specification string. This provides a flexible and consistent way to create matchers for different use cases.

Matcher Specification
~~~~~~~~~~~~~~~~~~~~~

The matcher specification follows the format:

.. code-block:: text

   [prefix:][filename[#sheet]][?options]

Where:

- ``prefix``: Optional prefix indicating the type of matcher ('ng' or 'classic')
- ``filename``: Path to the rules file (Excel or YAML)
- ``sheet``: Optional sheet name for Excel files
- ``options``: Optional query string parameters (e.g., ?sort=numeric)

Available options for the 'ng' prefix:

- ``sort``: Rule sorting method ('numeric' or 'alpha')

  - ``numeric``: Sort rules by numeric priority (default)
  - ``alpha``: Sort rules alphabetically by letter and priority

Examples:

.. code-block:: python

   # Next-gen matcher with specific sheet and numeric sorting (default)
   matcher = create_matcher('ng:rules.xlsx#RULES')

   # Next-gen matcher with alpha sorting
   matcher = create_matcher('ng:rules.xlsx#RULES?sort=alpha')

   # Classic matcher, default sheet
   matcher = create_matcher('classic:rules.xlsx')

   # Next-gen matcher (default prefix)
   matcher = create_matcher('rules.xlsx#RULES')

Factory Functions
~~~~~~~~~~~~~~~~~

The factory module provides the following functions:

.. py:function:: create_matcher(matcher_spec: str) -> Matcher

   Create a matcher based on a specification string.

   Args:
       matcher_spec: Specification string in format '[prefix:][filename[#sheet]][?options]'

   Returns:
       A Matcher instance configured according to the specification

   Raises:
       ValueError: If the prefix is unknown, the specification is invalid, or an invalid sort option is provided

.. py:function:: parse_matcher_spec(matcher_spec: str) -> MatcherSpec

   Parse a matcher specification string into its components.

   Args:
       matcher_spec: The specification string to parse

   Returns:
       A MatcherSpec object containing the parsed components

   Raises:
       ValueError: If the specification string is invalid

.. py:class:: MatcherSpec

   A dataclass representing a parsed matcher specification.

   Attributes:
       prefix: The matcher prefix ('ng' or 'classic')

       filename: Path to the rules file

       section: Optional sheet name

       options: Optional query string parameters

   Properties:
       qs: Dictionary of parsed query string parameters

       qs_flat: Simplified dictionary with single values for each parameter

Command-Line Interface
----------------------

LAPA-NG provides a command-line interface for common operations:

Converting Rules
~~~~~~~~~~~~~~~~

Convert Excel-based rules to YAML format:

.. code-block:: bash

   lapa-ng convert-excel rules.xlsx rules.yaml

   # Optional: specify a particular sheet
   lapa-ng convert-excel rules.xlsx rules.yaml --sheet "RULES"

This command reads rules from an Excel file and converts them to YAML format, which can be used directly with the regex rules system.

Transcribing Text
~~~~~~~~~~~~~~~~~

Transcribe words from the command line:

.. code-block:: bash

   lapa-ng translate-words 'rules.xlsx#RULES' word1 word2 word3

This command transcribes one or more words using the specified rules and outputs the phonetic transcription in SAMPA format.

Processing NAF Files
~~~~~~~~~~~~~~~~~~~~

Process text from NAF (NLP Annotation Framework) files:

.. code-block:: bash

   lapa-ng translate-naf 'rules.xlsx#RULES' input.naf

This command reads text from a NAF file, transcribes it using the specified rules, and outputs the results in CSV format with detailed information about each transcription, including:

- Word ID and text
- Position in the word
- Matched pattern
- Phoneme in SAMPA format
- Rule ID used
- Number of rules attempted

Testing
~~~~~~~

Run the test suite:

.. code-block:: bash

   lapa-ng test

This command runs the test suite to verify the system is working correctly.

Usage Example
-------------

Here's a complete example of how the components work together:

1. Define rules in an Excel spreadsheet with columns for:

   - Rule ID
   - Rule class (VOWEL, CONSONANT, PREFIX)
   - Letter
   - Priority
   - Description
   - Rule pattern
   - Replacement pattern

2. Convert the rules to YAML format:

   .. code-block:: bash

      lapa-ng convert-excel rules.xlsx rules.yaml

3. Use the rules to transcribe text:

   .. code-block:: bash

      lapa-ng translate-words 'rules.xlsx#RULES' "voorbeeld" "taal"
      # Output: v r o n d @ r b @ l t a l

4. Process a NAF file:

   .. code-block:: bash

      lapa-ng translate-naf 'rules.xlsx#RULES' document.naf > transcriptions.csv

The system will:

1. Load and validate the rules
2. Convert them to an optimized regex-based format
3. Process the input text
4. Apply the rules in the correct order
5. Output the phonetic transcriptions
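
The commands above pass specification strings such as ``'rules.xlsx#RULES'``, which follow the ``[prefix:][filename[#sheet]][?options]`` format described under Matcher Specification. The sketch below shows one way such a string could be split into its parts using only the standard library. It is an illustration under stated assumptions, not LAPA-NG's ``parse_matcher_spec``: ``SpecSketch`` and ``parse_spec_sketch`` are hypothetical names, and the sketch does not handle file paths that themselves contain a colon.

.. code-block:: python

   from dataclasses import dataclass
   from typing import Optional
   from urllib.parse import parse_qs

   KNOWN_PREFIXES = {"ng", "classic"}


   @dataclass(frozen=True)
   class SpecSketch:
       # Hypothetical container for the parsed parts of a spec string.
       prefix: str
       filename: str
       sheet: Optional[str]
       options: dict


   def parse_spec_sketch(spec, default_prefix="ng"):
       """Split '[prefix:][filename[#sheet]][?options]' into its parts."""
       # Peel off '?options' first, then '#sheet', then 'prefix:'.
       rest, _, query = spec.partition("?")
       rest, _, sheet = rest.partition("#")
       prefix, sep, filename = rest.partition(":")
       if not sep:
           # No explicit prefix: fall back to the default.
           prefix, filename = default_prefix, rest
       elif prefix not in KNOWN_PREFIXES:
           raise ValueError(f"unknown prefix: {prefix!r}")
       # parse_qs returns a list per key; keep the last value of each.
       options = {key: values[-1] for key, values in parse_qs(query).items()}
       return SpecSketch(prefix, filename, sheet or None, options)


   full = parse_spec_sketch("ng:rules.xlsx#RULES?sort=alpha")
   bare = parse_spec_sketch("rules.xlsx#RULES")
   classic = parse_spec_sketch("classic:rules.xlsx")

Splitting from the right-hand side of the format first (options, then sheet, then prefix) keeps each step a single ``str.partition`` call and avoids a hand-written regular expression.
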