简体   繁体   中英

Regular expression to parse word structure

I'm trying to build my first non-trivial regular expression (for use in Python), but struggling.

Let us assume that a word in language X (NOT English) is a sequence of minimal 'structures'. Each 'structure' could be:

An independent vowel (basically one letter of the alphabet)
A consonant (one letter of the alphabet)
A consonant followed by a right-attaching vowel
A left-attaching vowel followed by a consonant
(Certain left-attaching vowels) followed by a consonant followed by (certain right-attaching vowels)

For example this word of 3 characters:

<a consonant><a left-attaching vowel><an independent vowel>

is not a valid word, and should not match the regex, because there is no consonant to the right of the left-attaching vowel.

I know all the Unicode ranges - the Unicode ranges for consonants, independent vowels, left-attaching vowels and so on.

Here is what I have so far:

WordPattern = (
ur'('
ur'[\u0985-\u0994]|'
ur'[\u0995-\u09B9]|'
ur'[\u0995-\u09B9(\u09BE|[\u09C0-\u09C4])]|'
ur'[(\u09BF|\u09C7|\u09C8)\u0995-\u09B9]|'
ur'[(\u09BF|\u09C7|\u09C8)\u0995-\u09B9(\u09BE|[\u09C0-\u09C4])]'
ur')+'
)

It's not working. Apart from getting it to work, I have three specific problems:

  • I need to split the regular expression over multiple lines, or else the code is going to look terrible. How do I do this?
  • I would like to use string substitution / templates of some sort to 'name' the Unicode ranges, for code readability and to prevent typing Unicode ranges multiple times.
  • (This seems very difficult) The list of permissible minimal 'structures' will have to be extended later. Is there any way to set up a sort of 'loop' mechanism within a regex, so that it works for all permissible structures in a list?

Any help would be appreciated. This seems very complex to a beginner!

The appropriate tool for morphological analysis of languages with non-trivial morphology is "finite state transducers". There are robust implementations that you can track down and use (one by Xerox Parc). There's one that has python bindings (for using as an external library). Google it.

FSTs are based on finite-state automata, like (pure) regular expressions, but they are by no means a drop-in replacement. It's complex machinery, so if your goals are simple (eg, syllabification for purposes of hyphenation) you may want to look for something simpler. There are machine-learning algorithms that will "learn" hyphenation, for example. If you are indeed interested in morphological analysis, you have to make the effort to look at FSTs.

Now for your algorithm, in case you really only need a trivial implementation: Since any vowel or consonant could be independent, your rules are ambiguous: They allow "ab" to be parsed as "ab". Such ambiguities mean that a regexp approach will probably never work, but you may get better results if you put the longer regexps first, so they are used in preference to the short ones when both would apply. But really you need to build a parser (by hand or using a module) and try different things in steps. It's backwards from what you imagined: Set up a loop that uses different regexps, and "consumes" the string in steps.

However, it seems to me that what you are describing is essentially syllabification. And the near-universal rule of syllabification is this: A syllable consists of a core vowel, plus as many preceding ("onset") consonants as the rules of the language allow, plus any following consonants that cannot belong to the next syllable. The rule is called "maximize onset", and it has the consequence that it's easier to parse your syllables backwards (from the end of the word). Try it out.

PS. You probably know this, but if you put the following as the second line in your scripts you can embed Bengali in your regexps:

# -*- coding: utf-8 -*-
  • I need to split the regular expression over multiple lines, or else the code is going to look terrible. How do I do this?

Use the re.VERBOSE flag when compiling the regex.

pattern = re.compile(r"""(
                            [\u0985-\u0994]  # comment to explain what this is
                          | [\u0995-\u09B9]
                          # etc.
                         )
                      """, re.VERBOSE)
  • I would like to use string substitution / templates of some sort to 'name' the Unicode ranges

You can construct an RE from ordinary Python strings:

>>> subpatterns = {"vowel": "[aeiou]", "consonant": "[^aeiou]"}
>>> "{consonant}{vowel}+{consonant}*".format(**subpatterns)
'[^aeiou][aeiou]+[^aeiou]*'
  • The list of permissible minimal 'structures' will have to be extended later. Is there any way to set up a sort of 'loop' mechanism within a regex, so that it works for all permissible structures in a list?

I'm not sure if I get what you mean, but... suppose you have a list of (uncompiled) REs, say, patterns , then you can compute their union with

re.compile("(%s)" % "|".join(patterns))

Be careful with special characters when constructing REs this way and use re.escape where necessary.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM