Regular expression for parsing word structure

Question

I'm trying to build my first non-trivial regular expression (for use in Python), but I'm struggling.

Let us assume that a word in language X (NOT English) is a sequence of minimal 'structures'. Each 'structure' could be:

An independent vowel (basically one letter of the alphabet)
A consonant (one letter of the alphabet)
A consonant followed by a right-attaching vowel
A left-attaching vowel followed by a consonant
(Certain left-attaching vowels) followed by a consonant followed by (certain right-attaching vowels)

For example, this word of 3 characters:

<a consonant><a left-attaching vowel><an independent vowel>

is not a valid word, and should not match the regex, because there is no consonant to the right of the left-attaching vowel.

I know all the Unicode ranges: the ranges for consonants, independent vowels, left-attaching vowels and so on.

Here is what I have so far:

WordPattern = (
ur'('
ur'[\u0985-\u0994]|'
ur'[\u0995-\u09B9]|'
ur'[\u0995-\u09B9(\u09BE|[\u09C0-\u09C4])]|'
ur'[(\u09BF|\u09C7|\u09C8)\u0995-\u09B9]|'
ur'[(\u09BF|\u09C7|\u09C8)\u0995-\u09B9(\u09BE|[\u09C0-\u09C4])]'
ur')+'
)

It's not working. Apart from getting it to work, I have three specific problems:

I need to split the regular expression over multiple lines, or else the code is going to look terrible. How do I do this?
I would like to use string substitution / templates of some sort to 'name' the Unicode ranges, for code readability and to avoid typing the Unicode ranges multiple times.
(This seems very difficult) The list of permissible minimal 'structures' will have to be extended later. Is there any way to set up a sort of 'loop' mechanism within a regex, so that it works for all permissible structures in a list?

Any help would be appreciated. This seems very complex to a beginner!

Answer 1

The appropriate tool for morphological analysis of languages with non-trivial morphology is "finite state transducers" (FSTs). There are robust implementations that you can track down and use (one by Xerox PARC). There's one that has Python bindings (for use as an external library). Google it.

FSTs are based on finite-state automata, like (pure) regular expressions, but they are by no means a drop-in replacement. It's complex machinery, so if your goals are simple (e.g., syllabification for purposes of hyphenation) you may want to look for something simpler. There are machine-learning algorithms that will "learn" hyphenation, for example. If you are indeed interested in morphological analysis, you have to make the effort to look at FSTs.

Now for your algorithm, in case you really only need a trivial implementation: since any vowel or consonant could be independent, your rules are ambiguous. They allow "ab" to be parsed either as a single structure or as two separate ones ("a" followed by "b"). Such ambiguities mean that a regexp approach will probably never work reliably, but you may get better results if you put the longer regexps first, so that they are used in preference to the short ones when both would apply. But really you need to build a parser (by hand or using a module) and try different things in steps. It's backwards from what you imagined: set up a loop that uses different regexps and "consumes" the string in steps.
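A minimal sketch of such a loop, assuming Python 3 and the character ranges from the question (the names `STRUCTURES` and `split_structures` are made up for illustration). It tries the longer structures first and gives up when nothing matches at the current position; it does not backtrack, so it is only a starting point:

```python
import re

# Character classes built from the ranges given in the question.
INDEPENDENT_VOWEL = "[\u0985-\u0994]"
CONSONANT = "[\u0995-\u09B9]"
RIGHT_VOWEL = "[\u09BE\u09C0-\u09C4]"
LEFT_VOWEL = "[\u09BF\u09C7\u09C8]"

# Longest structures first, so they are preferred over shorter ones.
STRUCTURES = [re.compile(p) for p in (
    LEFT_VOWEL + CONSONANT + RIGHT_VOWEL,
    LEFT_VOWEL + CONSONANT,
    CONSONANT + RIGHT_VOWEL,
    CONSONANT,
    INDEPENDENT_VOWEL,
)]


def split_structures(word):
    """Return the list of structures in word, or None if it cannot be consumed."""
    pos, pieces = 0, []
    while pos < len(word):
        for pattern in STRUCTURES:
            m = pattern.match(word, pos)  # anchored at pos
            if m:
                pieces.append(m.group())
                pos = m.end()
                break
        else:
            return None  # no structure matches at this position
    return pieces
```

With these assumptions, the invalid three-character example from the question (consonant, left-attaching vowel, independent vowel) makes the function return None, because nothing can consume the left-attaching vowel.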

However, it seems to me that what you are describing is essentially syllabification. And the near-universal rule of syllabification is this: a syllable consists of a core vowel, plus as many preceding ("onset") consonants as the rules of the language allow, plus any following consonants that cannot belong to the next syllable. The rule is called "maximize onset", and it has the consequence that it's easier to parse your syllables backwards (from the end of the word). Try it out.

PS. You probably know this, but if you put the following as the second line in your scripts you can embed Bengali in your regexps:
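To make "maximize onset" concrete, here is a toy sketch that scans a word from the end. Everything language-specific in it is an assumption for illustration only: the Latin vowel set, the tiny `LEGAL_ONSETS` table, and the function name `syllabify` are invented, and a real language would need its own tables:

```python
# Toy data: NOT a description of any real language.
VOWELS = set("aeiou")
LEGAL_ONSETS = set("bcdfghjklmnpqrstvwxyz") | {"st", "tr", "str", "pl", "pr"}


def syllabify(word, vowels=VOWELS, onsets=LEGAL_ONSETS):
    """Split word into syllables, scanning backwards and maximizing onsets."""
    syllables = []
    end = len(word)      # exclusive end of the syllable being built
    i = len(word) - 1
    while i >= 0:
        # Skip trailing consonants: they become the coda of this syllable.
        while i >= 0 and word[i] not in vowels:
            i -= 1
        if i < 0:
            break        # only consonants left at the start of the word
        # word[i] is the core vowel; grow the onset leftwards while it stays legal.
        start = i
        while (start > 0 and word[start - 1] not in vowels
               and word[start - 1:i] in onsets):
            start -= 1
        syllables.append(word[start:end])
        end = start
        i = start - 1
    if end > 0:
        # Leftover leading consonants: attach them to the first syllable.
        if syllables:
            syllables[-1] = word[:end] + syllables[-1]
        else:
            syllables.append(word[:end])
    syllables.reverse()
    return syllables
```

Under these toy tables, "winter" splits as "win" + "ter" because "nt" is not listed as a legal onset, while "astra" splits as "a" + "stra" because "str" is.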

# -*- coding: utf-8 -*-

Answer 2

I need to split the regular expression over multiple lines, or else the code is going to look terrible. How do I do this?

Use the re.VERBOSE flag when compiling the regex.

import re

# Python 3.3+: re understands \uXXXX in a raw string; on Python 2 use ur"""...""".
pattern = re.compile(r"""(
                            [\u0985-\u0994]  # comment to explain what this is
                          | [\u0995-\u09B9]
                          # etc.
                         )
                      """, re.VERBOSE)

I would like to use string substitution / templates of some sort to 'name' the Unicode ranges

You can construct an RE from ordinary Python strings:

>>> subpatterns = {"vowel": "[aeiou]", "consonant": "[^aeiou]"}
>>> "{consonant}{vowel}+{consonant}*".format(**subpatterns)
'[^aeiou][aeiou]+[^aeiou]*'
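Combining the two ideas for the pattern in the question might look like the sketch below. It assumes Python 3.4+ (for `fullmatch`) and uses the ranges from the question; the names `RANGES`, `WORD_PATTERN` and `is_valid_word` are made up for illustration. Note that the original attempt put parentheses and `|` inside square brackets, where they are literal characters rather than grouping and alternation; here each class is a plain character set and the alternation happens outside the brackets.

```python
import re

# Named character classes, so each Unicode range is typed only once.
RANGES = {
    "independent_vowel": r"[\u0985-\u0994]",
    "consonant": r"[\u0995-\u09B9]",
    "right_vowel": r"[\u09BE\u09C0-\u09C4]",
    "left_vowel": r"[\u09BF\u09C7\u09C8]",
}

# With re.VERBOSE, whitespace and # comments in the pattern are ignored.
WORD_PATTERN = re.compile(r"""
    (?:
        {left_vowel} {consonant} {right_vowel}   # left vowel, consonant, right vowel
      | {left_vowel} {consonant}                 # left vowel, consonant
      | {consonant} {right_vowel}                # consonant, right vowel
      | {consonant}                              # bare consonant
      | {independent_vowel}                      # independent vowel
    )+
""".format(**RANGES), re.VERBOSE)


def is_valid_word(word):
    """True if the whole word is a sequence of permitted structures."""
    return WORD_PATTERN.fullmatch(word) is not None
```

Because `str.format` treats braces specially, a pattern that needs a regex quantifier such as `{2,3}` would have to write it as `{{2,3}}`.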

The list of permissible minimal 'structures' will have to be extended later. Is there any way to set up a sort of 'loop' mechanism within a regex, so that it works for all permissible structures in a list?

I'm not sure if I get what you mean, but... suppose you have a list of (uncompiled) REs, say, patterns, then you can compute their union with:

re.compile("(%s)" % "|".join(patterns))

Be careful with special characters when constructing REs this way, and use re.escape where necessary.
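As a small illustration of the escaping point, the sketch below builds a union from strings that are meant to be matched literally (the helper name `union_of_literals` and the sample strings are invented for this example). `re.escape` is only appropriate for literal text; sub-patterns that are themselves regexes, such as character classes, should be joined unescaped.

```python
import re


def union_of_literals(literals):
    """Compile a regex that matches any one of the given literal strings."""
    # Longest first, so "a+b" is tried before its prefix "a".
    ordered = sorted(literals, key=len, reverse=True)
    return re.compile("(?:%s)" % "|".join(re.escape(s) for s in ordered))


pattern = union_of_literals(["a", "a+b", "(c)"])
```

Without the escaping, "a+b" would be read as "one or more a, then b" and "(c)" as a group containing "c", so neither literal string would match itself.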

2 answers

Answer 1
4 2012-04-16 13:55:17

Answer 2
0 2012-04-16 12:40:32