Regex to get list of all words with specific letters (unicode graphemes)

I'm writing a Python script for a FOSS language learning initiative. Let's say I have an XML file (or, to keep it simple, a Python list) with a list of words in a particular language (in my case, the words are in Tamil, which uses a Brahmi-based Indic script).

Given a list of letters, I need to draw out the subset of those words that can be spelled using just those letters.

An English example:

words = ["cat", "dog", "tack", "coat"] 

get_words(['o', 'c', 'a', 't']) should return ["cat", "coat"]
get_words(['k', 'c', 't', 'a']) should return ["cat", "tack"]

A Tamil example:

words = [u"மரம்", u"மடம்", u"படம்", u"பாடம்"]

get_words([u'ம', u'ப', u'ட', u'ம்']) should return [u"மடம்", u"படம்"]
get_words([u'ப', u'ம்', u'ட']) should return [u"படம்"]

The order that the words are returned in, or the order that the letters are entered in, should not make a difference.

Although I understand the difference between unicode codepoints and graphemes, I'm not sure how they're handled in regular expressions.

In this case, I would want to match only those words that are made up of the specific graphemes in the input list, and nothing else (i.e. the markings that follow a letter should only follow that letter, but the graphemes themselves can occur in any order).
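To make the codepoint/grapheme distinction concrete, here is a small sketch (assuming Python 3 and only the standard library) that lists the code points behind one of the example words. The word is written with escapes so the individual code points are visible; the last one is a combining mark that belongs to the letter before it.

```python
import unicodedata

# The example word படம், spelled out code point by code point.
word = "\u0baa\u0b9f\u0bae\u0bcd"

for ch in word:
    # Print the code point, its general category and its Unicode name.
    print("U+%04X" % ord(ch), unicodedata.category(ch), unicodedata.name(ch))

# len() counts code points, so this prints 4 even though a reader sees 3 letters.
print(len(word))
```

The final code point has category Mn (a non-spacing mark), which is why a per-codepoint match can separate it from its base letter.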

To support characters that can span several Unicode codepoints:

# -*- coding: utf-8 -*-
import re
import unicodedata
from functools import partial

NFKD = partial(unicodedata.normalize, 'NFKD')

def match(word, letters):
    word, letters = NFKD(word), map(NFKD, letters) # normalize
    return re.match(r"(?:%s)+$" % "|".join(map(re.escape, letters)), word)

words = [u"மரம்", u"மடம்", u"படம்", u"பாடம்"]
get_words = lambda letters: [w for w in words if match(w, letters)]

print(" ".join(get_words([u'ம', u'ப', u'ட', u'ம்'])))
# -> மடம் படம்
print(" ".join(get_words([u'ப', u'ம்', u'ட'])))
# -> படம்

It assumes that the same character can be used zero or more times in a word.
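For comparison, the same "zero or more times" check can be sketched without a regular expression by splitting each word into clusters first. This is only an approximation of grapheme segmentation: the hypothetical helper `clusters()` simply attaches every combining mark (general category M*) to the preceding character, and `get_words()` here takes the word list as an explicit second argument.

```python
import unicodedata

def clusters(text):
    """Split text into base characters, each with its following combining marks."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch      # combining mark: glue it to the previous cluster
        else:
            out.append(ch)     # anything else starts a new cluster
    return out

def get_words(letters, words):
    """Keep the words whose clusters all appear in `letters`."""
    allowed = set(unicodedata.normalize("NFC", l) for l in letters)
    return [w for w in words if set(clusters(w)) <= allowed]

words = ["மரம்", "மடம்", "படம்", "பாடம்"]
print(" ".join(get_words(["ம", "ப", "ட", "ம்"], words)))
print(" ".join(get_words(["ப", "ம்", "ட"], words)))
```

Because the comparison is done on whole clusters, a bare letter and the same letter with a mark attached are treated as different letters. Characters such as joiners, which are not in category M*, are not handled by this sketch.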

If you want only words that contain exactly the given characters:

import regex # $ pip install regex

chars = regex.compile(r"\X").findall # get all characters

def match(word, letters):
    return sorted(chars(word)) == sorted(letters)

words = ["cat", "dog", "tack", "coat"]
get_words = lambda letters: [w for w in words if match(w, letters)]

print(" ".join(get_words(['o', 'c', 'a', 't'])))
# -> coat
print(" ".join(get_words(['k', 'c', 't', 'a'])))
# -> tack

Note: there is no cat in the output in this case because cat doesn't use all given characters.

What does it mean to normalize? And could you please explain the syntax of the re.match() regex?

>>> import re
>>> re.escape('.')
'\\.'
>>> c = u'\u00c7'
>>> cc = u'\u0043\u0327'
>>> cc == c
False
>>> re.match(r'%s$' % (c,), cc) # do not match
>>> import unicodedata
>>> norm = lambda s: unicodedata.normalize('NFKD', s)
>>> re.match(r'%s$' % (norm(c),), norm(cc)) # do match
<_sre.SRE_Match object at 0x1364648>
>>> print c, cc
Ç Ç

Without normalization c and cc do not match. The characters are from the unicodedata.normalize() docs.
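As for the regex syntax, it may help to print the pattern that gets built. The sketch below uses the English letters from the question; `(?:...)` is a non-capturing group, `|` separates the alternatives, `+` repeats the group one or more times, and `$` requires the match to reach the end of the word.

```python
import re

letters = ["o", "c", "a", "t"]

# Escape each letter (in case it is a regex metacharacter) and join with "|".
pattern = r"(?:%s)+$" % "|".join(map(re.escape, letters))
print(pattern)                           # (?:o|c|a|t)+$

# re.match() anchors at the start; "$" anchors at the end.
print(bool(re.match(pattern, "coat")))   # True
print(bool(re.match(pattern, "coach")))  # False, "h" is not an allowed letter
```

Using alternation rather than a character class is what lets a multi-codepoint letter act as a single unit in the pattern.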

EDIT: Okay, don't use any of the answers from here. I wrote them all while thinking Python regular expressions didn't have a word boundary marker, and I tried to work around this lack. Then @Mark Tolonen added a comment that Python has \b as a word boundary marker! So I posted another answer, short and simple, using \b. I'll leave this here in case anyone is interested in seeing solutions that work around the lack of \b, but I don't really expect anyone to be.


It is easy to make a regular expression that matches only a string of a specific set of characters. What you need to use is a "character class" with just the characters you want to match.

I'll do this example in English.

[ocat] This is a character class that will match a single character from the set [o, c, a, t]. Order of the characters doesn't matter.

[ocat]+ Putting a + on the end makes it match one or more characters from the set. But this is not enough by itself; if you had the word "coach" this would match and return "coac".
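A quick check of the "coach" case, as a minimal sketch:

```python
import re

# The unanchored class happily matches the longest allowed run inside the word.
m = re.search(r"[ocat]+", "coach")
print(m.group())  # coac
```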

Sadly, there isn't a regular expression feature for "word boundary". [EDIT: This turns out not to be correct, as I said in the first paragraph.] We need to make one of our own. There are two possible word beginnings: the start of a line, or whitespace separating our word from the previous word. Similarly, there are two possible word endings: end of a line, or whitespace separating our word from the next word.

Since we will be matching some extra stuff we don't want, we can put parentheses around the part of the pattern we do want.

To match two alternatives, we can make a group in parentheses and separate the alternatives with a vertical bar. Python regular expressions have a special notation to make a group whose contents we don't want to keep: (?:)

So, here is the pattern to match the beginning of a word. Start of line or white space: (?:^|\s)

Here is the pattern for end of word. White space or end of line: (?:\s|$)

Putting it all together, here is our final pattern:

(?:^|\s)([ocat]+)(?:\s|$)

You can build this dynamically. You don't need to hard-code the whole thing.

import re

s_pat_start = r'(?:^|\s)(['
s_pat_end = r']+)(?:\s|$)'

set_of_chars = get_the_chars_from_somewhere_I_do_not_care_where()
# set_of_chars is now set to the string: "ocat"

s_pat = s_pat_start + set_of_chars + s_pat_end
pat = re.compile(s_pat)
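One thing to watch when building the class from arbitrary input: characters such as -, ] and ^ have a special meaning inside a character class. A minimal sketch of a safer builder, where `build_pattern` is a hypothetical helper name and `re.escape()` is applied to each character:

```python
import re

def build_pattern(chars):
    """Compile the word pattern for `chars`, escaping class metacharacters."""
    char_class = "".join(re.escape(c) for c in chars)
    return re.compile(r"(?:^|\s)([" + char_class + r"]+)(?:\s|$)")

pat = build_pattern("ocat")
print(pat.search("dog coat").group(1))   # coat

# Without escaping, "a-c" would be read as the range a..c and match "b".
print(build_pattern("a-c").search("b"))  # None
```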

Now, this doesn't in any way check for valid words. If you have the following text:

This is sensible.  This not: occo cttc

The pattern I showed you will match on occo and cttc, and those are not really words. They are strings made only of letters from [ocat] though.

So just do the same thing with Unicode strings. (If you are using Python 3.x then all strings are Unicode strings, so there you go.) Put the Tamil characters in the character class and you are good to go.

This has a confusing problem: re.findall() doesn't return all possible matches.

EDIT: Okay, I figured out what was confusing me.

What we want is for our pattern to work with re.findall() so you can collect all the words. But re.findall() only finds non-overlapping matches. In my example, re.findall() only returned ['occo'] and not ['occo', 'cttc'] as I expected... but this is because my pattern was matching the white space after occo. The match group didn't collect the white space, but it was matched all the same, and since re.findall() wants no overlap between matches, the white space was "used up" and didn't work for cttc.
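The "used up" white space can be reproduced with a few lines; this sketch just runs the pattern from above on the example text:

```python
import re

pat = re.compile(r"(?:^|\s)([ocat]+)(?:\s|$)")
text = "This is sensible.  This not: occo cttc"

# The match for "occo" also consumes the space after it, so "cttc" no longer
# has a leading whitespace character left to match against.
print(pat.findall(text))  # ['occo']
```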

The solution is to use a feature of Python regular expressions that I have never used before: special syntax that says "must not be preceded by" or "must not be followed by". The sequence \S matches any non-whitespace so we could use that. But punctuation is non-whitespace, and I think we do want punctuation to delimit a word. There is also special syntax for "must be preceded by" or "must be followed by". So here is, I think, the best we can do:

Build a string that means "match when the character class string is at start of line and followed by whitespace, or when character class string is preceded by whitespace and followed by whitespace, or when character class string is preceded by whitespace and followed by end of line, or when character class string is preceded by start of line and followed by end of line".

Here is that pattern using ocat:

r'(?:^([ocat]+)(?=\s)|(?<=\s)([ocat]+)(?=\s)|(?<=\s)([ocat]+)$|^([ocat]+)$)'

I'm very sorry but I really do think this is the best we can do and still work with re.findall()!

It's actually less confusing in Python code though:

import re

NMGROUP_BEGIN = r'(?:'  # begin non-capturing group
NMGROUP_END = r')'  # end non-capturing group

WS_BEFORE = r'(?<=\s)'  # require white space before
WS_AFTER = r'(?=\s)'  # require white space after

BOL = r'^' # beginning of line
EOL = r'$' # end of line

CCS_BEGIN = r'(['  #begin a character class string
CCS_END = r']+)'  # end a character class string

PAT_OR = r'|'

set_of_chars = get_the_chars_from_somewhere_I_do_not_care_where()
# set_of_chars now set to "ocat"

CCS = CCS_BEGIN + set_of_chars + CCS_END  # build up character class string pattern

s_pat = (NMGROUP_BEGIN +
    BOL + CCS + WS_AFTER + PAT_OR +
    WS_BEFORE + CCS + WS_AFTER + PAT_OR +
    WS_BEFORE + CCS + EOL + PAT_OR +
    BOL + CCS + EOL +
    NMGROUP_END)

pat = re.compile(s_pat)

text = "This is sensible.  This not: occo cttc"

pat.findall(text)
# returns: [('', 'occo', '', ''), ('', '', 'cttc', '')]

So, the crazy thing is that when we have alternative patterns that could match, re.findall() seems to return an empty string for the alternatives that didn't match. So we just need to filter out the length-zero strings from our results:

import itertools as it

raw_results = pat.findall(text)
results = [s for s in it.chain(*raw_results) if s]
# results set to: ['occo', 'cttc']

I guess it might be less confusing to just build four different patterns, run re.findall() on each, and join the results together.

EDIT: Okay, here is the code for building four patterns and trying each. I think this is an improvement.

import re

WS_BEFORE = r'(?<=\s)'  # require white space before
WS_AFTER = r'(?=\s)'  # require white space after

BOL = r'^' # beginning of line
EOL = r'$' # end of line

CCS_BEGIN = r'(['  #begin a character class string
CCS_END = r']+)'  # end a character class string

set_of_chars = get_the_chars_from_somewhere_I_do_not_care_where()
# set_of_chars now set to "ocat"

CCS = CCS_BEGIN + set_of_chars + CCS_END  # build up character class string pattern

lst_s_pat = [
    BOL + CCS + WS_AFTER,
    WS_BEFORE + CCS + WS_AFTER,
    WS_BEFORE + CCS + EOL,
    BOL + CCS + EOL
]

lst_pat = [re.compile(s) for s in lst_s_pat]

text = "This is sensible.  This not: occo cttc"

result = []
for pat in lst_pat:
    result.extend(pat.findall(text))

# result set to: ['occo', 'cttc']
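For comparison, the four cases (start of line or whitespace before, whitespace or end of line after) can probably be collapsed into one pattern with the "must not be preceded by" and "must not be followed by" syntax mentioned earlier. This is a sketch of that alternative, not the approach used above: `(?<!\S)` means "not preceded by a non-whitespace character" and `(?!\S)` means "not followed by one", and both are also satisfied at the very start or end of the text.

```python
import re

# Negative lookarounds consume nothing, so neighbouring matches cannot
# compete for the same whitespace character.
pat = re.compile(r"(?<!\S)[ocat]+(?!\S)")
text = "This is sensible.  This not: occo cttc"

print(pat.findall(text))  # ['occo', 'cttc']
```

Note that this still treats punctuation as part of a word, so "cat," would not be matched.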

EDIT: Okay, here is a very different approach. I like this one best.

First, we will match all words in the text. A word is defined as one or more characters that are not punctuation and are not white space.

Then, we use a filter to remove words from the above; we keep only words that are made only of the characters we want.

import re
import string

# Create a pattern that matches all characters not part of a word.
#
# Note that several punctuation characters ('-', ']', '\', '^') have a
# special meaning inside a character class, but they are valid punctuation
# that we want to match, so use re.escape() to put a backslash in front of
# them and just match them literally.
#
# Use '^' which negates all the chars following.  So, a word is a series
# of characters that are all not whitespace and not punctuation.

WORD_BOUNDARY = re.escape(string.whitespace + string.punctuation)

WORD = r'[^' + WORD_BOUNDARY + r']+'


# Create a pattern that matches only the words we want.

set_of_chars = get_the_chars_from_somewhere_I_do_not_care_where()
# set_of_chars now set to "ocat"

# build up character class string pattern; the trailing '$' makes
# pat.match() accept a word only if every character is in the set
CCS = r'[' + set_of_chars + r']+$'


pat_word = re.compile(WORD)
pat = re.compile(CCS)

text = "This is sensible.  This not: occo cttc"


# This makes it clear how we are doing this.
all_words = pat_word.findall(text)
result = [s for s in all_words if pat.match(s)]

# "lazy" generator expression that yields up good results when iterated
# May be better for very large texts.
result_genexp = (s for s in (m.group(0) for m in pat_word.finditer(text)) if pat.match(s))

# force the expression to expand out to a list
result = list(result_genexp)

# result set to: ['occo', 'cttc']

EDIT: Now I don't like any of the above solutions; please see the other answer, the one using \b, for the best solution in Python.

It is easy to make a regular expression that matches only a string of a specific set of characters. What you need to use is a "character class" with just the characters you want to match.

I'll do this example in English.

[ocat] This is a character class that will match a single character from the set [o, c, a, t]. Order of the characters doesn't matter.

[ocat]+ Putting a + on the end makes it match one or more characters from the set. But this is not enough by itself; if you had the word "coach" this would match and return "coac".

\b[ocat]+\b Now it only matches on word boundaries. (Thank you very much @Mark Tolonen for educating me about \b.)

So, just build up a pattern like the above, only using the desired character set at runtime, and there you go. You can use this pattern with re.findall() or re.finditer().

import re

words = ["cat", "dog", "tack", "coat"]

def get_words(chars_seq, words_seq=words):
    s_chars = ''.join(chars_seq)
    s_pat = r'\b[' + s_chars + r']+\b'
    pat = re.compile(s_pat)
    return [word for word in words_seq if pat.match(word)]

assert get_words(['o', 'c', 'a', 't']) == ["cat", "coat"]
assert get_words(['k', 'c', 't', 'a']) == ["cat", "tack"]
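One caveat that may be worth checking before using \b with the Tamil words: as far as I know, the standard re module does not count combining marks such as the pulli (virama) as word characters, so a \b boundary can fall inside a grapheme. A small check, assuming Python 3 and the built-in re module:

```python
import re

word = "\u0baa\u0b9f\u0bae\u0bcd"   # படம் = PA, TTA, MA, VIRAMA
virama = "\u0bcd"

# The virama is a combining mark (category Mn), so \w does not match it.
print(re.match(r"\w", virama))

# \w+ therefore stops before the virama, and \b sits between MA and the mark.
print(re.findall(r"\w+", word) == [word[:3]])
```

If both lines confirm this on your Python version, a per-grapheme approach is the safer choice for Tamil input.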

I would not use regular expressions to solve this problem. I would rather use collections.Counter like so:

>>> from collections import Counter
>>> def get_words(word_list, letter_string):
...     return [word for word in word_list if Counter(word) & Counter(letter_string) == Counter(word)]
...
>>> words = ["cat", "dog", "tack", "coat"]
>>> letters = 'ocat'
>>> get_words(words, letters)
['cat', 'coat']
>>> letters = 'kcta'
>>> get_words(words, letters)
['cat', 'tack']

This solution should also work for other languages. Counter(word) & Counter(letter_string) finds the intersection of the two counters, that is, for each letter the minimum of its two counts. If this intersection is equal to Counter(word), then you want to return the word as a match.
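Note that, unlike the regex answers, this check is sensitive to how many times each letter is available. The sketch below shows that with an equivalent formulation based on Counter subtraction; the extra word "tact" is made up for the illustration.

```python
from collections import Counter

def get_words(word_list, letter_string):
    available = Counter(letter_string)
    # Counter subtraction keeps only positive counts, so an empty result
    # means the word never needs a letter more often than it is available.
    return [w for w in word_list if not (Counter(w) - available)]

# "tact" needs two t's but only one is available, so it is dropped.
print(get_words(["cat", "coat", "tact"], "ocat"))  # ['cat', 'coat']
```

Iterating over a str yields code points, so for Tamil input the words and letters would first need to be split into graphemes.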
