
Python RegEx, match words in string and get count

I want to match a list of words against a string and find out how many of the words are matched.

Now I have this:

import re
words = ["red", "blue"]
exactMatch = re.compile(r'\b%s\b' % '\\b|\\b'.join(words), flags=re.IGNORECASE)
print exactMatch.search("my blue cat")
print exactMatch.search("my red car")
print exactMatch.search("my red and blue monkey")
print exactMatch.search("my yellow dog")

My current regex matches the first three strings, but I would like to find out how many of the words in the list words match the string passed to search. Is this possible without making a new re.compile for each word in the list?

Or is there another way to achieve the same thing?

The reason I want to keep the number of re.compile calls to a minimum is speed, since in my application I have multiple word lists and about 3500 strings to search against.

If you use findall instead of search, you get a list containing all the matched words.

print exactMatch.findall("my blue cat")
print exactMatch.findall("my red car")
print exactMatch.findall("my red and blue monkey")
print exactMatch.findall("my yellow dog")

will result in

['blue']
['red']
['red', 'blue']
[]

If you need the number of matches, you can get it using len():

print len(exactMatch.findall("my blue cat"))
print len(exactMatch.findall("my red car"))
print len(exactMatch.findall("my red and blue monkey"))
print len(exactMatch.findall("my yellow dog"))

will result in

1
1
2
0
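Note that len(findall(...)) counts occurrences, so a word that appears twice is counted twice. If the goal is how many different words from the list appear, the matches can be collapsed into a set first. The following is a minimal sketch, written for Python 3 (the snippets above use Python 2 print statements). The helper name count_distinct_words is made up for illustration, and re.escape is added as a precaution in case a word contains regex metacharacters.

```python
import re

words = ["red", "blue"]

# One compiled pattern for the whole list: \b(?:red|blue)\b
pattern = re.compile(
    r"\b(?:%s)\b" % "|".join(re.escape(w) for w in words),
    flags=re.IGNORECASE,
)

def count_distinct_words(text):
    """Return how many different words from the list occur in text."""
    # Lowercase so that "Red" and "red" count as the same word.
    return len({match.lower() for match in pattern.findall(text)})

print(count_distinct_words("my red and blue monkey"))  # 2
print(count_distinct_words("my blue blue cat"))        # 1
print(count_distinct_words("my yellow dog"))           # 0
```

The pattern is still compiled only once per word list, which keeps the number of re.compile calls low as the question asks.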

If I understood the question correctly, you only want to know the number of matches of blue or red in a sentence.

>>> exactMatch = re.compile(r'%s' % '|'.join(words), flags=re.IGNORECASE)
>>> print exactMatch.findall("my blue blue cat")
['blue', 'blue']
>>> print len(exactMatch.findall("my blue blue cat"))
2

You need more code if you want to test multiple colors.
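One way to read "more code" here is a per-word tally rather than a single total. Keep in mind that a pattern without \b, such as red|blue, also matches inside longer words like "bored". The sketch below assumes Python 3 and keeps the word boundaries from the question; the function name count_each is hypothetical.

```python
import re
from collections import Counter

words = ["red", "blue"]

# Single compiled alternation with word boundaries around the group.
pattern = re.compile(
    r"\b(?:%s)\b" % "|".join(map(re.escape, words)),
    flags=re.IGNORECASE,
)

def count_each(text):
    """Tally how often each listed word occurs in text, ignoring case."""
    return Counter(m.group(0).lower() for m in pattern.finditer(text))

counts = count_each("my Blue blue cat is bored of red")
print(counts["blue"])  # 2
print(counts["red"])   # 1 ("bored" is not counted)
```

A Counter returns 0 for words that never matched, so counts["yellow"] can be read without a key check.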

Why not store all the words in a hash and look up every word of each sentence while iterating with finditer?

words = { "red": 1 .... }
word = re.compile(r'\b(\w+)\b')
for i in word.finditer(sentence):
    if words.get(i.group(1)):
        ....

for w in words:
    if w in searchterm:
        print "found"
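The dictionary-lookup idea can be sketched as follows. This is only one possible reading of the outline above, written for Python 3: each sentence is tokenized once with a single compiled pattern, and the tokens are then checked against every word list by set intersection, so no regex is compiled per word or per list. The names word_lists and matched_counts and the sample data are invented for illustration. Unlike a plain substring test such as w in searchterm, this counts whole words only.

```python
import re

# Hypothetical word lists; the real application would supply its own.
word_lists = {
    "colors": {"red", "blue"},
    "animals": {"cat", "dog", "monkey"},
}

token = re.compile(r"\b\w+\b")  # compiled once, shared by every list

def matched_counts(sentence):
    """Map each list name to the number of its words found in sentence."""
    # Lowercased set of the words in the sentence.
    tokens = {m.group(0).lower() for m in token.finditer(sentence)}
    return {name: len(tokens & wordset) for name, wordset in word_lists.items()}

print(matched_counts("my red and blue monkey"))
# {'colors': 2, 'animals': 1}
print(matched_counts("bored cat"))
# {'colors': 0, 'animals': 1}
```

Whether this is faster than one compiled alternation per list depends on the data, so it is worth timing both on a sample of the real strings.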
