
Python regular expressions: 0 or more words from a set

I have a big block of text in which I am trying to find a phrase. The phrase can be structured in a number of different ways.

  1. First I want to look for a word from a set of words, let's call it set 1.
  2. After that there must be a space or comma (or maybe something else that separates words).
  3. Then there may be 0 or more words from set 2.
  4. These are again followed by the word separation characters from point 2 above.
  5. Finally there should be a word from set 3.

Ideally all of these should be in the same sentence.

set 1 = (Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)

set 2 = (for|to|of|full|a|be|complete|Internal)

set 3 = (renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)

So I have this regular expression:

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]*(for|to|of|full|a|be|complete|Internal)[ ,]*(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)

Now this matches a phrase that contains one word from set 2, but not a phrase that contains several. For example, it matches "provides a wonderful opportunity for someone to add their own stamp as the property needs complete renovation throughout."

As soon as I add 'a' before 'complete' it fails. The same happens if I add another 'complete'.
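The behaviour described above can be reproduced with a short script. This is only a sketch: the constant names `SET1`, `SET2`, `SET3` and the helper `matches` are hypothetical, and the test sentences are shortened versions of the example.

```python
import re

# The three word sets from the question, written as alternations.
SET1 = "Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring"
SET2 = "for|to|of|full|a|be|complete|Internal"
SET3 = ("renovate|improve|modernise|modernize|update|renovating|improving|"
        "modernising|modernizing|updating|potential|project|renovation")

# The expression from the question: the set 2 group appears once and is not repeated.
original = re.compile(f"({SET1})[ ,]*({SET2})[ ,]*({SET3})")

def matches(text):
    """Return True if the original expression is found anywhere in text."""
    return original.search(text) is not None

print(matches("the property needs complete renovation throughout"))           # True
print(matches("the property needs a complete renovation throughout"))         # False
print(matches("the property needs complete complete renovation throughout"))  # False
```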

How do I specify that the expression should look for 0 or more words from a set?

Set 1: Matches any one of the words in set 1, followed by 1 separator.

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]

Set 2: Matches any one of the words in set 2 followed by 1 separator, repeated 0 or more times.

((for|to|of|full|a|be|complete|Internal)[ ,])*

Set 3: Matches any one of the words in set 3.

(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)

Full:

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]((for|to|of|full|a|be|complete|Internal)[ ,])*(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)
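As a rough illustration, the full expression can be assembled from its three parts and tried in Python. The names `SET1`, `SET2`, `SET3`, `PATTERN` and `find_phrase` are assumptions made for this sketch, not part of the answer.

```python
import re

SET1 = "Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring"
SET2 = "for|to|of|full|a|be|complete|Internal"
SET3 = ("renovate|improve|modernise|modernize|update|renovating|improving|"
        "modernising|modernizing|updating|potential|project|renovation")

# set 1 + one separator, then (set 2 + one separator) repeated 0 or more times, then set 3
PATTERN = re.compile(f"({SET1})[ ,](({SET2})[ ,])*({SET3})")

def find_phrase(text):
    """Return the matched phrase, or None if the pattern is not found."""
    m = PATTERN.search(text)
    return m.group(0) if m else None

print(find_phrase("the property needs renovation"))
# needs renovation
print(find_phrase("the property needs a complete renovation throughout"))
# needs a complete renovation
print(find_phrase("the property needs a new roof"))
# None
```

Because `[ ,]` matches exactly one character, a comma followed by a space (as in "needs, complete renovation") is not matched by this pattern; `[ ,]+` would be one possible variation if that case matters.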

Long alternations in regular expressions can be quite slow. I'd suggest taking another approach. First segment the text (split it into words), then iterate over the array of words, checking whether each run of 3 consecutive words fulfils your requirements.

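Something like this (more pseudocode than real Python; segment and the check_word functions are placeholders):

def check(text):
    words = segment(text)  # split the text into a list of words
    for i in range(len(words) - 2):
        # one word from set 1, then one from set 2, then one from set 3
        if check_word1(words[i]) and check_word2(words[i + 1]) and check_word3(words[i + 2]):
            return True
    return False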
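The word-by-word idea can also be extended to allow 0 or more words from set 2, as the question asks. The following is a minimal sketch under stated assumptions: words are extracted with `re.findall(r"\w+", ...)`, matching is case-sensitive like the original expression, and the names `SET1`, `SET2`, `SET3` and `check_phrase` are hypothetical.

```python
import re

# Word sets from the question, stored as Python sets for fast membership tests.
SET1 = {"Potential", "Ability", "Possibility", "need", "requires", "needs",
        "plenty", "for", "Needing", "Requiring"}
SET2 = {"for", "to", "of", "full", "a", "be", "complete", "Internal"}
SET3 = {"renovate", "improve", "modernise", "modernize", "update", "renovating",
        "improving", "modernising", "modernizing", "updating", "potential",
        "project", "renovation"}

def check_phrase(text):
    """Return True if a set 1 word is followed by 0+ set 2 words and a set 3 word."""
    words = re.findall(r"\w+", text)  # crude segmentation: punctuation is dropped
    for i, word in enumerate(words):
        if word not in SET1:
            continue
        j = i + 1
        while j < len(words):
            if words[j] in SET3:      # reached the closing word
                return True
            if words[j] not in SET2:  # run of set 2 words was interrupted
                break
            j += 1
    return False

print(check_phrase("the property needs a complete renovation throughout"))  # True
print(check_phrase("the property needs renovation"))                        # True
print(check_phrase("the property needs a new roof"))                        # False
```

Since this segmentation discards punctuation, a phrase that spans a sentence boundary would still be accepted; splitting the text into sentences first would be needed to enforce the same-sentence requirement.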

You have to use this regex:

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,](for|to|of|full|a|be|complete|Internal)*[ ,](renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)

This is because you have one word from the first set. After that you have one space or comma. Next you have 0 or more words from set 2, then another space or comma, and finally one word from the last set.

Just in case you didn't know, you can use sites like https://regex101.com/ to test your regular expressions and see why they work or don't.

In this case, you need the "zero or more" ( * ) operator on your second group. The result would be:

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]*(for|to|of|full|a|be|complete|Internal)*[ ,]*(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)

However, considering you probably want the words to be separated, just include the separator inside the repeated part (you can use a non-capturing group for that), resulting in:

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]*(?:(for|to|of|full|a|be|complete|Internal)[ ,]*)*(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)
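A short Python sketch of this last expression may help show what the groups contain after a match. The names `SET1`, `SET2`, `SET3`, `PATTERN` and `match_groups` are assumptions for illustration only.

```python
import re

SET1 = "Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring"
SET2 = "for|to|of|full|a|be|complete|Internal"
SET3 = ("renovate|improve|modernise|modernize|update|renovating|improving|"
        "modernising|modernizing|updating|potential|project|renovation")

# (?: ... )* repeats "set 2 word + optional separators" without creating a group of its own.
PATTERN = re.compile(f"({SET1})[ ,]*(?:({SET2})[ ,]*)*({SET3})")

def match_groups(text):
    """Return (whole match, captured groups), or None if there is no match."""
    m = PATTERN.search(text)
    return (m.group(0), m.groups()) if m else None

print(match_groups("the property needs a complete renovation throughout"))
# ('needs a complete renovation', ('needs', 'complete', 'renovation'))
```

Note that a capturing group inside a repetition only keeps its last repetition, which is why the second group holds 'complete' and not 'a'. If every set 2 word is needed, the matched phrase in `m.group(0)` can be split afterwards.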

