Python regex: a set with 0 or more words

Question

I have a big block of text in which I am trying to find a phrase. The phrase can be structured in a number of different ways.

1. First I want to look for a word from a set of words; let's call it set 1.
2. After that there must be a space or comma (or maybe something else that separates words).
3. Then there may be 0 or more words from set 2.
4. Again followed by the word separation characters, as in point 2 above.
5. Finally there should be a word from set 3.

Ideally all of these should be in the same sentence.

set 2 = (for|to|of|full|a|be|complete|Internal)

So I have this regex:

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]*(for|to|of|full|a|be|complete|Internal)[ ,]*(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)

Now this will match a phrase where there is 0 or 1 word from set 2, but not if there are several, e.g. "provides a wonderful opportunity for someone to add their own stamp as the property needs complete renovation throughout."

As soon as I add 'a' before 'complete' it fails. The same happens if I add another 'complete'.

How do I specify that I want to look for 0 or more words from a set?
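
To make the problem easy to reproduce, here is a minimal sketch that runs the pattern from the question against two shortened versions of the example sentence. The variable names and the shortened sentences are my own, chosen only for illustration.

```python
import re

# The pattern from the question, copied unchanged.
PATTERN = re.compile(
    r"(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)"
    r"[ ,]*(for|to|of|full|a|be|complete|Internal)"
    r"[ ,]*(renovate|improve|modernise|modernize|update|renovating|improving"
    r"|modernising|modernizing|updating|potential|project|renovation)"
)

# Shortened, illustrative versions of the sentence in the question.
one_word = "the property needs complete renovation throughout."
two_words = "the property needs a complete renovation throughout."

print(bool(PATTERN.search(one_word)))   # one word from set 2
print(bool(PATTERN.search(two_words)))  # two words from set 2
```

The first search finds "needs complete renovation"; the second finds nothing, because the set 2 group can only be matched once.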

Answer 1

Set 1: Matches any of the words in set 1, followed by 1 separator.

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]

Set 2: Matches any of the words in set 2 followed by 1 separator, 0 or more times.

((for|to|of|full|a|be|complete|Internal)[ ,])*

Set 3: Matches any of the words in set 3.

(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)

Full:

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]((for|to|of|full|a|be|complete|Internal)[ ,])*(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)
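
As a sketch of how the three pieces above could be assembled in Python: the helper names (`alternation`, `find_phrase`) and the list constants are hypothetical, and I have used non-capturing groups instead of the capturing ones shown above, which does not change what is matched.

```python
import re

SET1 = ["Potential", "Ability", "Possibility", "need", "requires", "needs",
        "plenty", "for", "Needing", "Requiring"]
SET2 = ["for", "to", "of", "full", "a", "be", "complete", "Internal"]
SET3 = ["renovate", "improve", "modernise", "modernize", "update", "renovating",
        "improving", "modernising", "modernizing", "updating", "potential",
        "project", "renovation"]

def alternation(words):
    """Join words into a non-capturing alternation such as (?:a|b|c)."""
    return "(?:" + "|".join(re.escape(w) for w in words) + ")"

SEP = "[ ,]"  # exactly one space or comma

# set 1 word, separator, then (set 2 word + separator) repeated 0+ times, then set 3 word
PHRASE = re.compile(
    alternation(SET1) + SEP + "(?:" + alternation(SET2) + SEP + ")*" + alternation(SET3)
)

def find_phrase(text):
    """Return the first matching phrase in text, or None if there is none."""
    m = PHRASE.search(text)
    return m.group(0) if m else None

print(find_phrase("the property needs a complete renovation throughout."))
```

Keeping the separator inside the repeated group is what lets zero, one, or several set 2 words match while still requiring a separator between each pair of words.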

Answer 2

Long alternations in regular expressions can be quite slow. I'd suggest taking another approach: first segment the text (split it into words), then iterate over the array of words, checking whether consecutive sets of 3 words fulfil your requirements.

Something like this (pseudocode rather than real Python):

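def check(text):
    words = segment(text)  # split the text into a list of words
    for i in range(len(words) - 2):
        # one word from each set, in order
        if check_word1(words[i]) and check_word2(words[i+1]) and check_word3(words[i+2]):
            return True
    return False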
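
A runnable take on the same idea might look like the sketch below. It goes a little beyond the fixed window of 3 words so that 0 or more set 2 words are allowed, as the question asks. The `segment` implementation (letters only) is an assumption on my part, and the word sets are copied from the question.

```python
import re

SET1 = {"Potential", "Ability", "Possibility", "need", "requires", "needs",
        "plenty", "for", "Needing", "Requiring"}
SET2 = {"for", "to", "of", "full", "a", "be", "complete", "Internal"}
SET3 = {"renovate", "improve", "modernise", "modernize", "update", "renovating",
        "improving", "modernising", "modernizing", "updating", "potential",
        "project", "renovation"}

def segment(text):
    """Split text into words; treating any run of letters as a word is an assumption."""
    return re.findall(r"[A-Za-z]+", text)

def check(text):
    """Return the matched words (set 1, then 0+ from set 2, then set 3), or None."""
    words = segment(text)
    for i, word in enumerate(words):
        if word not in SET1:
            continue
        j = i + 1
        # Skip over any run of set 2 words. A greedy skip is enough here
        # because set 2 and set 3 share no words.
        while j < len(words) and words[j] in SET2:
            j += 1
        if j < len(words) and words[j] in SET3:
            return words[i:j + 1]
    return None

print(check("the property needs a complete renovation throughout."))
```

Note that this `segment` drops all punctuation, so unlike the regex it would also accept a phrase that runs across a full stop; splitting into sentences first would be needed to avoid that.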

Answer 3

You have to use this regex:

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,](for|to|of|full|a|be|complete|Internal)*[ ,](renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)

Because you have one word from the first set. After that you have one space or comma. Next you have 0 or more words from set 2, then another space or comma, and finally one word from the last set.
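
A quick way to check how this pattern behaves, offered as a sketch rather than a verdict: in the regex above the `*` applies only to the word group, with one separator before it and one after it. The sample phrases below are made up for illustration.

```python
import re

# Answer 3's pattern, copied unchanged.
ANSWER3 = re.compile(
    r"(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)"
    r"[ ,](for|to|of|full|a|be|complete|Internal)*"
    r"[ ,](renovate|improve|modernise|modernize|update|renovating|improving"
    r"|modernising|modernizing|updating|potential|project|renovation)"
)

# Hypothetical sample phrases, not taken from the question's data.
samples = [
    "needs complete renovation",    # one set 2 word
    "needs a complete renovation",  # two set 2 words separated by a space
    "needs renovation",             # no set 2 word, single separator
]

for s in samples:
    print(s, "->", bool(ANSWER3.search(s)))
```

Only the first phrase matches. The second fails because nothing in the repeated group can consume the space between 'a' and 'complete', and the third fails because the pattern always expects two separators.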

Answer 4

Just in case you didn't know, you can use sites like https://regex101.com/ to test your regular expressions and see why they work or don't.

In this case, you need the "zero or more" (*) operator on your second group. The result would be:

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]*(for|to|of|full|a|be|complete|Internal)*[ ,]*(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)

However, considering you probably want the words to be separated, just include the separator inside the repeated part (you can use a non-capturing group for that), resulting in:

(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]*(?:(for|to|of|full|a|be|complete|Internal)[ ,]*)*(renovate|improve|modernise|modernize|update|renovating|improving|modernising|modernizing|updating|potential|project|renovation)
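
For illustration, here is a small sketch of that last pattern used from Python's `re` module. The test sentence is a shortened version of the one in the question, and the variable names are my own.

```python
import re

# The final pattern above, copied unchanged.
ANSWER4 = re.compile(
    r"(Potential|Ability|Possibility|need|requires|needs|plenty|for|Needing|Requiring)[ ,]*"
    r"(?:(for|to|of|full|a|be|complete|Internal)[ ,]*)*"
    r"(renovate|improve|modernise|modernize|update|renovating|improving"
    r"|modernising|modernizing|updating|potential|project|renovation)"
)

m = ANSWER4.search("the property needs a complete renovation throughout.")

print(m.group(0))  # the whole matched phrase
print(m.groups())  # one entry per capturing group
```

Two things worth knowing when using it: a capturing group inside a repeated group only keeps its last repetition, so `m.groups()` reports 'complete' but not 'a'; and because every separator is `[ ,]*`, the separators are optional, so words run together such as "needsrenovation" would match as well.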

4 answers

Answer 1: 3, accepted, 2019-01-07 13:56:44
Answer 2: 2, 2019-01-07 13:57:07
Answer 3: 1, 2019-01-07 14:13:52
Answer 4: 0, 2019-01-07 13:59:29
