Find unordered words with RegEx

I want to use RegEx to find the first sequence within a string in which a set of words appears, in any order.

For example, if looking for the words hello, my and world, then:

  • for hello my sweet world the expression would match hello my sweet world;
  • for oh my, hello world it would match my, hello world;
  • for oh my world, hello world it would match my world, hello;
  • for hello world there would be no match.

After some research, I tried the expression (?=.*?\bhello\b)(?=.*?\bmy\b)(?=.*?\bworld\b).* , which does not solve my problem: if all the words are present, it matches the whole string, as in:

  • for oh my world, hello world it matches oh my world, hello world
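
The behaviour described above can be reproduced with a short Python sketch. The variable name lookahead is chosen here for illustration only. Each lookahead is a zero-width assertion checked at the current position, so once all three succeed at the start of the string, the trailing .* simply consumes the rest of the line.

```python
import re

# The attempted pattern: three lookaheads, then a greedy ".*".
lookahead = re.compile(r'(?=.*?\bhello\b)(?=.*?\bmy\b)(?=.*?\bworld\b).*')

# All three words are present, so the match starts at position 0
# and ".*" runs to the end of the string.
m = lookahead.search('oh my world, hello world')
print(m.group(0))

# "my" is missing, so the second lookahead fails at every position.
print(lookahead.search('hello world'))
```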

What would be the appropriate expression to achieve what I described?

(Although RegEx is the preferred method for my program, if you think it is not the way to go, any other Python solution is welcome.)

A unified, iterative, Pythonic approach using the Pattern.finditer() method and a set object:

import re

test_str = '''The introduction here for our novel. 
Oh, hello my friend. This world is full of beauty and mystery, let's say hello to universe ...'''

words_set = {'my', 'hello', 'world'}    # a set of search words
words_set_copy = set(words_set)
pat = re.compile(r'\b(my|hello|world)\b', re.I)
start_pos = None
first_sequence = ''

for m in pat.finditer(test_str):        
    if start_pos is None:
        start_pos = m.start()           # start position of the 1st match object
    words_set_copy.discard(m.group().lower())   # discard found word (lowercased, since re.I also matches e.g. "Hello")

    if not words_set_copy:              # all the search words found
        first_sequence += test_str[start_pos: m.end()]
        break

print(first_sequence)

The output:

hello my friend. This world

You may transform the above approach into a function to make it reusable.
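
A minimal sketch of such a function follows. The name first_sequence and its signature are assumptions made for illustration; the logic is the same finditer-plus-set loop, with the pattern built from the word list and matches lowercased before they are discarded.

```python
import re

def first_sequence(text, words):
    """Return the text from the first hit on any search word up to the
    point where every word has been seen at least once, or None."""
    remaining = {w.lower() for w in words}
    # re.escape keeps each word literal; sorting only makes the pattern stable.
    alternation = '|'.join(re.escape(w) for w in sorted(remaining))
    pattern = re.compile(r'\b(?:%s)\b' % alternation, re.I)

    start = None
    for m in pattern.finditer(text):
        if start is None:
            start = m.start()              # position of the first word found
        remaining.discard(m.group().lower())
        if not remaining:                  # every search word has been seen
            return text[start:m.end()]
    return None                            # at least one word never appeared


print(first_sequence('oh my world, hello world', ['hello', 'my', 'world']))
print(first_sequence('hello world', ['hello', 'my', 'world']))
```

Returning None when a word is missing lets the caller distinguish "no match" from an empty result.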

I think this task is best done with some programming logic; a regex will not be easy or efficient here. But here is a regex that seems to do the job, whether or not the search words (hello, my, world) are repeated in the string:

\b(hello|my|world)\b.*?((?!\1)\b(?:hello|my|world)\b).*?(?:(?!\1)(?!\2)\b(?:hello|my|world)\b)
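
One way to try the expression above is with Python's re module. This is a small sketch: the helper name find_first is made up for illustration, and the sample strings are the four examples from the question.

```python
import re

# The expression from the answer, split across lines for readability.
pattern = re.compile(
    r'\b(hello|my|world)\b.*?'
    r'((?!\1)\b(?:hello|my|world)\b).*?'
    r'(?:(?!\1)(?!\2)\b(?:hello|my|world)\b)'
)

def find_first(text):
    """Return the matched sequence, or None when a word is missing."""
    m = pattern.search(text)
    return m.group(0) if m else None

for sample in ('hello my sweet world',
               'oh my, hello world',
               'oh my world, hello world',
               'hello world'):
    print(repr(sample), '->', repr(find_first(sample)))
```

Note that . does not match a newline by default, so for multi-line input the re.S flag would be needed for a sequence to span lines.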

The idea here is:

  1. Make an alternation group \b(hello|my|world)\b and capture it as group 1.
  2. The lazy .*? then allows zero or more arbitrary characters to follow it.
  3. Next must come one of the remaining two words, not the one already matched by group 1, which is why I have used ((?!\1)\b(?:hello|my|world)\b); this second match is captured as group 2.
  4. Again, .*? allows zero or more arbitrary characters to follow.
  5. Finally, the same logic is applied once more: the third word must be the one that was captured by neither group 1 nor group 2, hence (?:(?!\1)(?!\2)\b(?:hello|my|world)\b)

