Find unordered words with RegEx
I want to use RegEx to find the first sequence within a string where a set of words appears, in any order.
For example, if looking for the words hello, my and world, then:
for "hello my sweet world", the expression would match "hello my sweet world";
for "oh my, hello world", it would match "my, hello world";
for "oh my world, hello world", it would match "my world, hello";
for "hello world", there would be no match.
After some research, I tried the expression (?=.*?\bhello\b)(?=.*?\bmy\b)(?=.*?\bworld\b).*, which does not solve my problem, because it matches the whole string whenever all the words are present, as in:
for "oh my world, hello world", it matches "oh my world, hello world".
What would be the appropriate expression to achieve what I described?
(Although RegEx is the preferred method for my program, if you think it is not the way to go, any other Python solution is welcome.)
An iterative, Pythonic approach using the Pattern.finditer() function and a set object:
import re

test_str = '''The introduction here for our novel.
Oh, hello my friend. This world is full of beauty and mystery, let's say hello to universe ...'''

words_set = {'my', 'hello', 'world'}  # a set of search words (lowercase)
words_set_copy = set(words_set)       # working copy; found words are removed from it
pat = re.compile(r'\b(my|hello|world)\b', re.I)
start_pos = None
first_sequence = ''

for m in pat.finditer(test_str):
    if start_pos is None:
        start_pos = m.start()  # start position of the 1st match object
    # lowercase the match so it agrees with the set despite re.I
    words_set_copy.discard(m.group().lower())  # discard found unique match
    if not words_set_copy:  # all the search words found
        first_sequence = test_str[start_pos: m.end()]
        break

print(first_sequence)
The output:
hello my friend. This world
You may transform the above approach into a function to make it reusable.
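One possible way to wrap the loop in a function is sketched below. The name find_first_sequence and its parameters are illustrative and not part of the original answer; the sketch assumes the search words are plain words (they are escaped with re.escape) and lowercases each match so that the case-insensitive flag and the set lookup agree.

```python
import re

def find_first_sequence(text, words):
    """Return the text from the first matched word up to the match that
    completes the set of words, or None if some word never appears.
    (Hypothetical helper; same logic as the loop above.)"""
    remaining = {w.lower() for w in words}
    # Build the alternation from the words instead of hard-coding it.
    alternation = '|'.join(re.escape(w) for w in sorted(remaining))
    pat = re.compile(r'\b(?:' + alternation + r')\b', re.I)
    start = None
    for m in pat.finditer(text):
        if start is None:
            start = m.start()              # position of the first matched word
        remaining.discard(m.group().lower())
        if not remaining:                  # every search word has been seen
            return text[start:m.end()]
    return None

print(find_first_sequence('oh my world, hello world', ['hello', 'my', 'world']))
print(find_first_sequence('hello world', ['hello', 'my', 'world']))
```

For the inputs from the question, the two calls print "my world, hello" and None respectively.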
I think this task is best done with some programming logic; a regex for it is neither easy nor efficient. But here is a regex that seems to do the job, and it doesn't matter whether any of the words (hello, my, world) are repeated or not:
\b(hello|my|world)\b.*?((?!\1)\b(?:hello|my|world)\b).*?(?:(?!\1)(?!\2)\b(?:hello|my|world)\b)
The idea here is:
Create an alternation group \b(hello|my|world)\b and put it in group 1.
It then has to be followed by either of the remaining two words, but not the one matched in group 1, which is why I used ((?!\1)\b(?:hello|my|world)\b); this second match is then put in group 2.
Then we apply the same logic again: the third word should be the one not captured in group 1 or group 2, hence (?:(?!\1)(?!\2)\b(?:hello|my|world)\b).
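As a quick check, the regex above can be run against the four examples from the question. This is only a sketch: the helper name match_unordered is made up for illustration, the pattern is used case-sensitively, and because . does not match a newline by default, multi-line input would need the re.S flag.

```python
import re

# The regex from the answer, split across lines for readability.
PATTERN = re.compile(
    r'\b(hello|my|world)\b.*?'                  # group 1: first word found
    r'((?!\1)\b(?:hello|my|world)\b).*?'        # group 2: a different word
    r'(?:(?!\1)(?!\2)\b(?:hello|my|world)\b)'   # third word, not group 1 or 2
)

def match_unordered(text):
    """Return the first matching sequence, or None (hypothetical helper)."""
    m = PATTERN.search(text)
    return m.group(0) if m else None

for s in ['hello my sweet world',
          'oh my, hello world',
          'oh my world, hello world',
          'hello world']:
    print(repr(s), '->', repr(match_unordered(s)))
```

The first three strings yield "hello my sweet world", "my, hello world" and "my world, hello"; the last one yields None, which agrees with the behaviour requested in the question.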