
Search in a string and obtain the 2 words before and after the match in Python

I'm using Python to search for some words (possibly multi-token) in a description (a string).

To do that I'm using a regex like this:

    import re

    result = re.search(word, description, re.IGNORECASE)
    if result:
        print("Trovato: " + result.group())  # "Trovato" is Italian for "Found"

But what I need is to obtain the 2 words before and the 2 words after the match. For example, if I have something like this:

Parking here is horrible, this shop sucks.

"here is" is the phrase that I am looking for. So after matching it with my regex, I need the 2 words (if they exist) before and after the match.

In the example: Parking here is horrible, this

"Parking" and "horrible, this" are the words that I need.

ATTENTION: The description can be very long, and the pattern "here is" can appear multiple times.

How about string operations?

line = 'Parking here is horrible, this shop sucks.'

before, term, after = line.partition('here is')
before = before.rsplit(maxsplit=2)[-2:]
after = after.split(maxsplit=2)[:2]

Result:

>>> before
['Parking']
>>> after
['horrible,', 'this']
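The same idea can be wrapped in a small helper. This is only a sketch: the name `context_words` and the parameter `n` are made up for illustration, it handles just the first occurrence, and `str.partition` is case-sensitive, unlike the `re.IGNORECASE` search in the question.

```python
def context_words(text, term, n=2):
    """Return up to n words before and after the first occurrence of term.

    Hypothetical helper; assumes n >= 1 and a case-sensitive match.
    """
    before, match, after = text.partition(term)
    if not match:
        return None  # term does not occur in text
    # str.split() with no arguments splits on any run of whitespace
    return before.split()[-n:], after.split()[:n]


print(context_words('Parking here is horrible, this shop sucks.', 'here is'))
# (['Parking'], ['horrible,', 'this'])
```

Returning `None` when the term is missing keeps "not found" distinguishable from "found, but with no surrounding words".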

Try this regex: ((?:[a-z,]+\s+){0,2})here is\s+((?:[a-z,]+\s*){0,2})

with re.findall and re.IGNORECASE set.
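A minimal sketch of how that might look in practice, assuming the character class is meant to be `[a-z,]` and the pattern is written as a raw string. The variable names are illustrative only.

```python
import re

# Up to two words before and up to two words after the phrase "here is"
pattern = r'((?:[a-z,]+\s+){0,2})here is\s+((?:[a-z,]+\s*){0,2})'
line = 'Parking here is horrible, this shop sucks.'

# With two capture groups, findall returns a list of (before, after) tuples
matches = re.findall(pattern, line, re.IGNORECASE)

# The groups are raw substrings, so split them into individual words
contexts = [(before.split(), after.split()) for before, after in matches]
print(contexts)
# [(['Parking'], ['horrible,', 'this'])]
```

Note that `[a-z,]` only covers letters and commas, so words containing digits, apostrophes, or periods would not be captured, and `re.findall` does not return overlapping matches.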


Based on your clarification, this becomes a bit more complicated. The solution below deals with scenarios where the searched pattern may itself appear among the two preceding or two following words.

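import re

line = "Parking here is horrible, here is great here is mediocre here is here is "
print(line)
pattern = "here is"
# Only enter the loop if the pattern occurs at all (case-insensitive check)
r = re.search(pattern, line, re.IGNORECASE)
output = []
if r:
    while line:
        # Note: str.partition is case-sensitive, unlike the re.search above
        before, match, line = line.partition(pattern)
        if match:
            if not output:
                before = before.split()[-2:]
            else:
                # Prepend the previous match so its words can count as context
                before = ' '.join([pattern, before]).split()[-2:]
            after = line.split()[:2]
            output.append((before, after))
print(output)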

Output from my example would be:

[(['Parking'], ['horrible,', 'here']), (['is', 'horrible,'], ['great', 'here']), (['is', 'great'], ['mediocre', 'here']), (['is', 'mediocre'], ['here', 'is']), (['here', 'is'], [])]
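For comparison, here is a sketch of a variant that uses `re.finditer` so every occurrence is found case-insensitively, which `str.partition` alone does not do. This is not the method above, just an alternative under stated assumptions: the function name `all_contexts` and the parameter `n` are hypothetical, and `n` is assumed to be at least 1.

```python
import re


def all_contexts(text, term, n=2):
    """Yield (before_words, matched_text, after_words) for each occurrence of term."""
    # re.escape makes the term match literally; IGNORECASE also finds "Here is"
    for m in re.finditer(re.escape(term), text, re.IGNORECASE):
        before = text[:m.start()].split()[-n:]  # last n words before the match
        after = text[m.end():].split()[:n]      # first n words after the match
        yield before, m.group(), after


line = "Parking here is horrible, Here is great here is mediocre"
for before, match, after in all_contexts(line, "here is"):
    print(before, match, after)
```

Two caveats: the term is matched as a plain substring, so it would also be found inside "there is", and splitting the whole prefix and suffix for every match may be slow on very long descriptions with many occurrences.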

I would do it like this (edit: added anchors to cover most cases):

(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)

This way you will always get 4 groups (which may need to be trimmed), with the following behavior:

  1. If group 1 is empty, there was no word before (group 2 is empty too)
  2. If group 2 is empty, there was only one word before (group 1)
  3. If groups 1 and 2 are both non-empty, they are the words before, in order
  4. If group 3 is empty, there was no word after
  5. If group 4 is empty, there was only one word after
  6. If groups 3 and 4 are both non-empty, they are the words after, in order
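The group behavior listed above can be tried out with a short sketch. The `re.IGNORECASE` flag is carried over from the question and is an assumption here, as are the variable names.

```python
import re

pattern = re.compile(r'(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)', re.IGNORECASE)

m = pattern.search('Parking here is horrible, this shop sucks.')
if m:
    # Every group has an empty alternative, so none of them is None;
    # trim the whitespace and drop the groups that matched nothing
    before = [g.strip() for g in m.group(1, 2) if g.strip()]
    after = [g.strip() for g in m.group(3, 4) if g.strip()]
    print(before, after)
# ['Parking'] ['horrible,', 'this']
```

Here group 2 is empty, which corresponds to case 2 in the list: only one word precedes the match.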

