
Python: finding the two words following a key word

I'm sure I am missing something obvious here, but I have been staring at this code for a while and cannot find the root of the problem.

I want to search through many strings, find all the occurrences of certain keywords, and for each of these hits retrieve (and save) the two words immediately preceding and following the keyword. So far the code I have finds those words, but when there is more than one occurrence of the keyword in a string, the code returns a separate list for each occurrence. How can I aggregate those lists at the observation/string level (so that I can match the result back to string i)?

Here is a mock example of a sample and the desired results. The keyword is "not":

review_list=['I like this book.', 'I do not like this novel, no, I do not.']
results= [[], ['I do not like this I do not']] 

Current results (using the code below) would be: results = [[], ['I do not like this'], ['I do not']]

Here is the code (simplified version):

for i in review_list:
    if (" not " or " neither ") in i:
      z = i.split(' ')
      for x in [x for (x, y) in enumerate(z) if find_not in y]:
        neg_1=[(' '.join(z[max(x-numwords,0):x+numwords+1]))]
        neg1.append(neg_1)

    elif (" not " or " neither ") not in i:
      neg_1=[]
      neg1.append(neg_1)

Again, I am certain this is basic, but as a new Python user, any help will be greatly appreciated. Thanks!

To extract only the words (removing punctuation) from a string such as

'I do not like this novel, no, I do not.'

I recommend regular expressions:

import re
words = re.findall(r'\w+', somestring)
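As a quick check of what this pattern returns, compared with the plain split(' ') used in the question, here is a small sketch on the example sentence (the variable names are only for illustration):

```python
import re

sentence = 'I do not like this novel, no, I do not.'

# Splitting on spaces keeps punctuation attached to the neighbouring word.
split_words = sentence.split(' ')

# \w+ matches runs of letters, digits and underscores, so punctuation is dropped.
regex_words = re.findall(r'\w+', sentence)

print(split_words[-1])  # 'not.'
print(regex_words[-1])  # 'not'
```

One thing to keep in mind: because an apostrophe is not matched by \w, a contraction such as "don't" comes back as two separate words, "don" and "t".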

To find all indices at which a word equals not:

indices = [i for i, w in enumerate(words) if w=='not']

To get the two previous and the two following words as well, I recommend a set to remove duplicates:

allindx = set()
for i in indices:
    for j in range(max(0, i-2), min(i+3, len(words))):
        allindx.add(j)
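To see why the set matters, consider two hits whose windows overlap. The sketch below uses a made-up word list (not from the question) in which the keyword occurs at positions 2 and 4; the shared positions are stored only once:

```python
# Hypothetical word list, chosen so that the two windows overlap.
words = ['I', 'do', 'not', 'or', 'not', 'like', 'this', 'book']
indices = [i for i, w in enumerate(words) if w == 'not']

allindx = set()
for i in indices:
    # Two words on each side of the hit, clipped to the bounds of the list.
    allindx.update(range(max(0, i - 2), min(i + 3, len(words))))

print(indices)          # [2, 4]
print(sorted(allindx))  # [0, 1, 2, 3, 4, 5, 6]
```

With a plain list instead of a set, positions 2, 3 and 4 would be collected twice and the joined string would repeat those words.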

and finally to get all the words in question into a space-joined string:

result = ' '.join(words[i] for i in sorted(allindx))

Now of course we can put all these tidbits together into a function...:

import re

def twoeachside(somestring, keyword):
    # split into words, dropping punctuation
    words = re.findall(r'\w+', somestring)
    # positions of every occurrence of the keyword
    indices = [i for i, w in enumerate(words) if w == keyword]
    allindx = set()
    for i in indices:
        # two words on each side, clipped to the bounds of the list
        for j in range(max(0, i-2), min(i+3, len(words))):
            allindx.add(j)
    result = ' '.join(words[i] for i in sorted(allindx))
    return result

Of course, this function works on a single sentence. To make a list of results from a list of sentences:

review_list = ['I like this book.', 'I do not like this novel, no, I do not.']
results = [twoeachside(s, 'not') for s in review_list]
assert results == ['', 'I do not like this I do not']

the last assert of course just being a check of what the code returns: one string per sentence, empty when the keyword does not occur :-)

EDIT: actually, judging from the example, you somewhat absurdly require the results' items to be lists with a single string item if non-empty, but empty lists if the string in them would be empty. This absolutely weird spec can of course also be met...:

results = [twoeachside(s, 'not') for s in review_list]
results = [[s] if s else [] for s in results]

it just makes no sense whatsoever, but hey!, it's your spec!-)
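The code in the question also mentions a second keyword (" neither ") and a numwords variable. Note that (" not " or " neither ") in i only ever tests for " not ", because or returns its first truthy operand. A minimal sketch that handles several keywords and a configurable window could look like the following; the name context_words, its default arguments and the case-insensitive matching are assumptions for illustration, not part of the original code:

```python
import re

def context_words(text, keywords=('not', 'neither'), numwords=2):
    """Return [] if no keyword occurs, else a one-item list with the joined context."""
    words = re.findall(r'\w+', text)
    # Assumption: keywords are matched case-insensitively.
    wanted = {k.lower() for k in keywords}
    keep = set()
    for i, w in enumerate(words):
        if w.lower() in wanted:
            # numwords words on each side of the hit, clipped to the list bounds.
            keep.update(range(max(0, i - numwords), min(i + numwords + 1, len(words))))
    joined = ' '.join(words[i] for i in sorted(keep))
    return [joined] if joined else []

review_list = ['I like this book.', 'I do not like this novel, no, I do not.']
results = [context_words(s) for s in review_list]
print(results)  # [[], ['I do not like this I do not']]
```

Since the function returns exactly one list per input string, results[i] always corresponds to review_list[i], which is the aggregation the question asks for.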
