Python: finding the two words following a key word

I'm sure I am missing something obvious here, but I have been staring at this code for a while and cannot find the root of the problem.

I want to search through many strings, find all occurrences of certain keywords, and for each hit retrieve (and save) the two words immediately preceding and following the keyword. So far my code finds those words, but when a keyword occurs more than once in a string, it returns a separate list for each occurrence. How can I aggregate those lists at the observation/string level (so that I can match the result back to string i)?

Here is a mock example of a sample and desired results. Keyword is "not":

review_list=['I like this book.', 'I do not like this novel, no, I do not.']
results= [[], ['I do not like this I do not']] 

Current results (using code below) would be: results = [[], ['I do not like this'], ['I do not']]

Here is the code (simplified version):

for i in review_list:
    if (" not " or " neither ") in i:
      z = i.split(' ')
      for x in [x for (x, y) in enumerate(z) if find_not in y]:
        neg_1=[(' '.join(z[max(x-numwords,0):x+numwords+1]))]
        neg1.append(neg_1)

    elif (" not " or " neither ") not in i:
      neg_1=[]
      neg1.append(neg_1)

Again, I am certain this is basic, but as a new Python user, any help will be greatly appreciated. Thanks!

To extract only the words (dropping punctuation), e.g. from a string such as

'I do not like this novel, no, I do not.'

I recommend regular expressions:

import re
words = re.findall(r'\w+', somestring)

To find all indices at which a word equals 'not':

indices = [i for i, w in enumerate(words) if w=='not']
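Note that this equality test matches whole words only and is case-sensitive, unlike the substring test (find_not in y) in the question's code. A small sketch of the difference, using a made-up sentence; whether you want to lowercase before comparing is an assumption on my part, not something the question states:

```python
import re

# Hypothetical sentence chosen to show the edge cases.
sample = "Not that I mind, but nothing here is not a knot."
words = re.findall(r'\w+', sample)

# Exact, case-sensitive comparison: only the lowercase word 'not' matches.
exact = [i for i, w in enumerate(words) if w == 'not']

# Substring test, as in the question's code: also hits 'nothing' and 'knot'.
substring = [i for i, w in enumerate(words) if 'not' in w]

# Exact comparison after lowercasing: also catches the sentence-initial 'Not'.
caseless = [i for i, w in enumerate(words) if w.lower() == 'not']
```

If sentence-initial keywords matter for your reviews, the lowercased comparison is probably the one you want.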

To get the two previous and two following words as well, I recommend a set, so that overlapping windows do not produce duplicates:

allindx = set()
for i in indices:
    for j in range(max(0, i-2), min(i+3, len(words))):
        allindx.add(j)

and finally to get all the words in question into a space-joined string:

result = ' '.join(words[i] for i in sorted(allindx))
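Traced on the second review from the question, the steps so far give the following intermediate values (the comments show what each name holds at that point):

```python
import re

somestring = 'I do not like this novel, no, I do not.'
words = re.findall(r'\w+', somestring)
# words: ['I', 'do', 'not', 'like', 'this', 'novel', 'no', 'I', 'do', 'not']

indices = [i for i, w in enumerate(words) if w == 'not']
# indices: [2, 9]

allindx = set()
for i in indices:
    # Clamp the window to the bounds of the word list.
    for j in range(max(0, i - 2), min(i + 3, len(words))):
        allindx.add(j)
# sorted(allindx): [0, 1, 2, 3, 4, 7, 8, 9]

result = ' '.join(words[i] for i in sorted(allindx))
# result: 'I do not like this I do not'
```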

Now of course we can put all these tidbits together into a function...:

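import re
def twoeachside(somestring, keyword):
    # split into words, dropping punctuation
    words = re.findall(r'\w+', somestring)
    # positions of every exact occurrence of the keyword
    indices = [i for i, w in enumerate(words) if w == keyword]
    # collect each hit plus two words either side, without duplicates
    allindx = set()
    for i in indices:
        for j in range(max(0, i-2), min(i+3, len(words))):
            allindx.add(j)
    result = ' '.join(words[i] for i in sorted(allindx))
    return result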

Of course, this function works on a single sentence. To make a list of results from a list of sentences:

review_list = ['I like this book.', 'I do not like this novel, no, I do not.']
results = [twoeachside(s, 'not') for s in review_list]
assert results == ['', 'I do not like this I do not']

the last assert of course just being a check that the code works as you desire:-)

EDIT: actually judging from the example you somewhat absurdly require the results' items to be lists with a single string item if non-empty but empty lists if the string in them would be empty. This absolutely weird spec can of course also be met...:

results = [twoeachside(s, 'not') for s in review_list]
results = [[s] if s else [] for s in results]
assert results == [[], ['I do not like this I do not']]

it just makes no sense whatsoever, but hey!, it's your spec!-)
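One more thing worth noting about the question's code: (" not " or " neither ") in i does not test for both keywords. The or expression returns its first truthy operand, so the condition is just " not " in i. If you do want several keywords, here is a minimal sketch along the same lines as above. The helper name context_words, its signature, and the lowercasing are my own assumptions; numwords is borrowed from the question's code.

```python
import re

def context_words(text, keywords, numwords=2):
    """Return [joined context] for text, or [] if no keyword occurs.

    Hypothetical helper: name, signature and lowercasing are assumptions.
    """
    keywords = {k.lower() for k in keywords}
    words = re.findall(r'\w+', text)
    keep = set()
    for i, w in enumerate(words):
        if w.lower() in keywords:
            # Window of numwords on each side, clamped to the list bounds.
            keep.update(range(max(0, i - numwords),
                              min(i + numwords + 1, len(words))))
    joined = ' '.join(words[i] for i in sorted(keep))
    # Match the question's format: one-item list, or empty list for no hits.
    return [joined] if joined else []

review_list = ['I like this book.', 'I do not like this novel, no, I do not.']
results = [context_words(s, ['not', 'neither']) for s in review_list]
```

Because the list comprehension yields exactly one entry per review, results[k] always lines up with review_list[k], which is the string-level aggregation the question asks for.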
