
Find a keyword in a text file and catch the n words after this word

I'm doing a basic text-mining application and I need to find a given word (the keyword) and capture just the n words that follow it. For example, in this text I'd want to catch the 3 words after the keyword populations:

The Supplemental Tables consist of 59 detailed tables tabulated on the 2016 1-year microdata for geographies with populations of 20,000 people or more. These Supplemental Estimates are available through American FactFinder and the Census Bureau's application programming interface at the same geographic summary levels as those in the American Community Survey.

The next step will be to split the string and find the number, but that part I have already solved. I've tried different methods (regex, etc.) with no success. How can I do it?

Split the text into words, find the index of the keyword, grab the words at the next indices:

text = 'The Supplemental Tables consist of 59 detailed tables tabulated on the 2016 1-year microdata for geographies with populations of 20,000 people or more. These Supplemental Estimates are available through American FactFinder and the Census Bureau’s application programming interface at the same geographic summary levels as those in the American Community Survey.'
keyword = 'populations'
words = text.split()
index = words.index(keyword)
wanted_words = words[index + 1:index + 4]

If you wish to make the list of three words wanted_words back into a string, use

wanted_text = ' '.join(wanted_words)
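Since the question mentions trying regex, here is one way that approach could look (a sketch; the pattern assumes words are separated by whitespace):

```python
import re

text = ("The Supplemental Tables consist of 59 detailed tables tabulated on the "
        "2016 1-year microdata for geographies with populations of 20,000 people or more.")

# Match the keyword, then capture the next 3 whitespace-separated tokens
match = re.search(r'\bpopulations\s+((?:\S+\s+){2}\S+)', text)
wanted_words = match.group(1).split() if match else []
print(wanted_words)  # ['of', '20,000', 'people']
```

Unlike `words.index(keyword)`, this returns an empty list instead of raising an error when the keyword is missing.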

You could use the nltk library.

from nltk.tokenize import word_tokenize

def sample(string, keyword, n):
    """Return the n words after each occurrence of keyword."""
    output = []
    word_list = word_tokenize(string.lower())
    # Indices of every occurrence of the keyword
    indices = [i for i, x in enumerate(word_list) if x == keyword]
    for index in indices:
        output.append(word_list[index + 1:index + n + 1])
    return output


>>> print(sample(text, 'populations', 3))
[['of', '20,000', 'people']]
>>> print(sample(text, 'tables', 3))
[['consist', 'of', '59'], ['tabulated', 'on', 'the']]
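If nltk isn't installed, a plain regex tokenizer can stand in for word_tokenize (a sketch; `sample_no_nltk` is my own name, and the pattern only approximates nltk's tokenization):

```python
import re

def sample_no_nltk(text, keyword, n):
    # Tokens are runs of word characters, optionally joined by , . or ' (e.g. "20,000")
    word_list = re.findall(r"\w+(?:[,.'’]\w+)*", text.lower())
    return [word_list[i + 1:i + n + 1]
            for i, w in enumerate(word_list) if w == keyword]

print(sample_no_nltk("big tables need big populations of 20,000 people or more.",
                     "populations", 3))
# [['of', '20,000', 'people']]
```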

There are two ways to solve it:

1. Using jieba

jieba.cut can split your sentence into words; just find 'populations' and take the next three words.

2. Using split

raw = 'YOUR_TEXT_CONTENT'
raw_list = raw.split(' ')
start = raw_list.index('populations')
print(raw_list[start + 1:start + 4])  # the three words after the keyword
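Combining the ideas above, a small helper can handle multiple occurrences, trailing punctuation, and a keyword that never appears (a sketch; `words_after` is my own name):

```python
def words_after(text, keyword, n):
    """Return the n words following each occurrence of keyword (case-insensitive)."""
    words = text.split()
    return [words[i + 1:i + 1 + n]
            for i, w in enumerate(words)
            if w.strip('.,;:').lower() == keyword.lower()]

print(words_after('A text with populations of 20,000 people or more.', 'populations', 3))
# [['of', '20,000', 'people']]
```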
