
Find a keyword in a text file and catch the n words after this word

I'm doing a basic text-mining application and I need to find a given word (the keyword) and capture just the n words that follow it. For example, in this text I'd want to catch the 3 words after the keyword populations:

The Supplemental Tables consist of 59 detailed tables tabulated on the 2016 1-year microdata for geographies with populations of 20,000 people or more. These Supplemental Estimates are available through American FactFinder and the Census Bureau's application programming interface at the same geographic summary levels as those in the American Community Survey.

The next step will be to split the string and find the number, but that part I have already solved. I've tried different methods (regex, etc.) with no success. How can I do it?

Split the text into words, find the index of the keyword, grab the words at the next indices:

text = 'The Supplemental Tables consist of 59 detailed tables tabulated on the 2016 1-year microdata for geographies with populations of 20,000 people or more. These Supplemental Estimates are available through American FactFinder and the Census Bureau’s application programming interface at the same geographic summary levels as those in the American Community Survey.'
keyword = 'populations'
words = text.split()
index = words.index(keyword)
wanted_words = words[index + 1:index + 4]

If you wish to make the list of three words wanted_words back into a string, use

wanted_text = ' '.join(wanted_words)
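Since the question mentions trying regex, here is one way that approach could look (a sketch; the pattern assumes words are separated by whitespace):

```python
import re

text = ("The Supplemental Tables consist of 59 detailed tables tabulated on the "
        "2016 1-year microdata for geographies with populations of 20,000 people or more.")

# Match the keyword, then capture the next 3 whitespace-separated tokens
match = re.search(r'\bpopulations\s+((?:\S+\s+){2}\S+)', text)
wanted_words = match.group(1).split() if match else []
print(wanted_words)  # ['of', '20,000', 'people']
```

Unlike `words.index(keyword)`, this returns an empty list instead of raising an error when the keyword is missing.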

You could use the nltk library.

from nltk.tokenize import word_tokenize

def sample(string, keyword, n):
    """Return the n words after each occurrence of keyword."""
    output = []
    word_list = word_tokenize(string.lower())
    # Indices of every occurrence of the keyword
    indices = [i for i, x in enumerate(word_list) if x == keyword]
    for index in indices:
        output.append(word_list[index + 1:index + n + 1])
    return output


>>> print(sample(text, 'populations', 3))
[['of', '20,000', 'people']]
>>> print(sample(text, 'tables', 3))
[['consist', 'of', '59'], ['tabulated', 'on', 'the']]
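If nltk isn't installed, a plain regex tokenizer can stand in for word_tokenize (a sketch; `sample_no_nltk` is my own name, and the pattern only approximates nltk's tokenization):

```python
import re

def sample_no_nltk(text, keyword, n):
    # Tokens are runs of word characters, optionally joined by , . or ' (e.g. "20,000")
    word_list = re.findall(r"\w+(?:[,.'’]\w+)*", text.lower())
    return [word_list[i + 1:i + n + 1]
            for i, w in enumerate(word_list) if w == keyword]

print(sample_no_nltk("big tables need big populations of 20,000 people or more.",
                     "populations", 3))
# [['of', '20,000', 'people']]
```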

There are two ways to solve it:

1. Using jieba

jieba.cut can split your sentence into words; just find 'populations' and take the next three words.

2. Using split

raw = 'YOUR_TEXT_CONTENT'
raw_list = raw.split(' ')
start = raw_list.index('populations')
print(raw_list[start + 1:start + 4])  # the three words after the keyword
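Combining the ideas above, a small helper can handle multiple occurrences, trailing punctuation, and a keyword that never appears (a sketch; `words_after` is my own name):

```python
def words_after(text, keyword, n):
    """Return the n words following each occurrence of keyword (case-insensitive)."""
    words = text.split()
    return [words[i + 1:i + 1 + n]
            for i, w in enumerate(words)
            if w.strip('.,;:').lower() == keyword.lower()]

print(words_after('A text with populations of 20,000 people or more.', 'populations', 3))
# [['of', '20,000', 'people']]
```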
