I'm building a basic text-mining application and I need to find a specific word (a keyword) and capture only the n words that follow it. For example, in this text I want to capture the 3 words after the keyword "populations":
The Supplemental Tables consist of 59 detailed tables tabulated on the 2016 1-year microdata for geographies with populations of 20,000 people or more. These Supplemental Estimates are available through American FactFinder and the Census Bureau's application programming interface at the same geographic summary levels as those in the American Community Survey.
The next step will be to split the resulting string and find the number, but that part I've already solved. I've tried different methods (regex, etc.) with no success. How can I do it?
Split the text into words, find the index of the keyword, grab the words at the next indices:
text = 'The Supplemental Tables consist of 59 detailed tables tabulated on the 2016 1-year microdata for geographies with populations of 20,000 people or more. These Supplemental Estimates are available through American FactFinder and the Census Bureau’s application programming interface at the same geographic summary levels as those in the American Community Survey.'
keyword = 'populations'
words = text.split()
index = words.index(keyword)
wanted_words = words[index + 1:index + 4]
If you wish to turn the list of three words, wanted_words, back into a string, use
wanted_text = ' '.join(wanted_words)
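Note that words.index(keyword) raises ValueError when the keyword is absent, and it only matches the exact token, so 'Populations' or 'populations,' would be missed. A minimal sketch of one way to handle that, assuming you want every occurrence and a case-insensitive match (the name words_after and the set of stripped punctuation characters are my own choices, not anything standard):

```python
def words_after(text, keyword, n):
    """Return the n words following each occurrence of keyword."""
    words = text.split()
    # Compare on a normalized copy: lowercase, surrounding punctuation stripped.
    normalized = [w.strip('.,;:!?"\'()').lower() for w in words]
    target = keyword.lower()
    # Slice from the original words so the captured text is unchanged.
    return [words[i + 1:i + 1 + n]
            for i, w in enumerate(normalized) if w == target]
```

It returns an empty list when the keyword does not occur, so the caller can check for that instead of catching an exception.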
You could use the nltk library.
from nltk.tokenize import word_tokenize

def sample(string, keyword, n):
    output = []
    # Lowercase before tokenizing so the match is case-insensitive.
    word_list = word_tokenize(string.lower())
    # Positions of every occurrence of the keyword.
    indices = [i for i, x in enumerate(word_list) if x == keyword]
    for index in indices:
        output.append(word_list[index + 1:index + n + 1])
    return output

# Here string holds the sample text from the question.
>>> print(sample(string, 'populations', 3))
[['of', '20,000', 'people']]
>>> print(sample(string, 'tables', 3))
[['consist', 'of', '59'], ['tabulated', 'on', 'the']]
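Since the question mentions regex, here is a rough sketch of the same idea using only the standard re module, with no nltk dependency. The function name words_after_regex is made up for illustration. It assumes words are separated by whitespace, so a keyword directly followed by punctuation (e.g. "populations,") yields an empty list for that match, and occurrences of the keyword inside an already captured span are not matched again:

```python
import re

def words_after_regex(text, keyword, n):
    """Capture up to n whitespace-separated words after each keyword match."""
    pattern = re.compile(
        r'\b' + re.escape(keyword) + r'\b'   # the keyword as a whole word
        + r'((?:\s+\S+){0,' + str(n) + r'})',  # up to n following words
        re.IGNORECASE,
    )
    return [m.group(1).split() for m in pattern.finditer(text)]
```

Unlike the tokenizer-based version, this keeps the original casing and any punctuation attached to the captured words.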
There are two ways to solve it.
1. Using jieba
jieba.cut
It splits your sentence into words;
then just find 'populations' and take the next three words.
2. Using split
raw = 'YOUR_TEXT_CONTENT'
raw_list = raw.split(' ')
start = raw_list.index('populations')
# Start at start + 1 so the keyword itself is not included.
print(raw_list[start + 1:start + 4])