I am currently learning to work with NLP. One of the problems I am facing is finding the n words that most commonly occur together in a text. Consider the following:
text=['Lion Monkey Elephant Weed','Tiger Elephant Lion Water Grass','Lion Weed Markov Elephant Monkey Fine','Guard Elephant Weed Fortune Wolf']
Suppose n = 2. I am not looking for the most common bigrams. I am looking for the pairs of words that occur together in the same string most often, regardless of their position. For example, the output for the above should be:
'Lion' & 'Elephant': 3
'Elephant' & 'Weed': 3
'Lion' & 'Monkey': 2
'Elephant' & 'Monkey': 2
and so on.
Could anyone suggest a suitable way to tackle this?
It was tricky, but I solved it for you. I count the spaces in each element to tell how many words it contains: an element with three words has two spaces, so an element with exactly one space has two words. Only the elements with two words are printed:
l = ["hello world", "good night world", "good morning sunshine", "wassap babe"]
for elem in l:
    if elem.count(" ") == 1:
        print(elem)
output
hello world
wassap babe
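Counting spaces only works when words are separated by exactly one space. As a small sketch of a more forgiving variant (the names `phrases` and `two_word` are made up for illustration, and one phrase has a double space added on purpose), the same check can be done by splitting on whitespace and counting the resulting words:

```python
# Hypothetical sample data; "good  morning" contains two spaces on purpose.
phrases = ["hello world", "good night world", "good  morning", "wassap babe"]

# str.split() with no argument splits on any run of whitespace,
# so the word count does not depend on how many spaces separate the words.
two_word = [p for p in phrases if len(p.split()) == 2]

for p in two_word:
    print(p)
```

With the space-counting check, "good  morning" would be skipped because it contains two spaces; here it is kept because it has two words.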
I would suggest using Counter and combinations, as follows.
from collections import Counter
from itertools import combinations, chain

text = ['Lion Monkey Elephant Weed', 'Tiger Elephant Lion Water Grass', 'Lion Weed Markov Elephant Monkey Fine', 'Guard Elephant Weed Fortune Wolf']

def count_combinations(text, n_words, n_most_common=None):
    count = []
    for t in text:
        words = t.split()
        # every group of n_words words taken from this string
        combos = combinations(words, n_words)
        # sort the words inside each group so that word order does not matter
        count.append([" & ".join(sorted(c)) for c in combos])
    # flatten the per-string lists and count how often each group appears
    return dict(Counter(sorted(list(chain(*count)))).most_common(n_most_common))

count_combinations(text, 2)
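One thing to be aware of: if a word is repeated inside a single string, the function above counts the resulting combinations more than once for that string. Below is a minimal sketch of a variant that counts each group at most once per string by de-duplicating the words first. The name `count_cooccurrences` is made up for this example, and it returns tuples as keys instead of joined strings.

```python
from collections import Counter
from itertools import combinations

text = [
    'Lion Monkey Elephant Weed',
    'Tiger Elephant Lion Water Grass',
    'Lion Weed Markov Elephant Monkey Fine',
    'Guard Elephant Weed Fortune Wolf',
]

def count_cooccurrences(sentences, n_words, n_most_common=None):
    """Count groups of n_words distinct words that share a string (hypothetical helper)."""
    counts = Counter()
    for sentence in sentences:
        # set() drops repeated words; sorted() fixes the order inside each
        # group, so ('Elephant', 'Lion') and ('Lion', 'Elephant') are one key.
        unique_words = sorted(set(sentence.split()))
        counts.update(combinations(unique_words, n_words))
    # list of (group, count) tuples, highest counts first
    return counts.most_common(n_most_common)

pairs = dict(count_cooccurrences(text, 2))
print(pairs[('Elephant', 'Lion')])  # 3
print(pairs[('Elephant', 'Weed')])  # 3
print(pairs[('Lion', 'Monkey')])    # 2
```

Passing `n_most_common=2` would limit the result to the two most frequent pairs, and `n_words=3` would count triples in the same way.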