Counting the frequency of three-word phrases
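I have the code below for finding the frequency of two-word phrases. I need to do the same for three-word phrases.
However, the code below doesn't seem to work for three-word phrases.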
from collections import Counter
import re
sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
words = re.findall(r'\w+', sentence)
two_words = [' '.join(ws) for ws in zip(words, words[1:])]
wordscount = {w:f for w, f in Counter(two_words).most_common() if f > 1}
wordscount
{'show makes': 2, 'makes me': 2, 'I love': 2}
I would suggest splitting the functionality out into separate functions:
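def nwise(iterable, n):
    """
    Iterate over n-grams of an iterable.
    Has a bit of an overhead compared to pairwise (although only during
    initialization), so the two functions are implemented independently.
    """
    # One iterator per position in the n-gram (the input must be
    # re-iterable, e.g. a list, so that each iter() call is independent).
    iterables = [iter(iterable) for _ in range(n)]
    # Advance the i-th iterator by i items so the iterators are staggered.
    for index, it in enumerate(iterables):
        for _ in range(index):
            next(it, None)  # the default avoids StopIteration on short inputs
    yield from zip(*iterables)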
Then you can do
two_words = [" ".join(bigram) for bigram in nwise(words, 2)]
and
three_words = [" ".join(trigram) for trigram in nwise(words, 3)]
and so on. You can then use collections.Counter:
three_word_counts = Counter(" ".join(trigram) for trigram in nwise(words, 3))
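One caveat: nwise as written calls iter() on its argument n times, which only gives independent iterators for re-iterable inputs such as lists. The following is a minimal sketch of a variant built on itertools.tee that should also work for one-shot iterators such as generators. The name nwise_tee is hypothetical and not part of the answer above.

```python
from collections import Counter
from itertools import tee
import re


def nwise_tee(iterable, n):
    """Yield n-grams (as tuples) from any iterable, including one-shot iterators."""
    # tee() gives n independent iterators over the same underlying stream.
    iterators = tee(iterable, n)
    # Advance the i-th iterator by i items so the iterators are staggered.
    for index, it in enumerate(iterators):
        for _ in range(index):
            next(it, None)
    return zip(*iterators)


sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
# A generator of words: it can only be consumed once.
word_stream = (m.group() for m in re.finditer(r'\w+', sentence))
counts = Counter(" ".join(trigram) for trigram in nwise_tee(word_stream, 3))
repeated = {phrase: freq for phrase, freq in counts.items() if freq > 1}
print(repeated)  # {'show makes me': 2}
```

tee buffers only the items that the leading iterator has consumed ahead of the trailing ones, so the extra memory here is at most n - 1 words.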
You can use collections.Counter on an iterable of three-word groupings. The latter is built with a generator expression and list slicing.
from collections import Counter
three_words = (words[i:i+3] for i in range(len(words)-2))
counts = Counter(map(tuple, three_words))
wordscount = {' '.join(word): freq for word, freq in counts.items() if freq > 1}
print(wordscount)
{'show makes me': 2}
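Note that we don't call str.join until the very end, to avoid unnecessary repeated string operations. In addition, the tuple conversion is required for Counter, since dict keys must be hashable.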
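The hashability point can be checked directly. This is a small illustrative sketch with made-up sample data (the groups list is an assumption, not taken from the question): counting lists fails, while counting the same groups as tuples works.

```python
from collections import Counter

# Two identical three-word groups, stored as lists (as slicing would produce).
groups = [["show", "makes", "me"], ["show", "makes", "me"]]

# Lists are mutable and therefore unhashable, so they cannot be Counter keys.
raised = False
try:
    Counter(groups)
except TypeError:
    raised = True

# Tuples are hashable, so converting each group first makes counting work.
counts = Counter(map(tuple, groups))
print(raised)                           # True
print(counts[("show", "makes", "me")])  # 2
```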
Try zip(words, words[1:], words[2:]).
For example:
from collections import Counter
import re
sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
words = re.findall(r'\w+', sentence)
three_words = [' '.join(ws) for ws in zip(words, words[1:], words[2:])]
wordscount = {w:f for w, f in Counter(three_words).most_common() if f > 1}
print(wordscount)
Output:
{'show makes me': 2}
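The same zip idea extends to phrases of any length by building the shifted slices in a loop. A minimal sketch, assuming a helper named repeated_ngrams with a min_count parameter (both hypothetical, introduced only for illustration):

```python
import re
from collections import Counter


def repeated_ngrams(sentence, n, min_count=2):
    """Return {phrase: frequency} for n-word phrases seen at least min_count times."""
    words = re.findall(r'\w+', sentence)
    # words[0:], words[1:], ..., words[n-1:] are the n staggered views of the list.
    shifted = [words[i:] for i in range(n)]
    # zip stops at the shortest slice, so only complete n-word windows are produced.
    counts = Counter(' '.join(ws) for ws in zip(*shifted))
    return {phrase: freq for phrase, freq in counts.items() if freq >= min_count}


sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
print(repeated_ngrams(sentence, 2))
print(repeated_ngrams(sentence, 3))  # {'show makes me': 2}
```

Each slice copies the list, so for very long inputs an iterator-based approach may be preferable.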
What about:
from collections import Counter
sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
words = sentence.split()
r = Counter([' '.join(words[i:i+3]) for i in range(len(words)-2)])  # len(words)-2 windows of three words
print(r.most_common(1)[0])  # get the most common three-word phrase
('show makes me', 2)
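With index-based slicing, the range bound decides whether the last phrase is included: a list of k words contains k - n + 1 windows of n words. A short sketch to check the arithmetic, using a hypothetical helper sliding_windows and a toy word list:

```python
def sliding_windows(words, n):
    """Return every run of n consecutive words as a list of tuples."""
    # range(len(words) - n + 1) covers every start index whose window is complete;
    # it is empty when the list has fewer than n words.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


words = "a b c d e".split()
trigrams = sliding_windows(words, 3)
print(len(trigrams))  # 3, i.e. len(words) - 3 + 1
print(trigrams[-1])   # ('c', 'd', 'e')
```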