
How to get most common phrases or words in Python or R

Given some text, how can I get the most common n-grams for n = 1 to 6? I have seen methods that fetch 3-grams (or 2-grams) for a single n at a time, but is there a way to extract the most meaningful maximal-length phrases, and the rest of the words as well?

For example, in this text, used purely for demonstration: fri evening commute can be long. some people avoid fri evening commute by choosing off-peak hours. there are much less traffic during off-peak.

The ideal n-gram results and their counts would be:

fri evening commute: 3,
off-peak: 2,
rest of the words: 1

Any suggestions are appreciated. Thanks.

Python

Consider the NLTK library, which provides an ngrams function that you can use to iterate over values of n.
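As a quick illustration (a minimal sketch, assuming NLTK is installed), ngrams slides a window of length n over a token list and yields tuples of n consecutive tokens:

```python
from nltk import ngrams

tokens = 'some people avoid fri evening commute'.split()

# ngrams() yields tuples of n consecutive tokens
for n in (2, 3):
    print([' '.join(gram) for gram in ngrams(tokens, n)])
```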

A rough implementation would be along the lines of the following, where rough is the key word here:

from nltk import ngrams
from collections import Counter

result = []
sentence = 'fri evening commute can be long. some people avoid fri evening commute by choosing off-peak hours. there are much less traffic during off-peak.'
# Periods are ignored, and hyphenated words are treated as multi-word phrases
sentence = sentence.replace('.', '').replace('-', ' ')

for n in range(len(sentence.split(' ')), 1, -1):
    phrases = []

    for token in ngrams(sentence.split(), n):
        phrases.append(' '.join(token))

    phrase, freq = Counter(phrases).most_common(1)[0]
    if freq > 1:
        result.append((phrase, freq))  # store the count, not the n-gram length
        sentence = sentence.replace(phrase, '')

for phrase, freq in result:
    print('%s: %d' % (phrase, freq))
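One caveat with the rough version above: the loop stops at n = 2, so the leftover single words (the "rest of the words: 1" part of the desired output) are never counted. A minimal sketch of a follow-up pass, assuming the repeated phrases found by the loop have already been stripped out of the sentence:

```python
from collections import Counter

sentence = 'fri evening commute can be long. some people avoid fri evening commute by choosing off-peak hours. there are much less traffic during off-peak.'
sentence = sentence.replace('.', '').replace('-', ' ')

# Suppose the repeated phrases found above have already been removed
for phrase in ('fri evening commute', 'off peak'):
    sentence = sentence.replace(phrase, '')

# Each remaining single word occurs exactly once
for word, freq in Counter(sentence.split()).most_common():
    print('%s: %d' % (word, freq))
```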

As for R, this might help.

