How to get the most common phrases or words in Python or R

Question

Given some text, how can I get the most common n-grams for n = 1 to 6? I've seen methods that do this for 3-grams or 2-grams, one n at a time, but is there a way to extract the longest phrase that makes the most sense, and then all the rest too?

For example, in this text, for demo purposes only: fri evening commute can be long. some people avoid fri evening commute by choosing off-peak hours. there are much less traffic during off-peak.

The ideal output, n-grams with their counts, would be:

fri evening commute: 3,
off-peak: 2,
rest of the words: 1

Any advice appreciated. Thanks.

Answer 1

Python

Consider the NLTK library, which offers an ngrams function that you can call for each value of n.
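To make the idea of iterating over n concrete, here is a minimal sketch that counts every n-gram for n = 1 to 6. It uses only the standard library, with a zip-based sliding window standing in for NLTK's ngrams function; the helper name ngram_counts and the simple tokenization (drop periods, split on whitespace, keep hyphenated words whole) are assumptions made for illustration.

```python
from collections import Counter


def ngram_counts(text, max_n=6):
    """Return {n: Counter of n-gram strings} for n = 1..max_n."""
    # Simplistic tokenization (an assumption): drop periods, split on whitespace
    tokens = text.replace('.', ' ').split()
    counts = {}
    for n in range(1, max_n + 1):
        # Zipping n staggered views of the token list yields each window of n tokens
        windows = zip(*(tokens[i:] for i in range(n)))
        counts[n] = Counter(' '.join(w) for w in windows)
    return counts


text = ('fri evening commute can be long. some people avoid fri evening '
        'commute by choosing off-peak hours. there are much less traffic '
        'during off-peak.')

counts = ngram_counts(text)
for n, counter in counts.items():
    # Show the single most frequent n-gram for each n
    print(n, counter.most_common(1))
```

This gives the raw counts per n, but it does not yet decide which phrase length "makes the most sense"; shorter n-grams inside a frequent longer phrase are still counted separately.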

A rough implementation would be along the following lines, where rough is the key word:

from collections import Counter
from nltk import ngrams

result = []
sentence = 'fri evening commute can be long. some people avoid fri evening commute by choosing off-peak hours. there are much less traffic during off-peak.'
# Drop periods, and split hyphenated words so that "off-peak" is counted as the phrase "off peak"
sentence = sentence.replace('.', '').replace('-', ' ')

# Try the longest n-grams first, then work down to bigrams
for n in range(len(sentence.split()), 1, -1):
    phrases = [' '.join(token) for token in ngrams(sentence.split(), n)]
    if not phrases:
        # Earlier removals may have left fewer than n words
        continue

    phrase, freq = Counter(phrases).most_common(1)[0]
    if freq > 1:
        # Record the phrase with its frequency (not its length n)
        result.append((phrase, freq))
        # Remove the phrase so its sub-phrases are not counted again
        sentence = sentence.replace(phrase, '')

for phrase, freq in result:
    print('%s: %d' % (phrase, freq))
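A variation on the same greedy, longest-first idea is sketched below. It works on a token list instead of the raw string, so removing a phrase cannot match inside another word, and removed positions are marked with None so that the words on either side of a removed phrase are not glued into a new n-gram. It also reports the leftover single words. The function name greedy_phrases, the max_n parameter, and the tokenization are assumptions for illustration, not part of NLTK.

```python
from collections import Counter


def greedy_phrases(tokens, max_n=6):
    """Greedy longest-first phrase extraction over a list of tokens."""
    tokens = list(tokens)
    result = []
    for n in range(min(max_n, len(tokens)), 1, -1):
        while True:
            # Count windows of n tokens, skipping any that touch a removed slot
            counts = Counter(
                tuple(tokens[i:i + n])
                for i in range(len(tokens) - n + 1)
                if None not in tokens[i:i + n]
            )
            if not counts:
                break
            phrase, freq = counts.most_common(1)[0]
            if freq < 2:
                break
            # Blank out non-overlapping occurrences, scanning left to right
            removed, i = 0, 0
            while i <= len(tokens) - n:
                if tuple(tokens[i:i + n]) == phrase:
                    tokens[i:i + n] = [None] * n
                    removed += 1
                    i += n
                else:
                    i += 1
            result.append((' '.join(phrase), removed))
    # Whatever is left is reported as single words
    leftovers = Counter(t for t in tokens if t is not None)
    result.extend(leftovers.most_common())
    return result


text = ('fri evening commute can be long. some people avoid fri evening '
        'commute by choosing off-peak hours. there are much less traffic '
        'during off-peak.')
# Assumed tokenization: drop periods, keep hyphenated words as one token
phrases = greedy_phrases(text.replace('.', ' ').split())
for phrase, freq in phrases:
    print('%s: %d' % (phrase, freq))
```

Keeping "off-peak" as a single token is a design choice here; splitting on hyphens instead would make it surface as the bigram "off peak", as in the code above.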

As for R

This might be helpful

Answer 2

If you plan to use R, I suggest this: https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-usecase-postagging-lemmatisation.html

Answer 1
1 2018-03-31 21:28:51

Answer 2
1 Accepted
