Most common n words in a text

Question

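I am currently learning to work with NLP. One of the problems I am facing is finding the most common n words in a text. Consider the following:

text = ['Lion Monkey Elephant Weed', 'Tiger Elephant Lion Water Grass', 'Lion Weed Markov Elephant Monkey Fine', 'Guard Elephant Weed Fortune Wolf']

Suppose n = 2. I am not looking for the most common bigrams. I am looking for the 2 words that occur together most often in the text, regardless of whether they are adjacent. For the text above, the output should be:

'Lion' & 'Elephant': 3
'Elephant' & 'Weed': 3
'Lion' & 'Monkey': 2
'Elephant' & 'Monkey': 2

and so on.

Could anyone suggest a suitable approach to tackle this?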

Answer 1

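This is tricky, but I solved it for you. I count the spaces to work out how many words elem contains :-) An elem with 3 words must contain 2 spaces, so an elem with exactly 1 space contains 2 words. In this case, only the elems with exactly 2 words are printed.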

l = ["hello world", "good night world", "good morning sunshine", "wassap babe"]
for elem in l:
    # exactly one space means exactly two words
    if elem.count(" ") == 1:
        print(elem)

output

hello world
wassap babe
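
Counting spaces assumes the words are separated by exactly one space, with no leading or trailing whitespace. A minimal sketch of the same filter using str.split(), which tolerates extra whitespace; the helper name has_n_words and the modified sample list are made up for illustration:

```python
def has_n_words(elem, n):
    # str.split() with no arguments splits on any run of whitespace
    # and ignores leading/trailing whitespace
    return len(elem.split()) == n


l = ["hello world", "good night world", "good  morning", " wassap babe "]
two_word_elems = [elem for elem in l if has_n_words(elem, 2)]
print(two_word_elems)
```

Note that this only selects elements by word count; it does not count which words occur together.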

Answer 2

I suggest using Counter and combinations, as follows.

from collections import Counter
from itertools import combinations, chain

text = ['Lion Monkey Elephant Weed', 'Tiger Elephant Lion Water Grass', 'Lion Weed Markov Elephant Monkey Fine', 'Guard Elephant Weed Fortune Wolf']


def count_combinations(text, n_words, n_most_common=None):
    # Count how often each unordered group of n_words words shares a line of text.
    count = []
    for t in text:
        words = t.split()
        combos = combinations(words, n_words)
        # Sort the words within each combination so that word order does not matter.
        count.append([" & ".join(sorted(c)) for c in combos])
    # Flatten the per-line lists, count them, and keep the n_most_common entries (all if None).
    return dict(Counter(sorted(chain(*count))).most_common(n_most_common))

print(count_combinations(text, 2))
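
The counts listed in the question can be checked with a small variant. This is only a sketch, assuming a group of words should be counted at most once per line (hence the set), which matters only if a line repeats a word; pair_counts is a hypothetical name, not part of the answer above:

```python
from collections import Counter
from itertools import combinations

text = ['Lion Monkey Elephant Weed',
        'Tiger Elephant Lion Water Grass',
        'Lion Weed Markov Elephant Monkey Fine',
        'Guard Elephant Weed Fortune Wolf']


def pair_counts(lines, n_words=2):
    counts = Counter()
    for line in lines:
        # set() drops repeated words; sorted() gives every group a canonical order
        unique_words = sorted(set(line.split()))
        counts.update(combinations(unique_words, n_words))
    return counts


counts = pair_counts(text)
print(counts[("Elephant", "Lion")])    # 3
print(counts[("Elephant", "Weed")])    # 3
print(counts[("Lion", "Monkey")])      # 2
print(counts[("Elephant", "Monkey")])  # 2
```

Keeping the groups as tuples instead of joined strings makes it easy to look up a specific pair; counts.most_common(k) still returns the k most frequent groups.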

Answer 2 was accepted on 2020-08-14 10:44:40.
