Finding the most common multi-word phrases in an input file in Python

Question

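Suppose I have a text file. I can easily find the most common single words using Counter. However, I would also like to find multi-word phrases such as "tax year", "fly fishing", "US Capitol" and so on, that is, the words that appear together most often.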

import re
from collections import Counter

with open('full.txt') as f:
    passage = f.read()

words = re.findall(r'\w+', passage)

cap_words = [word for word in words]

word_counts = Counter(cap_words)

for k, v in word_counts.most_common():
    print(k, v)

This is what I have so far, but it only finds single words. How can I find phrases of several words?

Answer 1

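What you are looking for is a way to count bigrams (sequences of two consecutive words).

The nltk library is well suited to many language-related tasks, and you can use Counter from collections for all the counting!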

import nltk
from nltk import bigrams
from collections import Counter

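# passage is the text read from the file in the question.
# word_tokenize may require downloading NLTK's tokenizer data first.
tokens = nltk.word_tokenize(passage)
print(Counter(bigrams(tokens)))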
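
Printing the whole Counter can be noisy on a large file. Below is a minimal sketch of ranking only the top pairs with most_common. The sample text and the adjacent_pairs helper are hypothetical; the helper merely stands in for nltk.bigrams so that the snippet runs without NLTK or its tokenizer data.

```python
import re
from collections import Counter


def adjacent_pairs(tokens):
    """Pair each token with its successor (a stand-in for nltk.bigrams)."""
    return zip(tokens, tokens[1:])


# Hypothetical sample text; in the question this would be read from full.txt.
passage = "The tax year ends soon. Fly fishing is fun. The tax year is long."

# Lowercase first so that "The" and "the" count as the same word.
tokens = re.findall(r"\w+", passage.lower())

pair_counts = Counter(adjacent_pairs(tokens))

# Show only the three most frequent pairs instead of the whole Counter.
for pair, count in pair_counts.most_common(3):
    print(" ".join(pair), count)
```

Note that this simple approach also pairs words across sentence boundaries (for example "soon fly"), which may or may not matter for your data.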

Answer 2

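What you call "multi-words" (there is no such term) are actually called bigrams. You can get a list of bigrams from a list of words by zipping the list with a copy of itself shifted by one position: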

bigrams = [f"{x} {y}" for x, y in zip(words, words[1:])]

P.S. NLTK is indeed the better tool for getting bigrams.
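
The zip trick is not limited to pairs. Here is a minimal sketch, assuming a hypothetical helper named ngram_counts and a made-up word list, that counts phrases of any length n by zipping n shifted copies of the list:

```python
from collections import Counter


def ngram_counts(words, n):
    """Count every run of n consecutive words in the list."""
    # zip stops at the shortest slice, so only complete n-word runs are produced.
    shifted = [words[i:] for i in range(n)]
    return Counter(" ".join(run) for run in zip(*shifted))


# Made-up word list for illustration only.
words = "us capitol building tour of the us capitol building".split()

print(ngram_counts(words, 2).most_common(2))
print(ngram_counts(words, 3).most_common(1))
```

If NLTK is already in use, its nltk.ngrams helper should serve a similar purpose.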

Solution 1
3 Accepted 2021-05-18 18:28:25

Solution 2
0 2021-05-18 18:29:10
