Count the occurrences of each word in the given text files

Question

I'm trying to get a count of each word across a set of files that I read with os.scandir.

import string 
import os

d = dict() 
  
for filename in os.scandir(directory):
    if filename.path.endswith(".txt"):
        f = open(filename, encoding = 'utf-8-sig')
        lines = f.readlines()
        
for line in lines: 
    line = line.strip() 
    line = line.lower() 
    line = line.translate(line.maketrans("", "", string.punctuation)) 
 
    words = line.split(" ") 

    for word in words: 
        if word in d:  
            d[word] = d[word] + 1
    else: 
count 1 
        d[word] = 1

for key in list(d.keys()): 
    print(key, ":", d[key])

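Problem: this prints, but it lists numbers I don't want, and for some reason it isn't counting the true number of occurrences of each word. The files I'm looking at are actually very large, and there are more than 500 of them.

The result of the above is: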

operations : 22
 : 1
10q : 5
overview : 1
highlights : 1
covid19 : 12
million : 5
2019 : 1
profile : 1
xray : 1
business : 5
consumables : 1
products : 2
35 : 1
response : 5
only : 2
follows : 1
procedures : 5
safely : 1
guidelines : 2
safety : 2
initiatives : 4
includes : 4
restrictions : 4
demand : 9
36 : 1
necessary : 2
operates : 3
2020 : 8
cash : 14
pandemic : 8
requirements : 1
drivers : 4
growth : 11
time : 7
37 : 1
developed : 1
future : 12
statements : 10
currencies : 2

This is missing a lot of data, and I'd just like to know where I'm tripping up to cause this.

Any help would be appreciated.

Answer 1

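Here is a super simple approach using the nltk package.

I used a built-in sample text for testing and demonstration. However, you can wrap this in a function and pass the raw text from your files to the word_tokenize() function, which parses the raw text into a list of words. Then pass that list to the FreqDist() class to compute the word frequency distribution, in other words the word counts.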

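from nltk import corpus, FreqDist, word_tokenize

# Needs the NLTK "inaugural" corpus and "punkt" tokenizer data
# (installable with nltk.download(); newer releases may ask for "punkt_tab").

# Test on the first 50 characters of the Inaugural Address.
text = corpus.inaugural.raw()[:50]
words = word_tokenize(text)
dist = FreqDist(words)

for k, v in dist.items():
    print(k, ':', v)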

Original text:

'Fellow-Citizens of the Senate and of the House of '

Output:

Fellow-Citizens : 1
of : 3
the : 2
Senate : 1
and : 1
House : 1

Answer 2

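Your code loops over the files, but it only keeps the contents of the last file in your variable "lines", because you overwrite it on every iteration. To save memory, call another function right after reading each file and pass the contents to it as input. Don't store the contents and loop over them afterwards.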

# Inside the per-file loop, right after opening the file:
data = f.readlines()
for line in data:
    process(line)
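
The suggestion above, handling each file as soon as it is read instead of keeping its lines around, could be sketched roughly as follows. This is only an illustration under assumptions: the names process and count_words_in_dir, the extra counts argument, and the use of collections.Counter are not part of the original answer.

```python
import os
import string
from collections import Counter

# Translation table that deletes ASCII punctuation (same idea as the question's code).
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def process(line, counts):
    """Add the words found in one line to counts (hypothetical helper)."""
    cleaned = line.lower().translate(_STRIP_PUNCT)
    counts.update(cleaned.split())


def count_words_in_dir(directory):
    """Count words across every .txt file in directory, one file at a time."""
    counts = Counter()
    for entry in os.scandir(directory):
        if entry.is_file() and entry.name.endswith(".txt"):
            # Iterating over the file object reads it line by line, so no
            # file's contents are kept around once it has been processed.
            with open(entry.path, encoding="utf-8-sig") as f:
                for line in f:
                    process(line, counts)
    return counts
```

Calling count_words_in_dir on the folder that holds the .txt files returns a single Counter covering all of them, and counts.most_common(20) would then list the 20 most frequent words.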

Answer 3

It looks like the basic problem is that the code isn't indented correctly.

for filename in os.scandir(directory):
    if filename.path.endswith(".txt"):
        f = open(filename, encoding="utf-8-sig")
        lines = f.readlines()

        for line in lines:
            line = line.strip()
            line = line.lower()
            line = line.translate(line.maketrans("", "", string.punctuation))

            words = line.split(" ")

            for word in words:
                if word in d:
                    d[word] = d[word] + 1
                else:
                    # count 1
                    d[word] = 1

Also, I'm not sure what "count 1" is supposed to be; commenting it out works.

This works for me.
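
The output in the question also contains an empty-string key (the " : 1" line) and bare numbers such as 2019 and 35. Here is a minimal sketch of one way to drop both, assuming that purely numeric tokens are the "numbers" the asker does not want. The helper name tokenize is hypothetical.

```python
import string

# Translation table that deletes ASCII punctuation, as in the question's code.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def tokenize(line):
    """Return lowercase words from line, skipping empty and all-digit tokens."""
    cleaned = line.lower().translate(_STRIP_PUNCT)
    # split() without an argument splits on any run of whitespace, so it
    # never yields the empty strings that split(" ") can produce.
    return [word for word in cleaned.split() if not word.isdigit()]


print(tokenize("Cash flow  in 2020: see the 10-Q, page 35."))
# prints ['cash', 'flow', 'in', 'see', 'the', '10q', 'page']
```

Mixed tokens such as 10q or covid19 are kept, because str.isdigit() is only true when every character is a digit.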

3 answers

Answer 1
1 Accepted 2020-09-02 20:19:17

Answer 2
0 2020-09-02 19:05:32

Answer 3
0 2020-09-02 19:11:47
