How to find the 500 most frequent words across 500 text files in Python?

Question

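I have 500 text files in a directory, and I need to find the 500 most frequent words across all of them. How can I do this?

PS: I have searched a lot but could not find a solution.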

Answer 1

Use collections.Counter:

import os
from collections import Counter

c, directory = Counter(), 'path_to_your_directory'

for x in os.listdir(directory):
    fname = os.path.join(directory, x)
    if os.path.isfile(fname):
        with open(fname) as f:
            c += Counter(f.read().split())

for word, _ in c.most_common(500):
    print(word)

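Note that this reads every file found in that directory. If that is not the intended behavior, use glob.glob or glob.iglob with the desired pattern instead of os.listdir (see Reut's comment on my answer).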
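The glob variant mentioned above could look roughly like this. It is a minimal sketch, assuming the files of interest end in .txt; the pattern and the helper name count_words_in are illustrative and not part of the original answer.

```python
import os
from collections import Counter
from glob import iglob


def count_words_in(directory, pattern='*.txt'):
    """Count whitespace-separated words in files matching pattern.

    Hypothetical helper; '*.txt' is an assumed pattern.
    """
    counts = Counter()
    for fname in iglob(os.path.join(directory, pattern)):
        # Skip directories whose names happen to match the pattern.
        if os.path.isfile(fname):
            with open(fname) as f:
                counts.update(f.read().split())
    return counts


if __name__ == '__main__':
    # 'path_to_your_directory' is a placeholder, as in the answer above.
    for word, _ in count_words_in('path_to_your_directory').most_common(500):
        print(word)
```

iglob yields matches lazily, which avoids building the full list of 500 file names up front; glob.glob would work the same way here.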

Answer 2

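This is the most straightforward approach I could think of: count with a dictionary, where the keys are the words and the values are their counts: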

import os
# word counts are stored in a dictionary
# for fast access and duplication prevention
count = {}
# your text files should be in this folder
DIR = "files"
# iterate over all files in the folder
for filename in os.listdir(DIR):
    with open(os.path.join(DIR, filename), 'r') as f:
        for line in f:
            # strip line separators from end of line
            line = line.strip()
            # once we have a line from the file, split it into words, and
            # add each word to the counts (if it's new), or increase its count
            for word in line.split():
                if word in count:
                    # existing
                    count[word] = count[word] + 1
                else:
                    # new
                    count[word] = 1
# print the 500 (word, count) pairs with the highest counts
print(sorted(count.items(), key=lambda x: x[1], reverse=True)[:500])

Using collections.Counter (as Padraic suggested):

import os
from collections import Counter

count = Counter()
DIR = "files"
for filename in os.listdir(DIR):
    with open(os.path.join(DIR, filename), 'r') as f:
        for line in f:
            line = line.strip()
            # count all words in line
            count.update(line.split())
print(count.most_common(500))

Answer 3

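You could keep an array of words together with a counter for each new word. Add each new word to the array. Compare each word from the text files against the words in the array using "index of", and increment that word's counter. Alternatively, you could create a single array, populate it with each NEW word from the text files, and use the second element of each entry as the counter.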
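One possible reading of the first variant, sketched in Python with two parallel lists and list.index standing in for "index of". The function name count_with_lists and the sample strings are made up for illustration; in practice each string would be the contents of one file.

```python
def count_with_lists(texts):
    """Count words using two parallel lists instead of a dictionary."""
    words = []   # each distinct word, in order of first appearance
    counts = []  # counts[i] is the number of times words[i] was seen
    for text in texts:
        for word in text.split():
            try:
                # list.index is Python's "index of"; it scans linearly.
                i = words.index(word)
            except ValueError:
                # First occurrence: register the word with a count of 1.
                words.append(word)
                counts.append(1)
            else:
                counts[i] += 1
    return words, counts


words, counts = count_with_lists(["a b a", "b a c"])
# Pair each word with its count and sort by count, highest first.
ranked = sorted(zip(words, counts), key=lambda pair: pair[1], reverse=True)
print(ranked[:2])  # [('a', 3), ('b', 2)]
```

Because every lookup scans the list, this gets slow as the number of distinct words grows; the dictionary and Counter answers avoid that scan.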

Answer 4

We can use the Counter class from the collections module.

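1. Use glob to pick up only the text files from the target directory.
2. Iterate over all the files from step 1 with a for loop.
3. Open each file in read mode with a with statement and read its contents with the file object's read() method.
4. Split the file contents with the string split() method and pass the result to Counter to build a dictionary of counts. Adding two counters sums their counts. https://docs.python.org/2/library/collections.html
5. Get the most common words from the Counter with the most_common(3) method.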

Code:

from glob import glob
from collections import Counter

p = "/home/vivek/Desktop/test/*.txt"
main_counter = Counter()

for i in glob(p):
    # text mode, so words are counted as str rather than bytes on Python 3
    with open(i, "r") as fp:
        main_counter += Counter(fp.read().split())

print("main_counter:-", main_counter)
print("most common 3:-", main_counter.most_common(3))

Output:

vivek@vivek:~/Desktop$ python 4.py 
main_counter:- Counter({'This': 3, 'try': 2, 'again.': 2, 'is': 2, 'can': 2, 'file': 2, 'you': 2, 'my': 2, '1': 1, 'this': 1, '2': 1})
most common 3:- [('This', 3), ('try', 2), ('again.', 2)]
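Note that the output above counts 'This' and 'this' separately and keeps the trailing period in 'again.', because split() only breaks on whitespace. If that matters for your data, one possible refinement (not part of the original answer) is to lowercase the text and extract runs of word characters with a regular expression. The helper name normalized_words and the sample string are illustrative, and the sketch assumes the file was read in text mode so that the input is a str.

```python
import re
from collections import Counter


def normalized_words(text):
    """Lowercase the text and return its runs of word characters."""
    # \w+ matches letters, digits and underscores, so "again." -> "again".
    return re.findall(r"\w+", text.lower())


sample = "This is my file. this file: try again. Try again."
print(Counter(normalized_words(sample)).most_common(3))
```

To apply it to the files, pass normalized_words(fp.read()) to Counter in place of fp.read().split().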


Answer 1
4 2015-01-03 13:46:15

Answer 2
1 2015-01-03 13:40:43

Answer 3
0 2015-01-03 13:40:47

Answer 4
0 2015-01-03 14:03:16
