How to find the 500 most frequent words in 500 text files in Python?

I have 500 text files in one directory. I need to find the 500 most frequent words across all of the text files combined. How can I achieve that?

PS: I have searched a lot but could not find a solution.

Use collections.Counter:

import os
from collections import Counter

c, directory = Counter(), 'path_to_your_directory'

for x in os.listdir(directory):
    fname = os.path.join(directory, x)
    if os.path.isfile(fname):
        with open(fname) as f:
            c += Counter(f.read().split())

for word, _ in c.most_common(500):
    print(word)

Of course, this reads every file found in that directory. If that's not the intended behavior, use glob.glob or glob.iglob with the required pattern instead of os.listdir (see Reut's comment on my answer).
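A minimal sketch of the glob-based variant mentioned above, assuming the files of interest end in .txt. The function name top_words and the pattern and n parameters are hypothetical names chosen for illustration, not part of the original answer.

```python
import glob
import os
from collections import Counter


def top_words(directory, pattern="*.txt", n=500):
    """Return the n most common whitespace-separated words in matching files."""
    counts = Counter()
    # glob.iglob yields matching paths lazily; names that do not match
    # `pattern` (an assumed default of "*.txt") are never opened
    for path in glob.iglob(os.path.join(directory, pattern)):
        if os.path.isfile(path):
            with open(path) as f:
                counts.update(f.read().split())
    return counts.most_common(n)

# Example call (the directory name is a placeholder):
# for word, _ in top_words('path_to_your_directory'):
#     print(word)
```

Using iglob rather than glob avoids building the full list of paths up front, which matters little for 500 files but costs nothing.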

This is the most straightforward way I could think of: use a dictionary for the counts, with the word as the key and its count as the value:

import os
# word counts are stored in a dictionary
# for fast access and duplication prevention
count = {}
# your text files should be in this folder
DIR = "files"
# iterate over all files in the folder
for filename in os.listdir(DIR):
    path = os.path.join(DIR, filename)
    # skip subdirectories; open() would fail on them
    if not os.path.isfile(path):
        continue
    with open(path, 'r') as f:
        for line in f:
            # strip line separators from end of line
            line = line.strip()
            # once we have a line from the file, split it into words, and
            # add each word to the counts (if it's new), or increase its count
            for word in line.split():
                if word in count:
                    # existing
                    count[word] = count[word] + 1
                else:
                    # new
                    count[word] = 1
# sort (word, count) pairs by count, highest first, and keep the top 500
print(sorted(count.items(), key=lambda x: x[1], reverse=True)[:500])

Using collections.Counter (as Padraic suggested):

import os
from collections import Counter

count = Counter()
DIR = "files"
for filename in os.listdir(DIR):
    path = os.path.join(DIR, filename)
    # skip subdirectories; open() would fail on them
    if not os.path.isfile(path):
        continue
    with open(path, 'r') as f:
        for line in f:
            line = line.strip()
            # count all words in line
            count.update(line.split())
print(count.most_common(500))

You could keep an array of words and a counter for each new word. Add each new word to the array. Compare each word in the text file(s) against the words in the array using an "index of" lookup, and increment the counter for that word. Alternatively, you could create a single array, populate it with every new word from the text files, and use the second element of each entry as its counter.
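The array-based idea above could be sketched roughly as follows, using two parallel lists and list.index as the "index of" lookup. The function name count_with_lists is hypothetical. Each lookup scans the whole word list, so this is expected to be much slower than a dictionary or Counter on large inputs.

```python
def count_with_lists(tokens):
    """Count tokens using two parallel lists instead of a dictionary."""
    words = []   # each distinct word, in order of first appearance
    counts = []  # counts[i] is the number of times words[i] was seen
    for token in tokens:
        try:
            i = words.index(token)  # linear "index of" search
        except ValueError:
            # first time this word is seen
            words.append(token)
            counts.append(1)
        else:
            counts[i] += 1
    # pair each word with its count and sort by count, highest first
    return sorted(zip(words, counts), key=lambda pair: pair[1], reverse=True)
```

Slicing the result with [:500] would give the 500 most frequent words.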

We can use the Counter class from the collections module.

  1. Read only the text files from the target directory using glob.
  2. Iterate over all files from step 1 with a for loop.
  3. Open each file in read mode using a with statement, and read its contents with the file object's read() method.
  4. Split the file content with the string split() method and use Counter to build a dictionary of counts. Adding two counters together sums their counts. https://docs.python.org/2/library/collections.html
  5. Get the most common words from the Counter with the most_common(3) method (pass 500 instead of 3 for the original question).

Code:

from glob import glob
from collections import Counter

# only files ending in .txt in this directory are read
p = "/home/vivek/Desktop/test/*.txt"
main_counter = Counter()

for i in glob(p):
    # text mode, so the counted words are str rather than bytes
    with open(i, "r") as fp:
        main_counter += Counter(fp.read().split())

print("main_counter:-", main_counter)
print("most common 3:-", main_counter.most_common(3))

Output:

vivek@vivek:~/Desktop$ python 4.py 
main_counter:- Counter({'This': 3, 'try': 2, 'again.': 2, 'is': 2, 'can': 2, 'file': 2, 'you': 2, 'my': 2, '1': 1, 'this': 1, '2': 1})
most common 3:- [('This', 3), ('try', 2), ('again.', 2)]
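Step 4 above relies on Counter addition. Here is a small self-contained sketch of that behavior, using two made-up sentences in place of file contents (the variable names are illustrative only):

```python
from collections import Counter

# two stand-ins for the contents of separate files
first = Counter("this is my file".split())
second = Counter("this is my other file".split())

# adding two counters sums the counts of matching keys
combined = first + second

# most_common(n) returns a list of (word, count) pairs, highest count first
top_two = combined.most_common(2)
print(combined["this"], combined["other"])
print(top_two)
```

Note that words are compared exactly as split, so 'This' and 'this', or 'again' and 'again.', are counted separately, as the output above shows.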
