
Python - Finding word frequencies of list of words in text file

I am trying to speed up a project that counts word frequencies. I have 360+ text files, and for each one I need the total number of words and the number of times each word from a separate list of words appears. I know how to do this with a single text file.

>>> import os
>>> os.chdir(r"C:\Users\Cameron\Desktop\PDF-to-txt")
>>> filename = "1976.03.txt"
>>> with open(filename, "r") as textfile:
...     input_string = textfile.read()
...
>>> # split() with no argument splits on any run of whitespace
>>> word_list = input_string.lower().split()
>>> print('Words in text:', len(word_list))
# prints the number of words in the text file
>>> word_list.count('inflation')
# prints the number of times 'inflation' occurs in the text file
>>> word_list.count('jobs')
>>> word_list.count('output')

It's too tedious to get the frequencies of 'inflation', 'jobs', and 'output' individually. Can I put these words into a list and find the frequency of all the words in the list at the same time? Basically, I want to do this with Python.

Example: Instead of this:

>>> word_list.count('inflation')
3
>>> word_list.count('jobs')
5
>>> word_list.count('output')
1

I want to do this (I know this isn't real code, this is what I'm asking for help on):

>>> list1 = 'inflation', 'jobs', 'output'
>>> word_list.count(list1)
'inflation', 'jobs', 'output'
3, 5, 1

My list of words is going to have 10-20 terms, so I need to be able to just point Python toward a list of words to get the counts of. It would also be nice if the output could be copied and pasted into an Excel spreadsheet, with the words as columns and the frequencies as rows.

Example:

inflation, jobs, output
3, 5, 1

And finally, can anyone help automate this for all of the text files? I figure I can just point Python toward the folder and have it do the above word counting from the new list for each of the 360+ text files. Seems easy enough, but I'm a bit stuck. Any help?

An output like this would be fantastic:

Filename1
inflation, jobs, output
3, 5, 1

Filename2
inflation, jobs, output
7, 2, 4

Filename3
inflation, jobs, output
9, 3, 5

Thanks!

collections.Counter() has this covered, if I understand your problem.

The example from the docs would seem to match your problem.

from collections import Counter

# Tally occurrences of words in a list
cnt = Counter()
for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
    cnt[word] += 1
print(cnt)


# Find the ten most common words in Hamlet
import re
words = re.findall(r'\w+', open('hamlet.txt').read().lower())
Counter(words).most_common(10)

From the example above you should be able to do:

import re
import collections

# read the whole file, lowercase it, and pull out runs of word characters
with open('1976.03.txt') as f:
    words = re.findall(r'\w+', f.read().lower())
print(collections.Counter(words))
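
Since a Counter returns 0 for keys it has never seen, the counts for just the words of interest can be read straight off it. A minimal sketch, assuming a hypothetical helper name `format_counts` and a made-up sample string standing in for the contents of a real file:

```python
import re
from collections import Counter

def format_counts(text, targets):
    """Return (header, counts) strings for the target words in text.

    format_counts is a hypothetical helper name, not a library function.
    """
    words = re.findall(r'\w+', text.lower())
    counter = Counter(words)
    header = ', '.join(targets)
    # Counter returns 0 for missing keys, so absent words need no special case
    counts = ', '.join(str(counter[w]) for w in targets)
    return header, counts

# Made-up sample text standing in for a real file's contents (an assumption)
sample = "Inflation rose. Jobs and output fell; inflation hit jobs again."
list1 = ['inflation', 'jobs', 'output']
header, counts = format_counts(sample, list1)
print(header)
print(counts)
```

The two printed lines follow the comma-separated layout asked for in the question, so they can be pasted into a spreadsheet as a header row and a data row.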

EDIT: a naive approach to show one way.

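import re
from collections import Counter

# a set of whole words; testing against the plain string would also match substrings
wanted = set("fish chips steak".split())
cnt = Counter()
with open('1976.03.txt') as f:
    words = re.findall(r'\w+', f.read().lower())
for word in words:
    if word in wanted:
        cnt[word] += 1
print(cnt)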
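
One detail worth checking in a filter like this is what kind of object the wanted words are stored in. A small sketch, using made-up sample words, of how the `in` test behaves against a single string compared with a set of words:

```python
wanted_string = "fish chips steak"
wanted_set = set(wanted_string.split())

# Made-up words to test; 'hip' and 'is' are fragments of wanted words
candidates = ['fish', 'hip', 'chips', 'is']

# Against a str, `in` is a substring test, so fragments match too
matched_in_string = [w for w in candidates if w in wanted_string]
# Against a set, `in` only matches whole words
matched_in_set = [w for w in candidates if w in wanted_set]

print(matched_in_string)
print(matched_in_set)
```

Only the set version keeps the filter limited to the intended words, and set lookups also stay fast as the word list grows.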

One possible implementation (using Counter)...

Instead of printing the output, I think it would be simpler to write to a CSV file and import that into Excel. Look at http://docs.python.org/2/library/csv.html and replace print_summary.

import os
from collections import Counter
import glob

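def word_frequency(fileobj, words):
    """Build a Counter of specified words in fileobj"""
    # initialise the counter to 0 for each word, so absent words still report 0
    ct = Counter(dict((w, 0) for w in words))
    # lowercase each line so 'Inflation' and 'inflation' are counted together
    file_words = (word for line in fileobj for word in line.lower().split())
    filtered_words = (word for word in file_words if word in words)
    # add to the pre-filled counter instead of discarding it
    ct.update(filtered_words)
    return ct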


def count_words_in_dir(dirpath, words, action=None):
    """For each .txt file in a dir, count the specified words"""
    for filepath in glob.iglob(os.path.join(dirpath, '*.txt')):
        with open(filepath) as f:
            ct = word_frequency(f, words)
            if action:
                action(filepath, ct)


def print_summary(filepath, ct):
    words = sorted(ct.keys())
    counts = [str(ct[k]) for k in words]
    print('{0}\n{1}\n{2}\n\n'.format(
        filepath,
        ', '.join(words),
        ', '.join(counts)))


words = set(['inflation', 'jobs', 'output'])
count_words_in_dir('./', words, action=print_summary)
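
Following the suggestion to write CSV instead of printing, here is a rough sketch of what a replacement for print_summary might look like. The name `make_csv_action` is hypothetical; the function it returns takes the same `(filepath, ct)` arguments as the `action` callback above. The demo writes to an in-memory buffer with illustrative counts rather than reading real files.

```python
import csv
import io
from collections import Counter

def make_csv_action(outfile, words):
    """Return an action(filepath, ct) that writes one CSV row per file.

    make_csv_action is a hypothetical name used for illustration only.
    """
    columns = sorted(words)
    writer = csv.writer(outfile)
    # header row: a file name column followed by one column per word
    writer.writerow(['filename'] + columns)

    def action(filepath, ct):
        # ct[w] is 0 for words that never appeared, because ct is a Counter
        writer.writerow([filepath] + [ct[w] for w in columns])

    return action

# Demo with an in-memory buffer and illustrative counts (assumptions, not real data)
buffer = io.StringIO()
write_row = make_csv_action(buffer, {'inflation', 'jobs', 'output'})
write_row('Filename1', Counter({'inflation': 3, 'jobs': 5, 'output': 1}))
write_row('Filename2', Counter({'inflation': 7, 'jobs': 2}))
print(buffer.getvalue())
```

To write a real file, one option is to open it with `newline=''`, as the csv documentation recommends, and pass the returned function as `action=` to count_words_in_dir. This gives one row per file with the words as columns, which Excel can open directly.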

Simple function-based code to count word frequencies in a text file:

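import string

def process_file(filename):
    """Return a dict mapping each word in the file to its count."""
    hist = dict()
    # open in text mode so that each line is a str
    with open(filename, 'r') as f:
        for line in f:
            process_line(line, hist)
    return hist

def process_line(line, hist):
    # split hyphenated words into their parts
    line = line.replace('-', ' ')

    for word in line.split():
        word = word.strip(string.punctuation + string.whitespace)
        # str.lower() returns a new string, so the result must be assigned
        word = word.lower()
        # skip tokens that were nothing but punctuation
        if word:
            hist[word] = hist.get(word, 0) + 1

filename = '1976.03.txt'  # file name taken from the question
hist = process_file(filename)
print(hist)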
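
A dictionary like hist also yields the two numbers the question asks for: the total word count is the sum of its values, and each target word can be looked up with .get(). A minimal sketch, assuming a hypothetical helper `summarise` and a small hand-made dictionary in place of a real file's histogram:

```python
def summarise(hist, targets):
    """Return (total_words, {target: count}) from a word -> count dict.

    summarise is a hypothetical helper name used for illustration.
    """
    total = sum(hist.values())
    # .get(word, 0) reports 0 for target words that never appeared
    selected = {word: hist.get(word, 0) for word in targets}
    return total, selected

# Hand-made histogram standing in for real process_file() output (an assumption)
sample_hist = {'inflation': 3, 'jobs': 5, 'the': 40, 'rose': 2}
total, selected = summarise(sample_hist, ['inflation', 'jobs', 'output'])
print('Words in text:', total)
print(selected)
```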

import codecs
import os
from collections import Counter

path = 'C:\\Users\\user\\Desktop\\sentiment2020\\POSITIVE'

# collect every .txt file under path, including subdirectories
files = []
for root, dirs, filenames in os.walk(path):
    for name in filenames:
        if name.endswith('.txt'):
            files.append(os.path.join(root, name))

if __name__ == "__main__":
    dicts = []
    for filepath in files:
        print(filepath)
        # read and split inside the loop so every file is counted, not just the last one
        with codecs.open(filepath, 'r', 'utf8', errors='ignore') as file1:
            words = file1.read().split()
        # Counter tallies all words in one pass instead of calling list.count() per word
        dicts.append(dict(Counter(words)))
    # print one word -> count dictionary per file
    for d in dicts:
        print(d)



#  for i in range(len(files)):
  #    with codecs.open('C:\\Users\\user\\Desktop\\sentiment2020\\NEGETIVE1\\sad1%s.txt' % i, 'w',"utf8") as filehandle:
  #         filehandle.write('%s\n' % dicts) 
