
Counting every word in a text file only once using Python

I have a small Python script I am working on for a class homework assignment. The script reads a file and prints the 10 most frequent and 10 least frequent words along with their frequencies. For this assignment, a word is defined as 2 letters or more. I have the word frequencies working just fine; however, the third part of the assignment is to print the total number of unique words in the document, meaning every distinct word in the document is counted only once.

Without changing my current script too much, how can I count all the words in the document only one time?

P.S. I am using Python 2.6, so please don't mention the use of collections.Counter.

from string import punctuation
from collections import defaultdict
import re

number = 10
words = {}
total_unique = 0
words_only = re.compile(r'^[a-z]{2,}$')
counter = defaultdict(int)


"""Define words as 2+ letters"""
def count_unique(s):
    count = 0
    if word in line:
        if len(word) >= 2:
            count += 1
    return count


"""Open text document, read it, strip it, then filter it"""
txt_file = open('charactermask.txt', 'r')

for line in txt_file:
    for word in line.strip().split():
        word = word.strip(punctuation).lower()
        if words_only.match(word):
            counter[word] += 1


# Most Frequent Words
top_words = sorted(counter.iteritems(),
                    key=lambda (word, count): (-count, word))[:number]

print "Most Frequent Words: "

for word, frequency in top_words:
    print "%s: %d" % (word, frequency)


# Least Frequent Words:
least_words = sorted(counter.iteritems(),
                    key=lambda (word, count): (count, word))[:number]

print " "
print "Least Frequent Words: "

for word, frequency in least_words:
    print "%s: %d" % (word, frequency)


# Total Unique Words:
print " "
print "Total Number of Unique Words: %s " % total_unique

Count the number of keys in your counter dictionary:

total_unique = len(counter.keys())

Or more simply:

total_unique = len(counter)
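
This works because each distinct word becomes exactly one key in counter, so the length of the dictionary is the number of unique words. A minimal sketch of that idea, assuming a small made-up sample in place of charactermask.txt (the sample lines and the helper name count_words are hypothetical, not part of the original script):

```python
import re
from collections import defaultdict
from string import punctuation

# Same filter as in the question: lowercase words of 2+ letters.
words_only = re.compile(r'^[a-z]{2,}$')


def count_words(lines):
    """Hypothetical helper: return a word -> frequency mapping."""
    counter = defaultdict(int)
    for line in lines:
        for word in line.strip().split():
            word = word.strip(punctuation).lower()
            if words_only.match(word):
                counter[word] += 1
    return counter


# Made-up sample standing in for the lines of the text file.
sample = ["The cat saw the dog.", "A dog saw the cat!"]
counter = count_words(sample)

total_unique = len(counter)  # one key per distinct word
print("Total Number of Unique Words: %s" % total_unique)
```

One caveat: reading counter[word] for a word that was never counted inserts it with a value of 0, so a lookup like that after counting would inflate len(counter). Checking with `word in counter` or counter.get(word, 0) avoids this.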

A defaultdict is great, but it might be more than what you need. You will need it for the part about the most frequent words. But in the absence of that requirement, using a defaultdict is overkill. In such a situation, I would suggest using a set instead:

words = set()
for line in txt_file:
    for word in line.strip().split():
        word = word.strip(punctuation).lower()
        if words_only.match(word):
            words.add(word)
num_unique_words = len(words)

Now words contains only unique words.
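
To illustrate why a set gives the unique count directly, here is a small sketch run on made-up lines instead of the real file (the sample text is an assumption for illustration only). Adding a word that is already in the set changes nothing, so repeats are ignored automatically:

```python
import re
from string import punctuation

# Same filter as in the question: lowercase words of 2+ letters.
words_only = re.compile(r'^[a-z]{2,}$')

# Made-up sample standing in for the lines of the text file.
sample = ["It is what it is.", "Is it? I think so."]

words = set()
for line in sample:
    for word in line.strip().split():
        word = word.strip(punctuation).lower()
        if words_only.match(word):
            words.add(word)  # adding a duplicate is a no-op

num_unique_words = len(words)
print(sorted(words))
```

The trade-off is that a set keeps no frequencies, so it can answer the unique-word question but not the most and least frequent ones.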

I am only posting this because you say that you are new to Python, so I want to make sure that you are aware of sets as well. Again, for your purposes, a defaultdict works fine and is justified.
