Counting the number of words and unique words in a txt file - Python

Question

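I am trying to read a text file, strip the punctuation, convert everything to lowercase, and then print the total number of words, the total number of unique words (for example, "a" is counted only once even if it appears 20 times in the text), and finally the most frequent words with their frequencies (e.g. a: 20).

I realize there are similar questions on StackOverflow, but I'm a beginner trying to solve this with as few imports as possible, and I'd like to know whether it can be coded without importing something like collections.

My code is below, but I don't understand why I'm not getting the answer I need. It prints the whole text file (one word per line, with all punctuation removed) and then prints: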

e 1
n 1
N 1
o 1

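I think this is the word "None" split into characters, each with its frequency. Why does my code give me this result, and what should I change?

The code is as follows: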

file=open("C:\\Users\\Documents\\AllSonnets.txt", "r")


def strip_sonnets():
    import string
    new_file=file.read().split()
    for words in new_file:
        data=words.translate(string.punctuation)
        data=data.lower()
        data=data.strip(".")
        data=data.strip(",")
        data=data.strip("?")
        data=data.strip(";")
        data=data.strip("!")
        data=data.replace("'","")
        data=data.replace('"',"")
        data=data.strip(":")
        print(data)

new_file=strip_sonnets()
new_file=str(new_file)

count={}
for w in new_file:
    if w in count:
        count[w] += 1
    else:
        count[w] = 1
for word, times in count.items():
    print (word, times)

Answer 1

If you only want to remove punctuation from the ends of words, you don't need translate. A collections.Counter dict will also do the word counting for you:

from collections import Counter
from string import punctuation


with open("in.txt") as f:
    c = Counter(word.rstrip(punctuation) for line in f for word in line.lower().split())

# print each word and how many times it appears
for k, freq in c.items():
    print(k, freq)

To see the words ordered from most to least frequent, use .most_common():

for k,v in c.most_common():
    print(k,v)
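
The same Counter also answers the other two parts of the question, the total and the unique word counts. A minimal sketch, assuming the made-up string `text` below stands in for the contents of the file:

```python
from collections import Counter
from string import punctuation

# Hypothetical sample text standing in for the contents of the file
text = "The cat and the dog and the bird."

c = Counter(word.rstrip(punctuation) for word in text.lower().split())

total_words = sum(c.values())              # every occurrence counts
unique_words = len(c)                      # each distinct word counts once
top_word, top_freq = c.most_common(1)[0]   # most frequent word and its count

print(total_words, unique_words)
print(top_word, top_freq)
```

Passing 1 to most_common() limits the result to the single most frequent entry; a larger number returns that many entries.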

Without importing Counter, use dict.get:

c = {}
with open("in.txt") as f:
    for line in f:
        for word in line.lower().split():
            key = word.rstrip(punctuation)
            c[key] = c.get(key, 0) + 1

Then sort by frequency:

from operator import itemgetter

for k,v in sorted(c.items(),key=itemgetter(1),reverse=True):
    print(k,v)
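
Since the question asks for as few imports as possible, the sorting step also works without operator.itemgetter if a lambda is passed as the key. The sketch below puts the pieces together with no imports at all. The function name `count_words`, the hand-written `PUNCT` string, and the sample sentence are assumptions made for illustration; they are not part of the original code:

```python
# Punctuation to trim from word ends, written out by hand to avoid importing string
PUNCT = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"

def count_words(text):
    """Return a dict mapping each lowercased, punctuation-trimmed word to its count."""
    counts = {}
    for word in text.lower().split():
        key = word.strip(PUNCT)  # strip() trims both ends of the word
        if key:  # skip tokens that were nothing but punctuation
            counts[key] = counts.get(key, 0) + 1
    return counts

counts = count_words("To be, or not to be: that is the question.")

total_words = sum(counts.values())   # every occurrence counts
unique_words = len(counts)           # each distinct word counts once
# Sort (word, count) pairs by count, highest first, without importing itemgetter
by_freq = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

print(total_words, unique_words)
print(by_freq[0])
```

For a real file, the string passed to `count_words` would be the result of reading the file, for example with f.read() inside a with open(...) block.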

You see "None" because you set new_file=strip_sonnets() and your function doesn't return anything; any function that doesn't specify a return value returns None by default.

You then set new_file=str(new_file), so for w in new_file iterates over each character of the string "None".
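
The effect is easy to reproduce in isolation. In this sketch, `prints_only` is a made-up stand-in for any function that prints but has no return statement:

```python
def prints_only():
    # No return statement, so calling this evaluates to None
    print("some word")

result = prints_only()
as_text = str(result)      # the four-character string "None"

count = {}
for ch in as_text:         # iterating over a string yields single characters
    count[ch] = count.get(ch, 0) + 1

print(result is None)
print(sorted(count.items()))
```

Each of the four characters appears once, which matches the four lines of output shown in the question.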

You need to return the data:

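def strip_sonnets():
    cleaned = []  # collect every processed word
    new_file=file.read().split()
    for words in new_file:
        # words.translate(string.punctuation) dropped: a plain string used as
        # the table does not delete punctuation in Python 3
        data=words.lower()
        data=data.strip(".")
        data=data.strip(",")
        data=data.strip("?")
        data=data.strip(";")
        data=data.strip("!")
        data=data.replace("'","")
        data=data.replace('"',"")
        data=data.strip(":")
        cleaned.append(data)
    # the caller must also drop new_file=str(new_file) so the loop sees words
    return cleaned # return every word, not just the last one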

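I would simplify your function into a generator that yields every word lowercased and with trailing punctuation stripped. It yields from inside the with block so the file stays open while the words are being read:

from string import punctuation

path = "C:\\Users\\Documents\\AllSonnets.txt"

def strip_sonnets():
    with open(path, "r") as f:
        for line in f:
            for word in line.split():
                yield word.lower().rstrip(punctuation)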

.rstrip(punctuation) basically does what your code was trying to do with strip and replace.
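
One difference is worth knowing: rstrip only trims the right-hand end of a word, strip trims both ends, and neither touches characters in the middle. A short illustration, using made-up sample tokens:

```python
from string import punctuation

token = '"hello,'   # hypothetical token with punctuation on both ends
inner = "don't!"    # hypothetical token with an apostrophe in the middle

right_only = token.rstrip(punctuation)   # trailing comma removed, leading quote kept
both_ends = token.strip(punctuation)     # both ends trimmed
inner_kept = inner.strip(punctuation)    # the inner apostrophe is untouched

print(right_only)
print(both_ends)
print(inner_kept)
```

So if apostrophes inside words should also disappear, as the replace calls in the question do, those calls are still needed after stripping.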

Solution 1
0 2015-03-29 12:12:16