How do I find the frequencies of the ten most common words in a file?

Question

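I'm writing a function in Python that takes the name of a text file (as a string) as input. The function should first determine how many times each word appears in the file. Later, I'll make a bar chart showing the frequencies of the ten most common words in the file, with a second bar next to each one whose height is the frequency predicted by Zipf's law. I already have some of the charting code, but I need help finding the most common words in the text file.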

def zipf_graph(text_file):
    import string
    # Read the whole file; the with block closes it automatically
    with open(text_file, encoding='utf8') as file:
        text = file.read()

    # Remove punctuation, then lowercase the text once
    punc = string.punctuation + '’”—⎬⎪“⎫'
    new_text = text
    for char in punc:
        new_text = new_text.replace(char, '')
    new_text = new_text.lower()
    text_split = new_text.split()

I'm stuck here. I tried to find the most common string in the list, but I'm not sure where to go from here. This is what I tried:

    words = text_split
    most_common = max(words, key = words.count)
    # print(most_common)

I also want to include the following code, since it was suggested as helpful:

    # Sorting a list by frequency
    # Assumes you have your elements as (word, frequency) tuples
    # (Useful for the zipf function)
    words = [('the', 1), ('and', 1), ('test',2)]
    sorted(words, key = lambda x: x[1], reverse = True)

    # "Sorting" a dictionary by frequency
    # Assumes you have your elements as word:frequency
    # (Useful for the zipf function)
    words = dict()
    words['the'] = 1
    words['and'] = 1
    words['test'] = 2

    # This returns a list of just the most common words without their frequencies
    most_common_words = sorted(words, key = words.get, reverse = True)
    # print(most_common_words)

    # We can go back to the dictionary to get the frequencies
    for word in most_common_words:
        print(word, words[word])

zipf_graph('fortune.txt') #name of the file I chose to use

Answer 1

I suggest using Counter from collections.

from collections import Counter

text_split = ["a", "b", "c", "a", "c", "d", "a", "d", "b"]
word_and_freq = Counter(text_split)
top = word_and_freq.most_common(2)

print(top)

Conveniently, this returns the result in the format you want:

[('a', 3), ('b', 2)]
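To connect this with the cleanup step in the question, here is a minimal sketch that strips punctuation, lowercases, splits, and counts in one helper. The name top_words and the sample sentence are made up for illustration, and the extra punctuation characters are only a subset of the ones listed in the question; adjust them to your text.

```python
import string
from collections import Counter

def top_words(text, n=10):
    """Return the n most common words in text as (word, count) pairs."""
    # Hypothetical helper: the extra characters are typographic quotes
    # and dashes that string.punctuation does not cover.
    extra = "’”—“"
    table = str.maketrans("", "", string.punctuation + extra)
    # Delete punctuation in one pass, lowercase, then split on whitespace
    words = text.translate(table).lower().split()
    return Counter(words).most_common(n)

sample = "The cat saw the dog. The dog ran!"
print(top_words(sample, 2))
```

For a file, read its contents as in the question and pass the resulting string to top_words; the default n=10 gives the ten most common words.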

Answer 2

You can use the nltk library:

import nltk
words = ['words', 'in', 'the', 'file']
fd = nltk.FreqDist(words)
fd.most_common(10)

This returns the values in the following format:

[('file', 1), ('words', 1), ('in', 1), ('the', 1)]
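A list of (word, count) pairs like this can feed the Zipf comparison mentioned in the question. The sketch below assumes the simplest form of Zipf's law, where the word at rank r is predicted to occur about f1 / r times, with f1 being the observed count of the most common word. The helper name zipf_prediction and the sample words are hypothetical, and Counter is used here so the example runs without nltk installed.

```python
from collections import Counter

def zipf_prediction(top_words):
    """Pair each (word, count) with the count predicted for its rank."""
    if not top_words:
        return []
    # Assumption: predicted count at rank r is (count of rank 1) / r
    top_count = top_words[0][1]
    return [(word, count, top_count / rank)
            for rank, (word, count) in enumerate(top_words, start=1)]

words = ["the"] * 6 + ["of"] * 3 + ["and"] * 2 + ["zipf"]
top = Counter(words).most_common(3)
rows = zipf_prediction(top)
for word, observed, predicted in rows:
    print(word, observed, predicted)
```

Each row holds the word, its observed count, and the predicted count, which are the two bar heights the chart needs for that word.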

2 answers

Answer 1
5 2021-03-02 17:14:25

Answer 2
3 2021-03-02 17:11:27
