Using NLTK to count the frequency of a list of words in a corpus

Question

I have downloaded a corpus and tokenised the words. I have a list of the main characters and I want to find out how many times each name appears in the corpus. I have tried using a frequency function with a dictionary, but I don't know how to get the count for each name.

character_list = ['Myriel','Bishop','Baptistine','Magloire','Cravatte','Valjean','Gervais','Fantine','Tholomyès'
                  ,'Blachevelle','Dahlia','Fameuil','Favourite','Listolier','Zéphine','Cosette','Thénardier',
                  'Éponine','Azelma','Javert','Fauchelevent','Bamatabois','Champmathieu',
                  'Brevet','Simplice','Chenildieu','Cochepaille','Innocente','Reverend','Ascension','Crucifixion',
                  'Gavroche','Magnon',
                  'Gillenormand','Marius','Colonel','Mabeuf','Enjolras','Combeferre','Prouvaire',
                 'Feuilly','Courfeyrac','Bahorel','Lesgle','Joly','Grantaire','Patron-Minette','Brujon',
                 'Toussaint'] 


fdist_mis = FreqDist(word_tokens)

filtered_word_freqt = dict((character_list, freq) for character_list, freq in fdist_mis.items())

When I explore filtered_word_freqt, it just returns all of the word tokens instead of a dictionary of the unique characters and their occurrences. Any help? Thanks a lot.

Answer 1

How would you like to see the frequency? You can get a count of the number of times each word was seen, a ratio of how often it occurs within the total text, or even a formatted table. Relevant functions, copied from the NLTK documentation for FreqDist:

N()
Return the total number of sample outcomes that have been recorded by this FreqDist. For the number of unique sample values (or bins) with counts greater than zero, use FreqDist.B().
Return type:    int

freq(sample)
Return the frequency of a given sample. The frequency of a sample is defined as the count of that sample divided by the total number of sample outcomes that have been recorded by this FreqDist. The count of a sample is defined as the number of times that sample outcome was recorded by this FreqDist. Frequencies are always real numbers in the range [0, 1].
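As a rough illustration of the difference between a raw count and the ratio returned by freq(), here is a minimal sketch. The short token list is a made-up stand-in for the real tokenised corpus, and the variable names are only for illustration.

```python
from nltk.probability import FreqDist

# Hypothetical stand-in for the tokenised corpus (word_tokens in the question).
word_tokens = ["Valjean", "met", "Javert", "and", "Valjean", "left"]
fdist = FreqDist(word_tokens)

total = fdist.N()               # total number of tokens recorded
count = fdist["Valjean"]        # raw count of one sample (indexing, like a Counter)
ratio = fdist.freq("Valjean")   # count divided by total
missing = fdist["Cosette"]      # a sample that never occurs counts as 0

print(total, count, ratio, missing)
```

Indexing the distribution gives the number of occurrences, while freq() gives the share of the whole text, so which one to use depends on whether absolute or relative numbers are wanted.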

tabulate(*args, **kwargs)
Tabulate the given samples from the frequency distribution (cumulative), displaying the most frequent sample first. If an integer parameter is supplied, stop after this many samples have been plotted.
Parameters: samples (list) – The samples to plot (default is all samples)
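For the dictionary the question asks for, one possible approach is to loop over the character names and look each one up in the FreqDist, rather than looping over fdist_mis.items(), which visits every token in the corpus. The sketch below assumes a shortened character list and a made-up token list in place of the real data.

```python
from nltk.probability import FreqDist

# Shortened list and toy tokens, both hypothetical stand-ins for the real data.
character_list = ["Valjean", "Javert", "Cosette"]
word_tokens = ["Valjean", "met", "Javert", "and", "Valjean", "left"]

fdist_mis = FreqDist(word_tokens)

# Iterate over the names we care about; names that never occur get a count of 0.
character_counts = {name: fdist_mis[name] for name in character_list}
print(character_counts)

# The same names as a formatted table, using the samples parameter quoted above.
fdist_mis.tabulate(samples=character_list)
```

The lookup is an exact, case-sensitive match against the tokens, so a name only counts when the tokeniser produced it as a single token spelled exactly as in the list.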
