Count total number of words in a corpus using NLTK's Conditional Frequency Distribution in Python (newbie)

I need to count the number of words (word appearances) in some corpus using the NLTK package.

Here is my corpus:

corpus = PlaintextCorpusReader(r'C:\DeCorpus', '.*')

Here is how I try to get the total number of words for each document:

cfd_appr = nltk.ConditionalFreqDist(
    (textname, num_appr)
    for textname in corpus.fileids()
    for num_appr in [len(w) for w in corpus.raw(fileids=textname).replace("\r", " ").replace("\n", " ").split()])

(I split the strings into words manually; somehow it works better than using corpus.words(), but the problem remains the same, so it's irrelevant. A sketch of why the counts differ follows the next snippet.) Generally, this does the same (wrong) job:

cfd_appr = nltk.ConditionalFreqDist(
    (textname, num_appr)
    for textname in corpus.fileids()
    for num_appr in [len(w) for w in corpus.words(fileids=textname)])
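(Regarding the aside above: the counts likely differ because corpus.words() runs a real word tokenizer, WordPunctTokenizer by default for PlaintextCorpusReader, which emits punctuation marks as separate tokens, while a plain split() keeps them attached to the words. A minimal illustration, assuming only that NLTK is installed:)

from nltk.tokenize import WordPunctTokenizer

text = "Hello, world!"
print(text.split())                         # ['Hello,', 'world!'] -> 2 "words"
print(WordPunctTokenizer().tokenize(text))  # ['Hello', ',', 'world', '!'] -> 4 tokens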

This is what I get by typing cfd_appr.tabulate():

                        1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  
2022.12.06_Bild 2.txt   3  36 109  40  47  43  29  29  33  23  24  12   8   6   4   2   2   0   0   0   0   
2022.12.06_Bild 3.txt   2  42 129  59  57  46  46  35  22  24  17  21  13   5   6   6   2   2   2   0   0   
2022.12.06_Bild 4.txt   3  36 106  48  43  32  38  30  19  39  15  14  16   6   5   8   3   2   3   1   0   
2022.12.06_Bild 5.txt   1  55 162  83  68  72  46  24  34  38  27  16  12   8   8   5   9   3   1   5   1   
2022.12.06_Bild 6.txt   7  69 216  76 113  83  73  52  49  42  37  20  19   9   7   5   3   6   3   0   1   
2022.12.06_Bild 8.txt   0   2   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   

But these are the counts of words of each different length. What I need is just this (a single total word count per text):

2022.12.06_Bild 2.txt    451.0
2022.12.06_Bild 3.txt    538.0
2022.12.06_Bild 4.txt    471.0
2022.12.06_Bild 5.txt    679.0
2022.12.06_Bild 6.txt    890.0
2022.12.06_Bild 8.txt      3.0
dtype: float64

I.e. the sum over all word lengths (the row sums of the columns, which I computed with DataFrame(cfd_appr).transpose().sum(axis=1)). (By the way, if there is some way to set a name for that column, that would also be a solution, but .rename({None: 'W. appear.'}, axis='columns') is not working, and such a solution would generally not be clear enough.)

So, what I need is:

                             1    
2022.12.06_Bild 2.txt    451.0
2022.12.06_Bild 3.txt    538.0
2022.12.06_Bild 4.txt    471.0
2022.12.06_Bild 5.txt    679.0
2022.12.06_Bild 6.txt    890.0
2022.12.06_Bild 8.txt      3.0

Would be grateful for help!

Let's first try to replicate your table with the infamous BookCorpus, with the following directory structure:

/books_in_sentences
   books_large_p1.txt
   books_large_p2.txt

In code:

from nltk.corpus import PlaintextCorpusReader
from nltk import ConditionalFreqDist
from nltk import word_tokenize

from collections import Counter

import pandas as pd

corpus = PlaintextCorpusReader('books_in_sentences/', '.*')

cfd_appr = ConditionalFreqDist(
    (textname, num_appr)
    for textname in corpus.fileids()
    for num_appr in [len(w) for w in 
                     word_tokenize(corpus.raw(fileids=textname))])

Then the pandas munging part:

# Idiom to convert a FreqDist / ConditionalFreqDist into pd.DataFrame.
df = pd.DataFrame([dict(Counter(freqdist))
                   for freqdist in cfd_appr.values()],
                  index=cfd_appr.keys())
# Fill in the not-applicable with zeros.
df = df.fillna(0).astype(int)

# If necessary, sort the columns (word lengths) into ascending order.
df = df[sorted(df.columns)]

# Sum all columns per row -> pd.Series
counts_per_row = df.sum(axis=1)

Finally, to access the indexed Series, e.g.:

print('books_large_p1.txt', counts_per_row['books_large_p1.txt'])

Alternatively

I would encourage the solution above so that you can work with the DataFrame to manipulate the numbers further, but if all you really need is the count per row, then try the following.

If there's a need to avoid pandas and use the values in the CFD directly, then you would have to make use of ConditionalFreqDist.values() and iterate through it carefully.

If we do:

>>> list(cfd_appr.values())
[FreqDist({3: 6, 6: 5, 1: 5, 9: 4, 4: 4, 2: 3, 8: 2, 10: 2, 7: 1, 14: 1}),
 FreqDist({4: 10, 3: 9, 1: 5, 7: 4, 2: 4, 5: 3, 6: 3, 11: 1, 9: 1})]

We'll see a list of FreqDist objects, one for each key (in this case, the filenames):

>>> list(cfd_appr.keys())
['books_large_p1.txt', 'books_large_p2.txt']

Since we know that FreqDist is a subclass of collections.Counter, if we sum the values of each Counter object, we will get:

>>> [sum(fd.values()) for fd in cfd_appr.values()]
[33, 40]

This outputs the same values as df.sum(axis=1) above.

So to put it together:

>>> dict(zip(cfd_appr.keys(), [sum(fd.values()) for fd in cfd_appr.values()]))
{'books_large_p1.txt': 33, 'books_large_p2.txt': 40}
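Equivalently, since FreqDist exposes an N() method that returns the total number of samples counted, a dict comprehension over the conditions gives the same result:

>>> {textname: cfd_appr[textname].N() for textname in cfd_appr}
{'books_large_p1.txt': 33, 'books_large_p2.txt': 40}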

Well, here is what was actually needed:

First, get the numbers of words of different lengths (just as I did before):

cfd_appr = nltk.ConditionalFreqDist(
    (textname, num_appr)
    for textname in corpus.fileids()
    for num_appr in [len(w) for w in corpus.raw(fileids=textname).replace("\r", " ").replace("\n", " ").split()])

Then add import pandas as pd and append .to_frame(1) to the dtype: float64 Series that I got by summing the columns:

pd.DataFrame(cfd_appr).transpose().sum(axis=1).to_frame(1)

That's it. However, if somebody knows how to sum them up in the definition of cfd_appr, that would be a more elegant solution.
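A possible refinement of both points, sketched under the same corpus object as above (untested against the original files): to_frame() accepts a name= argument, which also resolves the column-naming aside from the question, and the per-file totals can be computed directly without building the length distribution at all. Note that a bare str.split() already splits on \r and \n, so the replace() calls are not strictly needed.

import pandas as pd

# Name the summed column instead of calling it 1:
pd.DataFrame(cfd_appr).transpose().sum(axis=1).to_frame(name='W. appear.')

# Or skip the length distribution entirely and count tokens per file:
pd.Series({textname: len(corpus.raw(fileids=textname).split())
           for textname in corpus.fileids()},
          name='W. appear.')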
