Counting the total number of words in a corpus using NLTK's Conditional Frequency Distribution in Python (newbie)

Question

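I need to count the number of words (word occurrences) in some corpora using the NLTK package.

Here is my corpus:

corpus = PlaintextCorpusReader(r'C:\DeCorpus', '.*')

Here is how I tried to get the total number of words for each document: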

cfd_appr = nltk.ConditionalFreqDist(
    (textname, num_appr)
    for textname in corpus.fileids()
    for num_appr in [len(w) for w in corpus.raw(fileids=textname).replace("\r", " ").replace("\n", " ").split()])

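(I split the strings into words manually; somehow it works better than using corpus.words(), but the problem remains the same, so that is beside the point.) Generally, this does the same (wrong) job: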

cfd_appr = nltk.ConditionalFreqDist(
    (textname, num_appr)
    for textname in corpus.fileids()
    for num_appr in [len(w) for w in corpus.words(fileids=textname)])
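One possible reason the two variants give different counts: PlaintextCorpusReader tokenizes with a word/punctuation tokenizer by default, so punctuation marks become tokens of their own, whereas str.split() leaves them attached to the words. A minimal sketch of the difference, using only the standard library; the sample sentence is made up, and the regular expression is meant as an approximation of that tokenizer rather than the reader's exact behaviour:

```python
import re

# Hypothetical sample sentence, not taken from the corpus in the question.
text = "Hallo, Welt! Wie geht's?"

# Whitespace splitting keeps punctuation attached to the neighbouring word.
split_tokens = text.split()

# Word/punctuation style tokenizing: runs of word characters and runs of
# punctuation become separate tokens.
wordpunct_tokens = re.findall(r"\w+|[^\w\s]+", text)

print(len(split_tokens), split_tokens)
print(len(wordpunct_tokens), wordpunct_tokens)
```

If punctuation should not count as words, whitespace splitting (or filtering the tokens) is likely the closer match to a "word count".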

Here is what I get by typing cfd_appr.tabulate():

                        1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  
2022.12.06_Bild 2.txt   3  36 109  40  47  43  29  29  33  23  24  12   8   6   4   2   2   0   0   0   0   
2022.12.06_Bild 3.txt   2  42 129  59  57  46  46  35  22  24  17  21  13   5   6   6   2   2   2   0   0   
2022.12.06_Bild 4.txt   3  36 106  48  43  32  38  30  19  39  15  14  16   6   5   8   3   2   3   1   0   
2022.12.06_Bild 5.txt   1  55 162  83  68  72  46  24  34  38  27  16  12   8   8   5   9   3   1   5   1   
2022.12.06_Bild 6.txt   7  69 216  76 113  83  73  52  49  42  37  20  19   9   7   5   3   6   3   0   1   
2022.12.06_Bild 8.txt   0   2   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

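But these are the numbers of words of each length. What I need is just this (only one type of item, the text, should be counted by number of words):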

2022.12.06_Bild 2.txt    451.0
2022.12.06_Bild 3.txt    538.0
2022.12.06_Bild 4.txt    471.0
2022.12.06_Bild 5.txt    679.0
2022.12.06_Bild 6.txt    890.0
2022.12.06_Bild 8.txt      3.0
dtype: float64

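That is, the sum of all the words of different lengths (the column of sums produced by DataFrame(cfd_appr).transpose().sum(axis=1)). By the way, a way to give this column a name would also be a solution, but .rename({None: 'W. appear.'}, axis='columns') does not work, and that approach is generally not clear enough anyway.

So, what I need is: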

                             1    
2022.12.06_Bild 2.txt    451.0
2022.12.06_Bild 3.txt    538.0
2022.12.06_Bild 4.txt    471.0
2022.12.06_Bild 5.txt    679.0
2022.12.06_Bild 6.txt    890.0
2022.12.06_Bild 8.txt      3.0

I would be grateful for any help!

Answer 1

Let's first try to replicate your table using the infamous BookCorpus, with the directory structure:

/books_in_sentences
   books_large_p1.txt
   books_large_p2.txt

In code:

from nltk.corpus import PlaintextCorpusReader
from nltk import ConditionalFreqDist
from nltk import word_tokenize

from collections import Counter

import pandas as pd

corpus = PlaintextCorpusReader('books_in_sentences/', '.*')

cfd_appr = ConditionalFreqDist(
    (textname, num_appr)
    for textname in corpus.fileids()
    for num_appr in [len(w) for w in 
                     word_tokenize(corpus.raw(fileids=textname))])

Then the pandas munging part:

# Idiom to convert a FreqDist / ConditionalFreqDist into pd.DataFrame.
df = pd.DataFrame([dict(Counter(freqdist))
                   for freqdist in cfd_appr.values()],
                  index=list(cfd_appr.keys()))
# Fill in the not-applicable with zeros.
df = df.fillna(0).astype(int)

# If necessary, sort the columns (word lengths) in ascending order.
df = df.sort_index(axis=1)

# Sum all columns per row -> pd.Series
counts_per_row = df.sum(axis=1)

Finally, access the indexed Series, e.g.:

print('books_large_p1.txt', counts_per_row['books_large_p1.txt'])

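Alternatively

I would encourage the solution above, so that you can work with the DataFrame to manipulate the numbers further. But if all you really need is the sum across the columns for each row, try the following.

If you need to avoid pandas and use the values in the CFD directly, you have to use ConditionalFreqDist.values() and iterate through it carefully.

If we do this: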

>>> list(cfd_appr.values())
[FreqDist({3: 6, 6: 5, 1: 5, 9: 4, 4: 4, 2: 3, 8: 2, 10: 2, 7: 1, 14: 1}),
 FreqDist({4: 10, 3: 9, 1: 5, 7: 4, 2: 4, 5: 3, 6: 3, 11: 1, 9: 1})]

We see a list of FreqDist objects, each corresponding to a key (in this case, a file name):

>>> list(cfd_appr.keys())
['books_large_p1.txt', 'books_large_p2.txt']

Since we know that FreqDist is a subclass of collections.Counter, summing the values of each Counter object gives:

>>> [sum(fd.values()) for fd in cfd_appr.values()]
[33, 40]

This outputs the same values as df.sum(axis=1) above.

So, putting it together:

>>> dict(zip(cfd_appr.keys(), [sum(fd.values()) for fd in cfd_appr.values()]))
{'books_large_p1.txt': 33, 'books_large_p2.txt': 40}
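The same summing step can be sketched without NLTK or pandas, since a FreqDist behaves like a Counter. The file names and texts below are made up for illustration, and the defaultdict of Counters is only a stand-in for a ConditionalFreqDist, not the NLTK class itself:

```python
from collections import Counter, defaultdict

# Hypothetical in-memory "corpus"; file names and texts are made up.
texts = {
    "a.txt": "ein kleiner Test mit ein paar Worten",
    "b.txt": "noch ein Test",
}

# Stand-in for ConditionalFreqDist: condition -> Counter of word lengths.
length_counts = defaultdict(Counter)
for name, text in texts.items():
    for word in text.split():
        length_counts[name][len(word)] += 1

# Summing each Counter's values gives the total number of words per file.
totals = {name: sum(counter.values()) for name, counter in length_counts.items()}
print(totals)
```

With a real FreqDist, the N() method should return the same per-condition total as sum(fd.values()).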

Answer 2

Well, here is what was actually needed:

First, get the numbers of words of different lengths (just as I did before):

cfd_appr = nltk.ConditionalFreqDist(
    (textname, num_appr)
    for textname in corpus.fileids()
    for num_appr in [len(w) for w in corpus.raw(fileids=textname).replace("\r", " ").replace("\n", " ").split()])

Then add import pandas as pd, and call to_frame(1) on the dtype: float64 Series that I got by summing the columns:

pd.DataFrame(cfd_appr).transpose().sum(axis=1).to_frame(1)
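Because sum(axis=1) returns a Series rather than a DataFrame, there is no column label to rename at that point, which is probably why the .rename(..., axis='columns') attempt in the question had no effect. A small sketch of giving the column a descriptive name through to_frame() instead; the Counters below are hypothetical stand-ins for cfd_appr, and the file names are invented:

```python
from collections import Counter

import pandas as pd

# Hypothetical per-file word-length counts standing in for cfd_appr.
length_counts = {
    "a.txt": Counter({3: 4, 4: 2, 7: 1}),
    "b.txt": Counter({3: 1, 4: 2}),
}

# Rows = files, columns = word lengths; missing lengths become NaN.
table = pd.DataFrame(length_counts).transpose()

# sum(axis=1) skips NaN and returns a Series; to_frame() sets the column label.
totals = table.sum(axis=1).astype(int).to_frame("W. appear.")
print(totals)
```

The astype(int) call is optional; it only drops the .0 that appears because of the NaN entries.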

That's it. However, if anyone knows how to do the summing within the definition of cfd_appr itself, that would be a more elegant solution.
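One way to fold the summing into the counting step, sketched here under the assumption that only per-file totals are needed: record the file name once per token instead of pairing it with the word length. The example uses a plain Counter and made-up texts rather than a real corpus reader:

```python
from collections import Counter

# Hypothetical in-memory texts standing in for corpus.fileids() / corpus.raw().
texts = {
    "a.txt": "ein kleiner Test\nmit ein paar Worten",
    "b.txt": "noch ein\r\nTest",
}

# Record the file name once per token, so each count is a per-file word total.
# str.split() with no arguments already treats \r and \n as whitespace.
word_totals = Counter(
    textname
    for textname, raw in texts.items()
    for _ in raw.split()
)
print(word_totals)
```

With NLTK, the analogous expression would presumably be nltk.FreqDist(textname for textname in corpus.fileids() for _ in corpus.raw(fileids=textname).split()); this has not been run against the corpus in the question.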

Answer 1
1 2020-02-19 08:09:33

Answer 2
0 Accepted 2020-02-20 01:06:20