How to count the 500 most common words in a pandas dataframe
I have a dataframe with 500 texts in a column called Text (one text per row), and I want to count the most common words across all texts.
So far I have tried (both methods from Stack Overflow):
pd.Series(' '.join(df['Text']).lower().split()).value_counts()[:100]
and
Counter(" ".join(df["Text"]).split()).most_common(100)
Both gave me the following error:
TypeError: sequence item 0: expected str instance, list found
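I think this happens because `' '.join` expects an iterable of strings, while every row of Text is already a tokenized list. A minimal reproduction with made-up tokens:

```python
import pandas as pd

# Each row of 'Text' is a list of tokens, not a string,
# so str.join raises the TypeError above.
df = pd.DataFrame({'Text': [['business', 'date', 'jan'], ['group', 'english']]})
try:
    ' '.join(df['Text'])
except TypeError as e:
    print(e)  # sequence item 0: expected str instance, list found
```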
I have also tried the Counter method on its own with
df.Text.apply(Counter)
which gave me the word count in each text; I also altered the Counter method so it returned the most common words in each text.
But I want the overall most common words.
Here is a sample of the dataframe (the text is already lowercased, cleaned of punctuation, tokenized, and stop words have been removed):
Datum File File_type Text length len_cleaned_text
Datum
2000-01-27 2000-01-27 _04.txt _04 [business, date, jan, heineken, starts, integr... 396 220
Edit: code to reproduce it:
for file in file_list:
    name = file[len(input_path):]
    date = name[11:17]
    type_1 = name[17:20]
    with open(file, "r", encoding="utf-8", errors="surrogateescape") as rfile:
        text = rfile.read()
        text = text.encode('utf-8', 'ignore')
        text = text.decode('utf-8', 'ignore')
    # count_the and length are computed elsewhere
    a = {"File": name, "Text": text, 'the': count_the, 'Datum': date, 'File_type': type_1, 'length': length}
    result_list.append(a)
New cell:
df['Text'] = df['Text'].str.lower()
p = re.compile(r'[^\w\s]+')
d = re.compile(r'\d+')
# the iterrows loop was redundant: these column-wise operations only need to run once
df['Text'] = df['Text'].str.replace('\n', ' ')
df['Text'] = df['Text'].str.replace('################################ end of story 1 ##############################', '')
df['Text'] = [p.sub('', x) for x in df['Text'].tolist()]
df['Text'] = [d.sub('', x) for x in df['Text'].tolist()]
df['Text'] = df['Text'].apply(word_tokenize)
Datum File File_type Text length the
Datum
2000-01-27 2000-01-27 0864820040_000127_04.txt _04 [business, date, jan, heineken, starts, integr... 396 0
2000-02-01 2000-02-01 0910068040_000201_04.txt _04 [group, english, cns, date, feb, bat, acquisit... 305 0
2000-05-03 2000-05-03 1070448040_000503_04.txt _04 [date, may, cobham, plc, cob, acquisitionsdisp... 701 0
2000-05-11 2000-05-11 0865985020_000511_04.txt _04 [business, date, may, swedish, match, complete... 439 0
2000-11-28 2000-11-28 1067252020_001128_04.txt _04 [date, nov, intec, telecom, sys, itl, doc, pla... 158 0
2000-12-18 2000-12-18 1963867040_001218_04.txt _04 [associated, press, apw, date, dec, volvo, div... 367 0
2000-12-19 2000-12-19 1065767020_001219_04.txt _04 [date, dec, spirent, plc, spt, acquisition, co... 414 0
2000-12-21 2000-12-21 1076829040_001221_04.txt _04 [bloomberg, news, bn, date, dec, eni, ceo, cfo... 271 0
2001-02-06 2001-02-06 1084749020_010206_04.txt _04 [date, feb, chemring, group, plc, chg, acquisi... 130 0
2001-02-15 2001-02-15 1063497040_010215_04.txt _04 [date, feb, electrolux, ab, elxb, acquisition,... 420 0
And a description of the dataframe:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 557 entries, 2000-01-27 to 2017-10-06
Data columns (total 13 columns):
Datum 557 non-null datetime64[ns]
File 557 non-null object
File_type 557 non-null object
Text 557 non-null object
customers 557 non-null int64
grwoth 557 non-null int64
human 557 non-null int64
intagibles 557 non-null int64
length 557 non-null int64
synergies 557 non-null int64
technology 557 non-null int64
the 557 non-null int64
len_cleaned_text 557 non-null int64
dtypes: datetime64[ns](1), int64(9), object(3)
memory usage: 60.9+ KB
Thanks in advance.
OK, I got it. Your
df['Text']
consists of lists of texts, so you can do this:
full_list = [] # list containing all words of all texts
for elmnt in df['Text']: # loop over lists in df
full_list += elmnt # append elements of lists to full list
val_counts = pd.Series(full_list).value_counts() # make temporary Series to count
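For example, with a small made-up frame in the shape of your Text column:

```python
import pandas as pd

# Hypothetical sample resembling the question's tokenized Text column
df = pd.DataFrame({'Text': [['heineken', 'starts', 'integration'],
                            ['bat', 'acquisition', 'starts']]})

full_list = []                  # list containing all words of all texts
for elmnt in df['Text']:        # loop over lists in df
    full_list += elmnt          # append elements of lists to full list

val_counts = pd.Series(full_list).value_counts()
print(val_counts.head(1))       # 'starts' occurs twice, everything else once
```

Slice `val_counts[:100]` (or `val_counts[:500]`) for the top words.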
This solution avoids using too many list comprehensions and thus keeps the code easy to read and understand. Furthermore, no additional modules like
re
or collections
are needed.
Here is my version: I convert the column values into a list, then build a flat list of words, clean it, and you have your counter:
import ast
import collections

your_text_list = df['Text'].tolist()
your_text_list_nan_rm = [x for x in your_text_list if str(x) != 'nan']
flat_list = [inner for item in your_text_list_nan_rm for inner in ast.literal_eval(item)]
counter = collections.Counter(flat_list)
top_words = counter.most_common(100)
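Note that `ast.literal_eval` is only needed if the lists were at some point serialized to strings (for example after a CSV round-trip). If, as in the question, the column holds real Python lists, you can flatten directly; a sketch with made-up data:

```python
import collections
import pandas as pd

# Hypothetical data: two token lists plus a NaN row
df = pd.DataFrame({'Text': [['date', 'may', 'cobham'],
                            ['date', 'nov', 'intec'],
                            float('nan')]})

your_text_list = df['Text'].tolist()
no_nan = [x for x in your_text_list if isinstance(x, list)]  # drop NaN rows
flat_list = [word for doc in no_nan for word in doc]          # flatten

top_words = collections.Counter(flat_list).most_common(100)
print(top_words[0])  # ('date', 2)
```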
You can do it via the
apply
and Counter.update
methods:
from collections import Counter
import pandas as pd

counter = Counter()
df = pd.DataFrame({'Text': values})  # values: a list of token lists
_ = df['Text'].apply(lambda x: counter.update(x))
counter.most_common(10)
Out:
[('Amy', 3), ('was', 3), ('hated', 2),
('Kamal', 2), ('her', 2), ('and', 2),
('she', 2), ('She', 2), ('sent', 2), ('text', 2)]
Where
df['Text']
is:
0 [Amy, normally, hated, Monday, mornings, but, ...
1 [Kamal, was, in, her, art, class, and, she, li...
2 [She, was, waiting, outside, the, classroom, w...
3 [Hi, Amy, Your, mum, sent, me, a, text]
4 [You, forgot, your, inhaler]
5 [Why, don’t, you, turn, your, phone, on, Amy, ...
6 [She, never, sent, text, messages, and, she, h...
Name: Text, dtype: object
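A self-contained version of the same idea, with made-up token lists standing in for `values`:

```python
from collections import Counter
import pandas as pd

values = [['business', 'date', 'jan'],   # hypothetical token lists
          ['group', 'date', 'feb']]

counter = Counter()
df = pd.DataFrame({'Text': values})
_ = df['Text'].apply(counter.update)     # update the shared counter per row

print(counter.most_common(1))  # [('date', 2)]
```

Note that `counter` accumulates across calls, so re-running `apply` on the same column doubles the counts; start from a fresh `Counter()` each time.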