How do I count the most frequent words in a pandas DataFrame column, by year?

Question

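I have a pandas DataFrame with a "reviews" column and a "year" column. I want to see the 100 most frequent words in the reviews column, but broken down by year. That is, I want the top 100 for 2002, 2003, 2004, and so on through 2017.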

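import pandas as pd
from nltk.corpus import stopwords

df = pd.read_csv('./reviews.csv')

# English stop words (loaded here, but not yet applied to the counts)
stop = stopwords.words('english')

# Join all reviews into one string, lowercase it, split on whitespace,
# then count each word; value_counts() sorts by descending frequency
commonwords = pd.Series(' '.join(df['reviews']).lower().split()).value_counts()[:100]

print(commonwords)

# Save the word counts, keeping the words (the index) in the output
commonwords.to_csv('commonwords.csv')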

The code above works, but it only gives the 100 most frequent words across all years combined.

Answer 1

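Before building commonwords, you can use a groupby operation to create another DataFrame, for example df.groupby(['year', 'reviews']). Then apply reset_index so that you can use the result to filter for the top 100.

Besides resetting the index, you can also refer to the answers to this question for further ideas.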
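One possible reading of this suggestion, offered as a rough sketch rather than the answerer's exact method: group the rows by year only, then reuse the question's join/split/value_counts idea inside each group. The helper name top_words_by_year and the sample data are made up for illustration; the column names reviews and year are taken from the question.

```python
import pandas as pd


def top_words_by_year(df, n=100):
    """Return the n most frequent words per year as a tidy DataFrame.

    Hypothetical helper: assumes df has a text column 'reviews'
    and a column 'year', as in the question.
    """
    frames = []
    for year, reviews in df.groupby('year')['reviews']:
        # Same idea as the question's one-liner, applied to one year at a time;
        # dropna() skips missing reviews so that join() does not fail
        text = ' '.join(reviews.dropna()).lower()
        counts = pd.Series(text.split()).value_counts().head(n)
        frames.append(pd.DataFrame({'year': year,
                                    'words': counts.index,
                                    'count': counts.values}))
    return pd.concat(frames, ignore_index=True)


# Made-up sample data for illustration
sample = pd.DataFrame({'reviews': ['great coffee great',
                                   'bad coffee',
                                   'bad bad coffee service'],
                       'year': [2002, 2003, 2003]})
result = top_words_by_year(sample, n=2)
print(result)
```

The result has one row per (year, word) pair, so a single year can be pulled out afterwards with an ordinary filter such as result[result['year'] == 2003].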

Answer 2

You can use:

df = pd.DataFrame({'reviews':['He writer in me great great me',
                        'great ambience the coffee was great',
                        'great coffee'],
                   'year':[2002,2004,2004]})
print (df)

                               reviews  year
0       He writer in me great great me  2002
1  great ambience the coffee was great  2004
2                         great coffee  2004

# change to 100 for the top 100 in real data
N = 3
df1 =  (df.set_index('year')['reviews']
          .str.lower()
          .str.split(expand=True)
          .stack()
          .groupby(level=0)
          .value_counts()
          .groupby(level=0)
          .head(N)
          .rename_axis(('year','words'))
          .reset_index(name='count'))

print (df1)
   year     words  count
0  2002     great      2
1  2002        me      2
2  2002        he      1
3  2004     great      3
4  2004    coffee      2
5  2004  ambience      1

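Explanation:

Convert the values to lowercase with Series.str.lower, then split them into a DataFrame with Series.str.split
Reshape into a Series with a MultiIndex using DataFrame.stack
Count the values with SeriesGroupBy.value_counts, which also sorts them
Take the top N values of each group with GroupBy.head
Clean up the result with Series.rename_axis and Series.reset_index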
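The steps above can also be written one at a time with intermediate variables, which makes each stage easy to inspect. The sketch below additionally filters out stop words before counting, since the question loads a stop-word list but never applies it. That filter is an addition, not part of the answer, and the small stop set is a stand-in assumption for NLTK's English list.

```python
import pandas as pd

df = pd.DataFrame({'reviews': ['He writer in me great great me',
                               'great ambience the coffee was great',
                               'great coffee'],
                   'year': [2002, 2004, 2004]})

# Stand-in stop-word list (assumption); with NLTK installed you could use
# set(stopwords.words('english')) as in the question
stop = {'he', 'in', 'me', 'the', 'was'}

N = 3  # change to 100 for real data

# 1. Lowercase and split: one row per review, one column per word position
words_wide = df.set_index('year')['reviews'].str.lower().str.split(expand=True)

# 2. Reshape to one word per row, indexed by (year, word position);
#    dropna() guards against padding left over from shorter reviews
words_long = words_wide.stack().dropna()

# Extra step, not in the answer: drop stop words before counting
words_long = words_long[~words_long.isin(stop)]

# 3. Count words within each year, most frequent first
counts = words_long.groupby(level=0).value_counts()

# 4. Keep the first N rows of each year
top = counts.groupby(level=0).head(N)

# 5. Name the index levels and turn them into ordinary columns
df1 = top.rename_axis(('year', 'words')).reset_index(name='count')
print(df1)
```

Printing words_wide, words_long and counts in turn shows how the shape changes at each step. Words that tie at the cutoff are kept in whatever order value_counts returns them, so the last few entries of a top-N list may vary.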

Answer 1
1 2019-07-19 06:15:52

Answer 2
1 Accepted 2019-07-19 11:51:06
