How to calculate most frequently occurring words in pandas dataframe column by year?
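I have a pandas DataFrame with a "reviews" column and a "year" column. I want to see the 100 most frequently occurring words in the reviews column, but filtered by year. So I want to know the top 100 for 2002, 2003, 2004, and so on, up to 2017.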
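import pandas as pd
from nltk.corpus import stopwords

df = pd.read_csv('./reviews.csv')
# stop-word list (loaded here, but not applied to the counts below)
stop = stopwords.words('english')
# join all reviews into one string, lowercase it, split on whitespace, count
commonwords = pd.Series(' '.join(df['reviews']).lower().split()).value_counts()[:100]
print(commonwords)
# write the word counts (not the original df) to disk; the index holds the words
commonwords.to_csv('commonwords.csv')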
The code above works, but it only gives the 100 most frequent words across all years combined.
You can use:
df = pd.DataFrame({'reviews': ['He writer in me great great me',
                               'great ambience the coffee was great',
                               'great coffee'],
                   'year': [2002, 2004, 2004]})
print(df)
                               reviews  year
0       He writer in me great great me  2002
1  great ambience the coffee was great  2004
2                         great coffee  2004
# change to 100 for the top 100 in real data
N = 3
df1 = (df.set_index('year')['reviews']
         .str.lower()
         .str.split(expand=True)
         .stack()
         .groupby(level=0)
         .value_counts()
         .groupby(level=0)
         .head(N)
         .rename_axis(('year', 'words'))
         .reset_index(name='count'))
print(df1)
   year     words  count
0  2002     great      2
1  2002        me      2
2  2002        he      1
3  2004     great      3
4  2004    coffee      2
5  2004  ambience      1
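The question loads a stop-word list but never applies it. If filtering is wanted, the following is a minimal sketch of one way to do it, using Series.str.split plus DataFrame.explode instead of stack. The tiny `stop` set here is a hypothetical stand-in so the example runs without NLTK data; with NLTK installed, `set(stopwords.words('english'))` from the question could presumably be used in its place.

```python
import pandas as pd

df = pd.DataFrame({'reviews': ['He writer in me great great me',
                               'great ambience the coffee was great',
                               'great coffee'],
                   'year': [2002, 2004, 2004]})

# Hypothetical, tiny stop-word set for illustration only
stop = {'he', 'in', 'me', 'the', 'was'}

N = 2  # change to 100 for the top 100 in real data

# one row per (year, word): split each review into a list, then explode it
words = (df.assign(words=df['reviews'].str.lower().str.split())
           .explode('words')
           .reset_index(drop=True))

# drop the stop words before counting
words = words[~words['words'].isin(stop)]

# count words within each year, then keep the first N rows per year
top = (words.groupby('year')['words']
            .value_counts()
            .groupby(level='year')
            .head(N)
            .reset_index(name='count'))
print(top)
```

Filtering before counting matters here: if the stop words were removed after taking the top N, each year could end up with fewer than N rows.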
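Explanation:
1. Convert the values to lowercase with Series.str.lower, then split them into a DataFrame (one column per word position) with Series.str.split and expand=True.
2. Reshape into a Series with a MultiIndex using DataFrame.stack.
3. Count the values within each year, sorted by frequency, with SeriesGroupBy.value_counts.
4. Get the top N values per year with GroupBy.head.
5. Clean up the result with rename_axis and reset_index (both called on the resulting Series).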
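To make the steps easier to follow, here is the same chain broken into named intermediates on the sample data. This is only a sketch: the variable names (`wide`, `words`, `counts`, `top`) are made up for illustration, and the explicit `dropna()` is an added safeguard, on the assumption that some pandas versions may keep missing values when stacking.

```python
import pandas as pd

df = pd.DataFrame({'reviews': ['He writer in me great great me',
                               'great ambience the coffee was great',
                               'great coffee'],
                   'year': [2002, 2004, 2004]})

N = 2

# Step 1: lowercase, then split into one column per word position;
# shorter reviews are padded with missing values
wide = df.set_index('year')['reviews'].str.lower().str.split(expand=True)

# Step 2: reshape to a long Series indexed by (year, word position);
# dropna() removes the padding if stack() has not already done so
words = wide.stack().dropna()

# Step 3: count each word within its year (index level 0),
# sorted by frequency inside every group
counts = words.groupby(level=0).value_counts()

# Step 4: keep the first N rows of every year
top = counts.groupby(level=0).head(N)

# Step 5: name the index levels and turn them into ordinary columns
result = top.rename_axis(('year', 'words')).reset_index(name='count')
print(result)
```

Note that words tied on count (such as "great" and "me" in 2002) may appear in either order, so head(N) can cut a tie arbitrarily at the N-th position.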