简体   繁体   中英

How do I get the most frequent words in a column of text based on the value of another column?

I have a dataset of tweets and the year they were posted. I want to get a count of the most frequently occurring words each year. My dataset looks like this:

year     tweet
2015     my car is blue
2015     mom is making dinner
2016     my hair is red
2016     i love my mom

I only know how to get the most frequently occurring words for the entire dataset:

pd.Series(' '.join(df['tweets']).split()).value_counts()

Which would give me this:

my      3
is      3
mom     2
car     1
blue    1
making  1
dinner  1
hair    1
red     1
i       1
love    1

So how would I get something like this?

2015

is      2
my      1
car     1
blue    1
mom     1
making  1
dinner  1

2016

my      2
hair    1
is      1
red     1
i       1
love    1
mom     1

I'd do something like this:

counts = df.set_index('year')['tweet'].str.split().explode().groupby(level=0).apply(pd.value_counts)

Output:

>>> counts
year        
2015  is        2
      my        1
      car       1
      blue      1
      mom       1
      making    1
      dinner    1
2016  my        2
      hair      1
      is        1
      red       1
      i         1
      love      1
      mom       1
Name: tweet, dtype: int6

To get the top, say, 5 items per year:

df.set_index('year')['tweet'].str.split().explode().groupby(level=0).apply(lambda x: x.value_counts().head(5))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM