
Pandas - groupby where each row has multiple values stored in list

I'm working with last.fm listening data and have a DataFrame that looks like this:

           Artist Plays                                   Genres
0   John Coltrane    10             [jazz, modal jazz, hard bop]
1     Miles Davis    15  [jazz, cool jazz, modal jazz, hard bop]
2  Charlie Parker    20                            [jazz, bebop]
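
For anyone who wants to reproduce the problem, the frame above can be rebuilt roughly like this. This is a minimal sketch that assumes the Genres column holds real Python lists (not strings that merely look like lists); the values are copied from the table.

```python
import pandas as pd

# Sample data copied from the table above.
# Assumption: each Genres cell is an actual Python list of strings.
df = pd.DataFrame({
    "Artist": ["John Coltrane", "Miles Davis", "Charlie Parker"],
    "Plays": [10, 15, 20],
    "Genres": [
        ["jazz", "modal jazz", "hard bop"],
        ["jazz", "cool jazz", "modal jazz", "hard bop"],
        ["jazz", "bebop"],
    ],
})

print(df)
```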

I want to group the data by genre and then sum the plays for each genre, to get something like this:

        Genre Plays
0        jazz    45
1  modal jazz    25
2    hard bop    25
3       bebop    20
4   cool jazz    15

I've been trying to figure this out for a while now but can't seem to find a solution. Do I need to change the way the genre data is stored?

I was able to find this post, which addresses a similar question, but that user was only looking to get the count of each list value. This gets me about halfway there, but I couldn't figure out how to use that to aggregate another column in the DataFrame.

In general, you should not store lists in a DataFrame, so yes, it is probably best to change how they are stored. With the data as it is, you can combine join + str.get_dummies + .multiply. Choose a sep that doesn't appear in any of your strings.

sep = '*'
df.Genres.apply(sep.join).str.get_dummies(sep=sep).multiply(df.Plays, axis=0).sum()

Output

bebop         20
cool jazz     15
hard bop      25
jazz          45
modal jazz    25
dtype: int64
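
If the one-liner above is hard to follow, it may help to see it split into steps. This is only a sketch: the sample frame is rebuilt from the question, and the intermediate names (joined, dummies, weighted, totals) are made up for illustration.

```python
import pandas as pd

# Sample frame rebuilt from the question (assumes Genres holds Python lists).
df = pd.DataFrame({
    "Artist": ["John Coltrane", "Miles Davis", "Charlie Parker"],
    "Plays": [10, 15, 20],
    "Genres": [
        ["jazz", "modal jazz", "hard bop"],
        ["jazz", "cool jazz", "modal jazz", "hard bop"],
        ["jazz", "bebop"],
    ],
})

sep = "*"  # must not occur inside any genre name

# 1. Turn each list into one delimited string, e.g. "jazz*modal jazz*hard bop".
joined = df["Genres"].apply(sep.join)

# 2. One 0/1 indicator column per distinct genre.
dummies = joined.str.get_dummies(sep=sep)

# 3. Scale each row by its play count, so every 1 becomes that artist's Plays.
weighted = dummies.multiply(df["Plays"], axis=0)

# 4. Column sums are the total plays per genre.
totals = weighted.sum()
print(totals)
```

Each artist's plays are counted once for every genre in that artist's list, which is why the genre totals add up to more than the 45 plays in the original frame.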

An easier form to work with is one where each list element gets its own row, as in:

import pandas as pd
df1 = pd.concat([pd.DataFrame(df.Genres.values.tolist()).stack().reset_index(1, drop=True).to_frame('Genres'),
                 df[['Plays', 'Artist']]], axis=1)

       Genres  Plays          Artist
0        jazz     10   John Coltrane
0  modal jazz     10   John Coltrane
0    hard bop     10   John Coltrane
1        jazz     15     Miles Davis
1   cool jazz     15     Miles Davis
1  modal jazz     15     Miles Davis
1    hard bop     15     Miles Davis
2        jazz     20  Charlie Parker
2       bebop     20  Charlie Parker

That makes it a simple sum within genres:

df1.groupby('Genres').Plays.sum()

Genres
bebop         20
cool jazz     15
hard bop      25
jazz          45
modal jazz    25
Name: Plays, dtype: int64
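
On newer pandas there is a shorter route to the same long format. The sketch below assumes pandas 0.25 or later, where DataFrame.explode is available; it is an alternative to the concat/stack reshaping above, not part of the original answer. The sample frame is rebuilt from the question, and the final sort and rename are only there to match the layout the question asked for.

```python
import pandas as pd

# Sample frame rebuilt from the question (assumes Genres holds Python lists).
df = pd.DataFrame({
    "Artist": ["John Coltrane", "Miles Davis", "Charlie Parker"],
    "Plays": [10, 15, 20],
    "Genres": [
        ["jazz", "modal jazz", "hard bop"],
        ["jazz", "cool jazz", "modal jazz", "hard bop"],
        ["jazz", "bebop"],
    ],
})

# explode() repeats each row once per list element (needs pandas >= 0.25).
exploded = df.explode("Genres")

# Sum plays within each genre.
totals = exploded.groupby("Genres")["Plays"].sum()

# Optional: highest totals first, with Genre/Plays columns as in the question.
result = (
    totals.sort_values(ascending=False)
          .rename_axis("Genre")
          .reset_index()
)
print(result)
```

Genres with equal totals (hard bop and modal jazz here) may come out in either order after the sort.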


 