Pandas - groupby where each row has multiple values stored in list
I'm working with last.fm listening data and have a DataFrame that looks like this:
           Artist  Plays                                   Genres
0   John Coltrane     10             [jazz, modal jazz, hard bop]
1     Miles Davis     15  [jazz, cool jazz, modal jazz, hard bop]
2  Charlie Parker     20                            [jazz, bebop]
I want to group the data by genre and then sum the plays for each genre, to get something like this:
        Genre  Plays
0        jazz     45
1  modal jazz     25
2    hard bop     25
3       bebop     20
4   cool jazz     15
I've been trying to figure this out for a while now but can't seem to find a solution. Do I need to change the way the genre data is stored?
I was able to find this post which addresses a similar question, but that user was only looking to get the count of each list value. That gets me about halfway there, but I couldn't figure out how to use it to aggregate another column in the DataFrame.
In general, you should not store lists in a DataFrame, so yes, it is probably best to change how they are stored. With the data as it stands, you can still get the totals by combining join + str.get_dummies + .multiply. Choose a sep that doesn't appear in any of your strings.
sep = '*'
df.Genres.apply(sep.join).str.get_dummies(sep=sep).multiply(df.Plays, axis=0).sum()
bebop 20
cool jazz 15
hard bop 25
jazz 45
modal jazz 25
dtype: int64
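For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample frame from the question and unpacks the one-liner into steps. The intermediate names (joined, indicators, weighted, plays_by_genre) are just illustrative, and the final sort is an extra step added only to match the ordering asked for in the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Artist": ["John Coltrane", "Miles Davis", "Charlie Parker"],
    "Plays": [10, 15, 20],
    "Genres": [
        ["jazz", "modal jazz", "hard bop"],
        ["jazz", "cool jazz", "modal jazz", "hard bop"],
        ["jazz", "bebop"],
    ],
})

sep = "*"  # must not occur inside any genre name

# 1. Turn each list into a single delimited string, e.g. "jazz*bebop".
joined = df["Genres"].apply(sep.join)

# 2. One 0/1 indicator column per distinct genre.
indicators = joined.str.get_dummies(sep=sep)

# 3. Scale each row of indicators by that row's play count.
weighted = indicators.multiply(df["Plays"], axis=0)

# 4. Column sums are the total plays per genre; sort descending (illustrative extra).
plays_by_genre = weighted.sum().sort_values(ascending=False)
print(plays_by_genre)
```

The axis=0 argument matters: it makes multiply align df["Plays"] with the rows of the indicator table rather than with its columns.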
An easier form to work with would be one where each list is split across rows, as in:
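import pandas as pd

# Expand each list into columns, stack them into one genre per row,
# and keep the original row number as the index.
genres = pd.DataFrame(df.Genres.values.tolist()).stack().reset_index(1, drop=True).to_frame('Genres')
# DataFrame.join aligns on that (repeated) row index to attach Plays and Artist.
df1 = genres.join(df[['Plays', 'Artist']])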
       Genres  Plays          Artist
0        jazz     10   John Coltrane
0  modal jazz     10   John Coltrane
0    hard bop     10   John Coltrane
1        jazz     15     Miles Davis
1   cool jazz     15     Miles Davis
1  modal jazz     15     Miles Davis
1    hard bop     15     Miles Davis
2        jazz     20  Charlie Parker
2       bebop     20  Charlie Parker
That makes it a simple sum within genres:
df1.groupby('Genres').Plays.sum()
Genres
bebop 20
cool jazz 15
hard bop 25
jazz 45
modal jazz 25
Name: Plays, dtype: int64
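As a side note that goes beyond the original answer: if you are on pandas 0.25 or later, DataFrame.explode can do the same list-to-rows reshaping in one call. The sketch below assumes the sample frame from the question; the names exploded and plays_by_genre are only for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Artist": ["John Coltrane", "Miles Davis", "Charlie Parker"],
    "Plays": [10, 15, 20],
    "Genres": [
        ["jazz", "modal jazz", "hard bop"],
        ["jazz", "cool jazz", "modal jazz", "hard bop"],
        ["jazz", "bebop"],
    ],
})

# One row per (artist, genre); Artist and Plays are repeated for each genre.
exploded = df.explode("Genres")

# Sum plays within each genre, largest first.
plays_by_genre = (
    exploded.groupby("Genres")["Plays"]
    .sum()
    .sort_values(ascending=False)
)
print(plays_by_genre)
```

This keeps the lists in the stored frame, so it is a convenience for aggregation rather than a reason to keep storing lists.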