Pandas - groupby where each row has multiple values stored in a list

Question

I'm working with last.fm listening data and have a DataFrame that looks like this:

           Artist Plays                                   Genres
0   John Coltrane    10             [jazz, modal jazz, hard bop]
1     Miles Davis    15  [jazz, cool jazz, modal jazz, hard bop]
2  Charlie Parker    20                            [jazz, bebop]

I want to group the data by genre and then sum the plays for each genre, to get something like this:

        Genre Plays
0        jazz    45
1  modal jazz    25
2    hard bop    25
3       bebop    20
4   cool jazz    15

I've been trying to figure this out for a while now but can't seem to find a solution. Do I need to change the way that the genre data is stored?

I was able to find this post which addresses a similar question, but that user was only looking to get the count of each list value. This gets me about halfway there, but I couldn't figure out how to use that to aggregate another column in the DataFrame.

Answer 1

In general, you should not store lists in a DataFrame, so yes, it is probably best to change how they are stored. If you keep the current layout, you can combine join + str.get_dummies + .multiply: join each list into a single delimited string, expand it into one 0/1 indicator column per genre, multiply each row by its Plays value, and sum the columns. Choose a sep that doesn't appear in any of your strings.

sep = '*'
df.Genres.apply(sep.join).str.get_dummies(sep=sep).multiply(df.Plays, axis=0).sum()

Output

bebop         20
cool jazz     15
hard bop      25
jazz          45
modal jazz    25
dtype: int64
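
To make the one-liner above easier to follow, here is a minimal self-contained sketch that rebuilds the DataFrame from the question and runs the same three steps one at a time. The intermediate names (joined, dummies, plays_per_genre) are only illustrative and are not part of the original answer.

```python
import pandas as pd

# Rebuild the example data from the question.
df = pd.DataFrame({
    "Artist": ["John Coltrane", "Miles Davis", "Charlie Parker"],
    "Plays": [10, 15, 20],
    "Genres": [
        ["jazz", "modal jazz", "hard bop"],
        ["jazz", "cool jazz", "modal jazz", "hard bop"],
        ["jazz", "bebop"],
    ],
})

sep = "*"  # assumption: this character never occurs inside a genre name

# Step 1: join each list into one delimited string,
# e.g. "jazz*modal jazz*hard bop".
joined = df["Genres"].apply(sep.join)

# Step 2: split on the separator into one 0/1 indicator column per genre.
dummies = joined.str.get_dummies(sep=sep)

# Step 3: scale each row's indicators by that row's play count,
# then add up every genre column.
plays_per_genre = dummies.multiply(df["Plays"], axis=0).sum()

print(plays_per_genre)
```

Splitting the chain this way also makes it easy to inspect dummies and confirm that no genre name was cut in two by the separator.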

An easier form to work with would be to have each list element on its own row, as in:

import pandas as pd
df1 = pd.concat([pd.DataFrame(df.Genres.values.tolist()).stack().reset_index(1, drop=True).to_frame('Genres'),
                 df[['Plays', 'Artist']]], axis=1)

       Genres  Plays          Artist
0        jazz     10   John Coltrane
0  modal jazz     10   John Coltrane
0    hard bop     10   John Coltrane
1        jazz     15     Miles Davis
1   cool jazz     15     Miles Davis
1  modal jazz     15     Miles Davis
1    hard bop     15     Miles Davis
2        jazz     20  Charlie Parker
2       bebop     20  Charlie Parker

This makes it a simple sum within genres:

df1.groupby('Genres').Plays.sum()

Genres
bebop         20
cool jazz     15
hard bop      25
jazz          45
modal jazz    25
Name: Plays, dtype: int64
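
As an alternative that is not part of the original answer: pandas 0.25 and later provide DataFrame.explode, which produces the same one-genre-per-row layout directly. The sketch below assumes such a version is installed; the names long_df and genre_plays are only illustrative. It also sorts and renames the result to match the Genre / Plays table asked for in the question.

```python
import pandas as pd

# Rebuild the example data from the question.
df = pd.DataFrame({
    "Artist": ["John Coltrane", "Miles Davis", "Charlie Parker"],
    "Plays": [10, 15, 20],
    "Genres": [
        ["jazz", "modal jazz", "hard bop"],
        ["jazz", "cool jazz", "modal jazz", "hard bop"],
        ["jazz", "bebop"],
    ],
})

# explode() gives every list element its own row and repeats
# Artist and Plays for each of them (requires pandas >= 0.25).
long_df = df.explode("Genres")

# Sum plays per genre, sort from most to least played, and turn the
# result back into a two-column DataFrame named Genre / Plays.
genre_plays = (
    long_df.groupby("Genres")["Plays"]
    .sum()
    .sort_values(ascending=False)
    .rename_axis("Genre")
    .reset_index()
)

print(genre_plays)
```

Genres with equal totals (modal jazz and hard bop here) may appear in either order after sorting.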
