Split a pipe-delimited Series, group by a separate Series, and return the count of each split value in new columns

Question

Given a dataframe with a pipe-delimited series:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'year': [1960, 1960, 1961, 1961, 1961],
                   'genre': ['Drama|Romance|Thriller',
                             'Spy|Mystery|Bio',
                             'Drama|Romance',
                             'Drama|Romance',
                             'Drama|Spy']})

or, shown as data:

   year                   genre
0  1960  Drama|Romance|Thriller
1  1960         Spy|Mystery|Bio
2  1961           Drama|Romance
3  1961           Drama|Romance
4  1961               Drama|Spy

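I can split the genre series with str.split (as shown in many similar SO questions).

But I also want to group by year and return, in new columns, the count of Drama, Romance, Thriller, etc. for each unique year.

My initial attempt: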

df_split = df.groupby('year')['genre'].apply(lambda x: x.str.split('|', expand=True).reset_index(drop=True))

returns

            0        1         2
year                            
1960 0  Drama  Romance  Thriller
     1    Spy  Mystery       Bio
1961 0  Drama  Romance       NaN
     1  Drama  Romance       NaN
     2  Drama      Spy       NaN

But how do I get the count of each unique genre, in its own column, by year?

I can get the unique genres with

genres = pd.unique(df['genre'].str.split('|', expand=True).stack())

but I'm still not sure how to turn the genres into separate series, counted by year.

My desired final output is:

      Drama  Romance  Thriller  Spy  Mystery  Bio
1960      1        1         1    1        1    1
1961      3        2         0    1        0    0

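where each unique genre has its own series, with the corresponding count per year.

This may well be an XY problem. My end goal is to make a percentage stacked area plot. Assuming df_split has the desired transformation, I'd like to do: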

df_perc = df_split.divide(df_split.sum(axis=1), axis=0)

which returns

         Drama   Romance  Thriller       Spy   Mystery       Bio
1960  0.166667  0.166667  0.166667  0.166667  0.166667  0.166667
1961  0.500000  0.333333  0.000000  0.166667  0.000000  0.000000
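
To make the row-wise normalization concrete, here is a minimal sketch that builds df_split by hand from the desired count table above and applies the same divide call. The name row_totals is introduced here only for illustration.

```python
import pandas as pd

# Count table copied from the desired output in the question
df_split = pd.DataFrame(
    {'Drama': [1, 3], 'Romance': [1, 2], 'Thriller': [1, 0],
     'Spy': [1, 1], 'Mystery': [1, 0], 'Bio': [1, 0]},
    index=[1960, 1961])

# Total number of genre tags per year (one value per row)
row_totals = df_split.sum(axis=1)

# axis=0 aligns row_totals with the row index, so every row is
# divided by its own total and each row then sums to 1
df_perc = df_split.divide(row_totals, axis=0)
print(df_perc)
```

Because each row sums to 1, the stacked areas always fill the full height of the plot.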

and then

# DataFrame.iteritems() was removed in pandas 2.0; items() is the equivalent
plt.stackplot(df_perc.index, *[ts for col, ts in df_perc.items()],
                               labels=df_perc.columns)
plt.gca().set_xticks(df_perc.index)
plt.margins(0)
plt.legend()

which gives the output: (image not included)

Answer 1

We can get the result you want with some simple reshaping and aggregation:

(df.assign(genre=df['genre'].str.split('|'))
   .explode('genre')
   .groupby('year')['genre']
   .value_counts(normalize=True)
   .unstack(fill_value=0))     
 
genre       Bio     Drama   Mystery   Romance       Spy  Thriller
year                                                             
1960   0.166667  0.166667  0.166667  0.166667  0.166667  0.166667
1961   0.000000  0.500000  0.000000  0.333333  0.166667  0.000000
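
The table above holds per-year proportions rather than the raw counts shown in the question's desired output. One way to get the counts is to cross-tabulate the exploded frame. This is a minimal sketch on the question's df; pd.crosstab is swapped in here and is not part of the original answer, and the names exploded and counts are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'year': [1960, 1960, 1961, 1961, 1961],
                   'genre': ['Drama|Romance|Thriller',
                             'Spy|Mystery|Bio',
                             'Drama|Romance',
                             'Drama|Romance',
                             'Drama|Spy']})

# One row per (year, genre) pair
exploded = df.assign(genre=df['genre'].str.split('|')).explode('genre')

# Rows are years, columns are genres, cells are occurrence counts;
# combinations that never occur are filled with 0
counts = pd.crosstab(exploded['year'], exploded['genre'])
print(counts)
```

Dropping normalize=True from the value_counts call in the chain above should give the same counts.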

From here you can finish by drawing an area plot:

(df.assign(genre=df['genre'].str.split('|'))
   .explode('genre')
   .groupby('year')['genre']
   .value_counts(normalize=True)
   .unstack(fill_value=0)
   .plot
   .area())

How it works

Start by exploding the data across rows:

df.assign(genre=df['genre'].str.split('|')).explode('genre') 

   year     genre
0  1960     Drama
0  1960   Romance
0  1960  Thriller
1  1960       Spy
1  1960   Mystery
1  1960       Bio
2  1961     Drama
2  1961   Romance
3  1961     Drama
3  1961   Romance
4  1961     Drama
4  1961       Spy

Next, do a groupby and get the normalized counts:

_.groupby('year')['genre'].value_counts(normalize=True)

year  genre   
1960  Bio         0.166667
      Drama       0.166667
      Mystery     0.166667
      Romance     0.166667
      Spy         0.166667
      Thriller    0.166667
1961  Drama       0.500000
      Romance     0.333333
      Spy         0.166667
Name: genre, dtype: float64

Next, unstack the result:

_.unstack(fill_value=0)

genre       Bio     Drama   Mystery   Romance       Spy  Thriller
year                                                             
1960   0.166667  0.166667  0.166667  0.166667  0.166667  0.166667
1961   0.000000  0.500000  0.000000  0.333333  0.166667  0.000000

Finally, plot with

_.plot.area()
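
Note that `_` holds the previous result only in an interactive session (IPython or the Python REPL). In a script, the same steps can be written with named intermediates. This is a sketch on the question's df; the names exploded, shares and wide are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'year': [1960, 1960, 1961, 1961, 1961],
                   'genre': ['Drama|Romance|Thriller',
                             'Spy|Mystery|Bio',
                             'Drama|Romance',
                             'Drama|Romance',
                             'Drama|Spy']})

# Step 1: split each pipe-delimited string into a list, then one row per genre
exploded = df.assign(genre=df['genre'].str.split('|')).explode('genre')

# Step 2: share of each genre within its year (Series indexed by year, genre)
shares = exploded.groupby('year')['genre'].value_counts(normalize=True)

# Step 3: pivot the genre level into columns; absent genres become 0
wide = shares.unstack(fill_value=0)
print(wide)

# Step 4 (not run here): wide.plot.area() draws the stacked area chart
```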

Answer 2

You could rearrange your data first:

import pandas as pd
from itertools import groupby
from collections import defaultdict

data = """
1960  Drama|Romance|Thriller
1960         Spy|Mystery|Bio
1961           Drama|Romance
1961           Drama|Romance
1961               Drama|Spy
"""

# sort it first by year
lst = sorted((line.split() for line in data.split("\n") if line), key=lambda x: x[0])

# group it by year, expand the genres
result = {}
for key, values in groupby(lst, key=lambda x: x[0]):
    dct = defaultdict(int)
    for row in values:  # renamed from lst to avoid shadowing the outer list
        for genre in row[1].split("|"):
            dct[genre] += 1
    result[key] = dct

# feed it all to pandas
df = pd.DataFrame.from_dict(result, orient='index').fillna(0)

print(df)

This produces

      Drama  Romance  Thriller  Spy  Mystery  Bio
1960      1        1       1.0    1      1.0  1.0
1961      3        2       0.0    1      0.0  0.0
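
The float columns above appear because genres missing from a year start out as NaN before fillna(0). A variant of the same idea keeps integer counts by using collections.Counter and casting at the end. This is a sketch, not the answer's original code: the rows list and the names counts_by_year and df_counts are assumptions made for illustration.

```python
from collections import Counter

import pandas as pd

# (year, genre string) pairs taken from the question's data
rows = [(1960, 'Drama|Romance|Thriller'),
        (1960, 'Spy|Mystery|Bio'),
        (1961, 'Drama|Romance'),
        (1961, 'Drama|Romance'),
        (1961, 'Drama|Spy')]

# Accumulate one Counter of genres per year; no pre-sorting is needed
counts_by_year = {}
for year, genres in rows:
    counts_by_year.setdefault(year, Counter()).update(genres.split('|'))

# Missing genres become NaN, so fill with 0 and cast back to integers
df_counts = (pd.DataFrame.from_dict(counts_by_year, orient='index')
               .fillna(0)
               .astype(int))
print(df_counts)
```

Unlike the itertools.groupby version, this keeps the years as integers in the index rather than strings.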

Answer 1
2 · accepted · 2020-07-18 20:58:28

Answer 2
2 · 2020-07-18 21:06:20
