
Faster alternative to groupby, unstack then fillna

I'm currently performing the following operations on a dataframe (A) made of two columns, each with several thousand unique values.

>>> import pandas as pd
>>> df = pd.DataFrame({
...     'col1': ['foo', 'bar', 'bar', 'foo', 'baz', 'bar', 'baz'],
...     'col2': ['abc', 'def', 'abc', 'abc', 'def', 'abc', 'ghi']
... })
>>> df
  col1 col2
0  foo  abc
1  bar  def
2  bar  abc
3  foo  abc
4  baz  def
5  bar  abc
6  baz  ghi

The operations performed on this dataframe are:

res = df.groupby(['col1', 'col2']).size().unstack().fillna(0)

The output is a table (B) with the unique values of col1 as rows and the unique values of col2 as columns. Each cell holds the number of rows in the original dataframe that match that combination of values.

>>> res
col2  abc  def  ghi
col1               
bar   2.0  1.0  0.0
baz   0.0  1.0  1.0
foo   2.0  0.0  0.0

The amount of time spent in each operation is approximately the following:

  • groupby().size() -> 5%
  • unstack() -> 15%
  • fillna(0) -> 80%

The whole sequence can take about 30 minutes on a real dataset (same structure as above, just more rows and more unique values).

Is there a better/faster alternative to get from (A) the original dataframe to (B) the end-result table? The most costly operation is by far the final fillna(0), so I'd be interested in an alternative for this bit in particular, but an entirely different approach would be great as well.

Note: converting the strings to integers in the original df speeds up the groupby().size() operation by about 5x; however, it doesn't really affect the subsequent operations.
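The note above does not say how the strings were converted to integers. One possible way, shown here only as a sketch, is pd.factorize, which returns integer codes plus the unique values needed to map the codes back to the original labels. The names df_int, codes1, codes2, uniques1 and uniques2 are illustrative and not part of the original question.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['foo', 'bar', 'bar', 'foo', 'baz', 'bar', 'baz'],
    'col2': ['abc', 'def', 'abc', 'abc', 'def', 'abc', 'ghi'],
})

# factorize returns (integer codes, unique values in order of appearance)
codes1, uniques1 = pd.factorize(df['col1'])
codes2, uniques2 = pd.factorize(df['col2'])

# Assumed intermediate frame holding integer codes instead of strings
df_int = pd.DataFrame({'col1': codes1, 'col2': codes2})

# Same grouping as in the question, now on integer keys
sizes = df_int.groupby(['col1', 'col2']).size()
```

The uniques1 and uniques2 objects can be used afterwards to relabel the rows and columns of the final table.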

Fill the missing values in the same step as the unstack by setting a fill_value:

 >>> df.groupby(['col1', 'col2']).size().unstack(fill_value=0)
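As a quick sanity check on the toy dataframe from the question, the sketch below compares the two-step unstack().fillna(0) result with the one-step unstack(fill_value=0) result. The variable names (counts, two_step, one_step, same_values) are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['foo', 'bar', 'bar', 'foo', 'baz', 'bar', 'baz'],
    'col2': ['abc', 'def', 'abc', 'abc', 'def', 'abc', 'ghi'],
})

counts = df.groupby(['col1', 'col2']).size()

# Original approach: unstack introduces NaN, then a separate pass fills it
two_step = counts.unstack().fillna(0)

# Suggested approach: missing combinations are filled while unstacking
one_step = counts.unstack(fill_value=0)

# Element-wise comparison of the two tables
same_values = bool((two_step == one_step).all().all())
```

Because no NaN is ever introduced in the one-step version, the counts should generally not need to be promoted to floats as they are in the res output shown in the question.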

timeit on Google Colab:

%timeit df.groupby(['col1', 'col2']).size().unstack().fillna(0)
1000 loops, best of 5: 1.54 ms per loop

%timeit df.groupby(['col1', 'col2']).size().unstack(fill_value=0)
1000 loops, best of 5: 1.47 ms per loop

%timeit df.groupby(['col1','col2'])['col2'].count().unstack(fill_value=0)
1000 loops, best of 5: 1.43 ms per loop

%timeit pd.crosstab(index=df.col1, columns=df.col2)
100 loops, best of 5: 8.11 ms per loop
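For reference, the pd.crosstab call timed above builds the same count table as the groupby route. The sketch below checks this on the toy dataframe from the question; the variable names are illustrative, and the check says nothing about relative speed on larger data.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['foo', 'bar', 'bar', 'foo', 'baz', 'bar', 'baz'],
    'col2': ['abc', 'def', 'abc', 'abc', 'def', 'abc', 'ghi'],
})

# Count table via crosstab (missing combinations are reported as 0)
ct = pd.crosstab(index=df.col1, columns=df.col2)

# Count table via groupby + unstack with a fill value
via_groupby = df.groupby(['col1', 'col2']).size().unstack(fill_value=0)

# Compare the cell values of the two tables
same_counts = bool((ct.to_numpy() == via_groupby.to_numpy()).all())
```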

Update: I've included rafaelc's answer in the timings.
