Fastest and most efficient way to apply a groupby that references multiple columns

Question

Suppose we have the following dataset.

import pandas as pd

tmp = pd.DataFrame({'hi': [1,2,3,3,5,6,3,2,3,2,1],
                    'bye': [12,23,35,35,53,62,31,22,33,22,12],
                    'yes': [12,2,32,3,5,6,23,2,32,2,21],
                    'no': [1,92,93,3,95,6,33,2,33,22,1],
                    'maybe': [91,2,32,3,95,69,3,2,93,2,1]})

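In Python, we can easily run tmp.groupby('hi').agg(total_bye=('bye', 'sum')) to get the sum of bye for each group. But if the aggregation needs to reference multiple columns, what is the fastest, most efficient, and cleanest (easy-to-read, minimal) way to write it in Python? In particular, can I do this with df.groupby(my_cols).agg()? What are the fastest alternatives? I am open to (and would actually prefer) libraries that are faster than pandas, such as dask or vaex.

For example, in R data.table this is easy to write and runs very fast: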


# In R, assume this object is a data.table
# In a single line, the code below groups by 'hi' and creates my_new_col: the per-group sum of 'no' over rows where bye > 5 and yes < 20.
tmp[, .(my_new_col = sum(ifelse(bye > 5 & yes < 20, no, 0))), by = 'hi']

# output 1
   hi my_new_col
1:  1          1
2:  2        116
3:  3          3
4:  5         95
5:  6          6

# Similarly, we can even group by a rule instead of creating a new col to group by. See below

tmp[, .(my_new_col = sum(ifelse(bye > 5 & yes < 20, no, 0))), by = .(new_rule = ifelse(hi > 3, 1, 0))]

# output 2
   new_rule my_new_col
1:        0        120
2:        1        101

# We can even apply multiple aggregate functions in parallel using data.table
# (mclapply comes from the 'parallel' package)
library(parallel)
agg_fns <- function(x) list(sum=sum(as.double(x), na.rm=TRUE),
                            mean=mean(as.double(x), na.rm=TRUE),
                            min=min(as.double(x), na.rm=TRUE),
                            max=max(as.double(x), na.rm=TRUE))

tmp[,
    unlist(
        list(N = .N, # add a N column (row count) to the summary
            unlist(mclapply(.SD, agg_fns, mc.cores = 12), recursive = F)), # apply all agg_fns over all .SDcols
    recursive = F),
    .SDcols = !unique(c(names('hi'), as.character(unlist('hi'))))]

# output 3
   N bye.sum bye.mean bye.min bye.max yes.sum yes.mean yes.min yes.max no.sum  no.mean no.min
1: 11     340 30.90909      12      62     140 12.72727       2      32    381 34.63636      1
   no.max maybe.sum maybe.mean maybe.min maybe.max
1:     95       393   35.72727         1        95

Do we have the same flexibility in Python?

Answer 1

You can use agg on all the columns you need and add a prefix:

tmp.groupby('hi').agg('sum').add_prefix('total_')

Output:

    total_bye  total_yes  total_no  total_maybe
hi                                             
1          24         33         2           92
2          67          6       116            6
3         134         90       162          131
5          53          5        95           95
6          62          6         6           69
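
A list of function names can be passed to agg in the same way. The sketch below is an addition to the answer and assumes the tmp frame from the question; it computes a sum/mean/min/max summary once for the whole table and once per group. The dotted column names (for example bye.sum) are only an illustrative choice that mimics the data.table output, and the variable names summary and per_group are hypothetical.

```python
import pandas as pd

tmp = pd.DataFrame({'hi': [1,2,3,3,5,6,3,2,3,2,1],
                    'bye': [12,23,35,35,53,62,31,22,33,22,12],
                    'yes': [12,2,32,3,5,6,23,2,32,2,21],
                    'no': [1,92,93,3,95,6,33,2,33,22,1],
                    'maybe': [91,2,32,3,95,69,3,2,93,2,1]})

funcs = ['sum', 'mean', 'min', 'max']

# Whole-table summary over every column except the key 'hi':
# one row per function, one column per original column.
summary = tmp.drop(columns='hi').agg(funcs)

# Per-group summary: the result has two-level columns (column, function).
per_group = tmp.groupby('hi').agg(funcs)

# Flatten the two-level columns into names such as 'bye.sum'.
per_group.columns = ['.'.join(col) for col in per_group.columns]

# Row count per group, comparable to .N in data.table.
per_group.insert(0, 'N', tmp.groupby('hi').size())

print(summary)
print(per_group)
```

Passing function names as strings lets pandas use its built-in grouped implementations instead of calling a Python function once per group.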

You can even use a dictionary to combine columns and operations flexibly:

tmp.groupby('hi').agg(**{'%s_%s' % (label,c):  (c, op)
                         for c in tmp.columns
                         for (label,op) in [('total', 'sum'), ('average', 'mean')]
                        })

Output:

    total_hi  average_hi  total_bye  average_bye  total_yes  average_yes  total_no  average_no  total_maybe  average_maybe
hi                                                                                                                        
1          2           1         24    12.000000         33         16.5         2    1.000000           92          46.00
2          6           2         67    22.333333          6          2.0       116   38.666667            6           2.00
3         12           3        134    33.500000         90         22.5       162   40.500000          131          32.75
5          5           5         53    53.000000          5          5.0        95   95.000000           95          95.00
6          6           6         62    62.000000          6          6.0         6    6.000000           69          69.00
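
For an aggregation that references several columns at once, as in the two data.table calls in the question, one option is to build the row-wise condition with vectorized operations first and then group the resulting Series. This is a minimal sketch of that idea, not a benchmarked claim about the fastest approach; the names masked_no, by_hi, new_rule, and by_rule are illustrative.

```python
import pandas as pd

tmp = pd.DataFrame({'hi': [1,2,3,3,5,6,3,2,3,2,1],
                    'bye': [12,23,35,35,53,62,31,22,33,22,12],
                    'yes': [12,2,32,3,5,6,23,2,32,2,21],
                    'no': [1,92,93,3,95,6,33,2,33,22,1],
                    'maybe': [91,2,32,3,95,69,3,2,93,2,1]})

# Keep 'no' where bye > 5 and yes < 20, otherwise use 0.
# This mirrors ifelse(bye > 5 & yes < 20, no, 0) in the R code.
masked_no = tmp['no'].where((tmp['bye'] > 5) & (tmp['yes'] < 20), 0)

# Group by an existing column (first data.table call).
by_hi = (masked_no.groupby(tmp['hi'])
                  .sum()
                  .rename('my_new_col')
                  .reset_index())

# Group by a rule without adding a column to tmp (second data.table call).
new_rule = (tmp['hi'] > 3).astype(int).rename('new_rule')
by_rule = (masked_no.groupby(new_rule)
                    .sum()
                    .rename('my_new_col')
                    .reset_index())

print(by_hi)
print(by_rule)
```

Because the condition is evaluated once over whole columns, this avoids groupby(...).apply with a Python function per group, which is usually the slower route in pandas. Both results should match output 1 and output 2 in the question.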

1 answer. Answer 1: accepted, 2021-08-14 05:26:53
