Creating a new column on a grouped data frame

Question

I want to create a new column that is computed per group from several columns of the current data frame. In R (tidyverse) it is basically this:

library(tidyverse)

# tibble() replaces the deprecated data_frame()
data <- tibble(
  a = c(1, 2, 1, 2, 3, 1, 2),
  b = c(1, 1, 1, 1, 1, 1, 1),
  c = c(1, 0, 1, 1, 0, 0, 1),
)

data %>% 
  group_by(a) %>% 
  mutate(d = cumsum(b) * c)

In pandas, I think I should use groupby and apply to create the new column, then assign it back to the original data frame. This is what I have tried so far:

import numpy as np
import pandas as pd

def create_new_column(data):
    return np.cumsum(data['b']) * data['c']    

data = pd.DataFrame({
    'a': [1, 2, 1, 2, 3, 1, 2],
    'b': [1, 1, 1, 1, 1, 1, 1],
    'c': [1, 0, 1, 1, 0, 0, 1],
})

# assign - throws error
data['d'] = data.groupby('a').apply(create_new_column)

# assign without index - incorrect order in output
data['d'] = data.groupby('a').apply(create_new_column).values

# assign to sorted data frame
data_sorted = data.sort_values('a')
data_sorted['d'] = data_sorted.groupby('a').apply(create_new_column).values

What is the preferred way to achieve this (ideally without sorting the data)?

Answer 1

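Add the parameter group_keys=False to avoid the MultiIndex, so that the result keeps the original index and can be assigned back as a new column: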

data['d'] = data.groupby('a', group_keys=False).apply(create_new_column)

An alternative is to drop the first level of the MultiIndex:

data['d'] = data.groupby('a').apply(create_new_column).reset_index(level=0, drop=True)

print (data)
   a  b  c  d
0  1  1  1  1
1  2  1  0  0
2  1  1  1  2
3  2  1  1  2
4  3  1  0  0
5  1  1  0  0
6  2  1  1  3
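
Both variants work because assigning a Series to a column aligns on index labels rather than on position. The sketch below illustrates that with a hand-built Series (`grouped_result` is a hypothetical stand-in for a group-ordered result that still carries the original row labels); it also shows why `.values` scrambles the rows.

```python
import pandas as pd

data = pd.DataFrame({
    'a': [1, 2, 1, 2, 3, 1, 2],
    'b': [1, 1, 1, 1, 1, 1, 1],
    'c': [1, 0, 1, 1, 0, 0, 1],
})

# Hypothetical stand-in for a group-ordered result: values are ordered by
# group (a=1, then a=2, then a=3) but keep the original row labels.
grouped_result = pd.Series([1, 2, 0, 0, 2, 3, 0], index=[0, 2, 5, 1, 3, 6, 4])

# Assigning the Series aligns on index labels, so each value lands on its row.
aligned = data.assign(d=grouped_result)

# Assigning the bare array is positional, so the group order leaks into the rows.
positional = data.assign(d=grouped_result.values)

print(aligned['d'].tolist())
print(positional['d'].tolist())
```

The first list matches the `d` column above; the second is the incorrectly ordered result from the question's `.values` attempt.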

Details:

print (data.groupby('a').apply(create_new_column))
a   
1  0    1
   2    2
   5    0
2  1    0
   3    2
   6    3
3  4    0
dtype: int64

print (data.groupby('a', group_keys=False).apply(create_new_column))
0    1
2    2
5    0
1    0
3    2
6    3
4    0
dtype: int64
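
For this particular computation, `apply` may not be needed at all. A minimal sketch, assuming the same `data` frame as in the question: the grouped cumulative sum already comes back indexed like the original frame, so it can be multiplied by `c` and assigned directly, with no sorting and no MultiIndex.

```python
import pandas as pd

data = pd.DataFrame({
    'a': [1, 2, 1, 2, 3, 1, 2],
    'b': [1, 1, 1, 1, 1, 1, 1],
    'c': [1, 0, 1, 1, 0, 0, 1],
})

# The grouped cumsum returns a Series with the original index, so the
# multiplication and the assignment both line up row by row.
data['d'] = data.groupby('a')['b'].cumsum() * data['c']

print(data)
```

This only covers cases where the per-group step has a built-in grouped method such as `cumsum`; an arbitrary function over several columns still needs `apply` as shown above.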

Answer 2

You can now also do this in Python much as you would in R, using datar:

>>> from datar.all import c, f, tibble, cumsum, group_by, mutate
>>> 
>>> data = tibble(
...   a = c(1, 2, 1, 2, 3, 1, 2),
...   b = c(1, 1, 1, 1, 1, 1, 1),
...   c = c(1, 0, 1, 1, 0, 0, 1),
... )
>>> 
>>> (data >>
...  group_by(f.a) >>
...  mutate(d=cumsum(f.b) * f.c))
   a  b  c  d
0  1  1  1  1
1  2  1  0  0
2  1  1  1  2
3  2  1  1  2
4  3  1  0  0
5  1  1  0  0
6  2  1  1  3
[Groups: ['a'] (n=3)]

I am the author of the package. Feel free to submit issues if you have any questions.
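
For readers who would rather stay in plain pandas, a pipeline in a similar spirit can be sketched with method chaining. This is not part of datar; it assumes only `DataFrame.assign` with a callable, which computes the new column from the frame it receives and returns a new frame without modifying the original.

```python
import pandas as pd

data = pd.DataFrame({
    'a': [1, 2, 1, 2, 3, 1, 2],
    'b': [1, 1, 1, 1, 1, 1, 1],
    'c': [1, 0, 1, 1, 0, 0, 1],
})

# assign() returns a new frame; the lambda receives the frame being piped,
# which plays a role loosely similar to mutate() in the R pipeline.
result = data.assign(d=lambda df: df.groupby('a')['b'].cumsum() * df['c'])

print(result)
```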

Answer 1: score 2, accepted, 2019-01-10 09:19:44

Answer 2: score 1, 2021-06-08 20:52:38
