Copying results of a function apply after groupby into a pandas column
I am trying to do the pandas equivalent of the following data.table operations:
dt <- data.table(id = 1:10, x = rnorm(40))
dt <- dt[order(id)]
dt[, diff_x := c(0,diff(x)), by = id]
head(dt, 12)
# output:
id x diff_x
1: 1 0.01419519 0.00000000
2: 1 -0.39539869 -0.40959388
3: 1 -0.43918689 -0.04378821
4: 1 -0.79905967 -0.35987278
5: 2 0.59555572 0.00000000
6: 2 -0.21933639 -0.81489211
7: 2 -0.65462968 -0.43529329
8: 2 0.99307684 1.64770652
9: 3 -1.31185544 0.00000000
10: 3 1.23649358 2.54834902
11: 3 0.66359594 -0.57289764
12: 3 1.77078647 1.10719053
First of all, I am not sure how to do a diff with the zero padding I used above in an easy way, so I wrote my own function for that. More importantly, I am not sure how to copy the result of my groupby operation back into my pandas dataframe as a new column (the way I do easily above with data.table). Here is what I tried so far:
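import numpy as np
import pandas as pd

def diff_pad(vect):
    # prepend 0 so the result has the same length as the input
    return np.concatenate([[0], np.diff(vect)])

df = pd.DataFrame()
df['id'] = list(range(1, 11)) * 4
df.sort_values('id', inplace=True)  # df.sort() was removed from pandas
df['x'] = np.random.rand(40)
diffz = df.groupby('id')['x'].apply(diff_pad)
df['diffz'] = diffz  # problem: diffz holds one array per id, indexed by id
print(df.head(10))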
#out:
id x diffz
0 1 0.757153 NaN
30 1 0.869001 NaN
10 1 0.140684 [0.0, 0.362003972215, -0.742119725957, -0.0684...
20 1 0.791483 NaN
21 2 0.941333 NaN
1 2 0.504867 [0.0, 0.111848720078, -0.728317633944, 0.65079...
31 2 0.273321 NaN
11 2 0.118802 NaN
2 3 0.848048 [0.0, -0.436465430463, -0.231545666932, -0.154...
12 3 0.357192 NaN
Edit:
In R/data.table, I can apply an arbitrary function that takes any columns of the table, grouped by another set of columns, and assign the result to a new column. E.g.:
library(data.table)
dt <- data.table(id = 1:10, x = rnorm(40), y = rnorm(40))
dt <- dt[order(id)]
my_funct <- function(x, y) {
return(sqrt(max(x)^2 + min(y)^2))
}
dt[, z := my_funct(x, y), by = id]
head(dt, 12)
# out:
id x y z
1: 1 0.26012913 0.7612974 1.2433969
2: 1 1.19113080 1.4228528 1.2433969
3: 1 -0.07970657 -0.3567118 1.2433969
4: 1 -0.33129374 0.7879845 1.2433969
5: 2 0.60868698 0.9716669 0.8872687
6: 2 -0.72751776 0.0392282 0.8872687
7: 2 -0.17724141 0.2599093 0.8872687
8: 2 0.13324134 -0.6455587 0.8872687
9: 3 -1.91015664 -1.1340993 2.2408919
10: 3 -0.95696559 -0.2624625 2.2408919
11: 3 1.93272221 0.2788335 2.2408919
12: 3 0.46391776 -0.9080321 2.2408919
How would I do something like that in pandas?
First off, welcome to pandas!
Second, I'd start off by defining df like this. This is a style preference of mine and by no means canonical.
import numpy as np
import pandas as pd
df = pd.DataFrame(dict(
id=np.repeat(np.arange(1, 11), 4),
x=np.random.randn(40)
))
Lastly, if I understood you correctly:
df['x_diff'] = df.groupby('id').x.diff().fillna(0)
df
You could also have used apply with your own function, like this:
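def my_diff(x):
    return x.diff().fillna(0)

# select x so the id column itself is not differenced;
# group_keys=False keeps the original row index on the result
df.groupby('id', group_keys=False)['x'].apply(my_diff)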
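For the second example in the question, where my_funct takes two columns and yields a single value per group, an apply-based sketch could look like the following. The random data, the z_per_id name and the map step are illustrative assumptions, not something the original answer spelled out.

```python
import numpy as np
import pandas as pd

# Illustrative data shaped like the question's second example (values are random).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "id": np.repeat(np.arange(1, 11), 4),
    "x": rng.standard_normal(40),
    "y": rng.standard_normal(40),
})

def my_funct(x, y):
    # Same formula as the R function in the question: one scalar per group.
    return np.sqrt(x.max() ** 2 + y.min() ** 2)

# One value per id; selecting ["x", "y"] keeps the grouping column out of the function.
z_per_id = df.groupby("id")[["x", "y"]].apply(lambda g: my_funct(g["x"], g["y"]))

# Broadcast the per-group scalar back onto the rows by looking up each row's id.
df["z"] = df["id"].map(z_per_id)
```

Computing one value per id and then mapping it back keeps the per-group step and the assignment step separate, which makes each easy to inspect on its own.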
Your attempt didn't work because diff_pad returns a NumPy array, which carries no index values to line up with the pandas Series your function was being applied to. You can see in your results that the answer is there, but each group's array is crammed into a single cell.
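As a minimal sketch of one way around the missing index, the question's diff_pad can be kept unchanged and passed to transform instead of apply. This assumes that transform wraps each same-length group result with that group's row index, which is how it behaves in the pandas versions I am aware of. The builtin variable is only there as a cross-check.

```python
import numpy as np
import pandas as pd

def diff_pad(vect):
    # The question's helper: returns a bare NumPy array, so no index is attached.
    return np.concatenate([[0], np.diff(vect)])

df = pd.DataFrame({
    "id": np.repeat(np.arange(1, 11), 4),
    "x": np.random.randn(40),
})

# transform expects a result of the same length as each group and lines it up
# with that group's rows, so the values can be assigned directly as a column.
df["diffz"] = df.groupby("id")["x"].transform(diff_pad)

# Cross-check against the built-in grouped diff used in the answer.
builtin = df.groupby("id")["x"].diff().fillna(0)
```

Here diffz and builtin should hold the same values, with a 0 in the first row of every id.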