pandas fillna by group for multiple columns
In a dataset like this one (CSV format), where several columns carry values, how can I use fillna alongside df.groupby("DateSent") to fill all desired columns with min()/3 of the group?
In [5]: df.head()
Out[5]:
ID DateAcquired DateSent data value measurement values
0 1 20210518 20220110 6358.434713 556.0 317.869897 3.565781
1 1 20210719 20220210 6508.458382 1468.0 774.337509 5.565384
2 1 20210719 20220310 6508.466246 1.0 40.837533 1.278085
3 1 20200420 20220410 6507.664194 48.0 64.335047 1.604183
4 1 20210328 20220510 6508.451227 0.0 40.337486 1.270236
According to this other thread on SO, one way of doing it would be one by one:
df["data"] = df.groupby("DateSent")["data"].transform(lambda x: x.fillna(x.min()/3))
df["value"] = df.groupby("DateSent")["value"].transform(lambda x: x.fillna(x.min()/3))
df["measurement"] = df.groupby("DateSent")["measurement"].transform(lambda x: x.fillna(x.min()/3))
df["values"] = df.groupby("DateSent")["values"].transform(lambda x: x.fillna(x.min()/3))
In my original dataset, where I have 100000 such columns, I can technically loop over all desired column names. But is there a better/faster way of doing this? Perhaps something already implemented in pandas?
One way you could do this is to get all the columns you want to impute into a list. I will assume that you want all the numerical columns (except ID, DateAcquired, DateSent):
fti = [i for i in df.iloc[:,3:].columns if df[i].dtypes != 'object'] # features to impute
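An alternative that does not depend on column position is select_dtypes, which picks the numeric columns directly; a small sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2],
    "DateAcquired": [20210518, 20210719],
    "DateSent": [20220110, 20220210],
    "data": [1.0, np.nan],
    "value": [np.nan, 2.0],
})

# Numeric columns, minus the identifier/date columns we do not want to impute
fti = df.select_dtypes("number").columns.difference(["ID", "DateAcquired", "DateSent"]).tolist()
```

Note that Index.difference returns the names sorted alphabetically.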
Then, you can create a new df containing only the imputed values:
imputed = df.groupby("DateSent")[fti].transform(lambda x: x.fillna(x.min()/3))
imputed.head(5)
data value measurement values
0 6358.434713 556.0 317.869897 3.565781
1 6508.458382 1468.0 774.337509 5.565384
2 6508.466246 1.0 40.837533 1.278085
3 6507.664194 48.0 64.335047 1.604183
4 6508.451227 0.0 40.337486 1.270236
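If the lambda proves slow at this scale, one possible speed-up (an assumption, not benchmarked on the original data) is to compute the group minima with the built-in "min" aggregation and fill all NaNs in a single vectorized step, avoiding a Python-level lambda call per group:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "DateSent": [20220110, 20220110, 20220210],
    "data": [1.0, np.nan, np.nan],
    "value": [np.nan, 9.0, 12.0],
})
fti = ["data", "value"]

# Broadcast each group's min to its rows, then fill every NaN at once
group_min = df.groupby("DateSent")[fti].transform("min")
imputed = df[fti].fillna(group_min / 3)
```

A group whose column is entirely NaN stays NaN, since its min is NaN.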
Lastly, you can concat:
res = pd.concat([df[df.columns.symmetric_difference(imputed.columns)], imputed], axis=1)
res.head(15)
DateAcquired DateSent ID data value measurement values
0 20210518 20220110 1 6358.434713 556.0 317.869897 3.565781
1 20210719 20220210 1 6508.458382 1468.0 774.337509 5.565384
2 20210719 20220310 1 6508.466246 1.0 40.837533 1.278085
3 20200420 20220410 1 6507.664194 48.0 64.335047 1.604183
4 20210328 20220510 1 6508.451227 0.0 40.337486 1.270236
5 20210518 20220610 1 6508.474031 3.0 15.000000 0.774597
6 20210108 20220110 2 6508.402472 897.0 488.837335 4.421933
7 20210110 20220210 2 6508.410493 52.0 111.000000 2.107131
8 20210119 20220310 2 6508.419065 800.0 440.337387 4.196844
9 20210108 20220410 2 6508.426063 89.0 84.837408 1.842144
10 20200109 20220510 2 6507.647600 978.0 529.334996 4.601456
11 20210919 20220610 2 6508.505563 1566.0 823.337655 5.738772
12 20211214 20220612 2 6508.528918 152.0 500.000000 4.472136
13 20210812 20220620 2 6508.497936 668.0 374.337631 3.869561
14 20210909 20220630 2 6508.506350 489.0 284.837657 3.375427
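As the output above shows, the symmetric_difference/concat route reorders the columns alphabetically (ID ends up after the date columns). If the original column order matters, a simpler option is to assign the imputed frame back in place; a sketch on toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 2],
    "DateSent": [20220110, 20220110, 20220210],
    "data": [1.0, np.nan, 3.0],
})
fti = ["data"]

imputed = df.groupby("DateSent")[fti].transform(lambda x: x.fillna(x.min() / 3))

# Assigning back overwrites only the imputed columns and keeps the
# original column order, so no concat is needed
df[fti] = imputed
```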