
pandas fillna by group for multiple columns

In a dataset like this one (CSV format), where there are several columns with values, how can I use fillna together with df.groupby("DateSent") to fill all of the desired columns with the group's min()/3?

In [5]: df.head()
Out[5]: 
  ID  DateAcquired  DateSent         data   value  measurement    values
0  1      20210518  20220110  6358.434713   556.0   317.869897  3.565781
1  1      20210719  20220210  6508.458382  1468.0   774.337509  5.565384
2  1      20210719  20220310  6508.466246     1.0    40.837533  1.278085
3  1      20200420  20220410  6507.664194    48.0    64.335047  1.604183
4  1      20210328  20220510  6508.451227     0.0    40.337486  1.270236

According to this other thread on SO, one way of doing it would be one by one:

df["data"]        = df.groupby("DateSent")["data"].transform(lambda x: x.fillna(x.min()/3))
df["value"]       = df.groupby("DateSent")["value"].transform(lambda x: x.fillna(x.min()/3))
df["measurement"] = df.groupby("DateSent")["measurement"].transform(lambda x: x.fillna(x.min()/3))
df["values"]      = df.groupby("DateSent")["values"].transform(lambda x: x.fillna(x.min()/3))

In my original dataset, where I have 100,000 such columns, I can technically loop over all the desired column names. But is there a better/faster way of doing this? Perhaps something already implemented in pandas?
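For illustration, the explicit loop would look something like this (a minimal sketch; cols_to_fill is a hypothetical list holding the names of the columns to impute):

# Sketch of the explicit per-column loop; cols_to_fill is hypothetical.
cols_to_fill = ["data", "value", "measurement", "values"]
for col in cols_to_fill:
    df[col] = df.groupby("DateSent")[col].transform(lambda x: x.fillna(x.min() / 3))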

One way you could do this is to get all the columns you want to impute into a list. I will assume that you want all the numerical columns (except ID, DateAcquired, and DateSent):

fti = [i for i in df.iloc[:,3:].columns if df[i].dtypes != 'object'] # features to impute
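If you prefer not to rely on column positions, an equivalent way to build that list (a sketch, assuming ID, DateAcquired and DateSent are the only columns to exclude) is select_dtypes:

# Alternative sketch: pick all numeric columns, then drop the identifiers by name.
fti = df.select_dtypes(include="number").columns.difference(
    ["ID", "DateAcquired", "DateSent"]
).tolist()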

Then, you can create a new df with only the imputed values:

imputed = df.groupby("DateSent")[fti].transform(lambda x: x.fillna(x.min()/3))

imputed.head(5)
          data   value  measurement    values
0  6358.434713   556.0   317.869897  3.565781
1  6508.458382  1468.0   774.337509  5.565384
2  6508.466246     1.0    40.837533  1.278085
3  6507.664194    48.0    64.335047  1.604183
4  6508.451227     0.0    40.337486  1.270236
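If the per-column lambda turns out to be slow with that many columns, one possible variant (a sketch, not benchmarked) is to compute the group minimums with the built-in "min" aggregation and let fillna align the two frames:

# Sketch: per-row group minimum for every column in fti, same shape as df[fti].
group_min = df.groupby("DateSent")[fti].transform("min")
# fillna with a DataFrame aligns on both index and columns.
imputed = df[fti].fillna(group_min / 3)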

Lastly, you can concat:

res = pd.concat([df[df.columns.symmetric_difference(imputed.columns)],imputed],axis=1)

res.head(15)

    DateAcquired  DateSent ID         data   value  measurement    values
0       20210518  20220110  1  6358.434713   556.0   317.869897  3.565781
1       20210719  20220210  1  6508.458382  1468.0   774.337509  5.565384
2       20210719  20220310  1  6508.466246     1.0    40.837533  1.278085
3       20200420  20220410  1  6507.664194    48.0    64.335047  1.604183
4       20210328  20220510  1  6508.451227     0.0    40.337486  1.270236
5       20210518  20220610  1  6508.474031     3.0    15.000000  0.774597
6       20210108  20220110  2  6508.402472   897.0   488.837335  4.421933
7       20210110  20220210  2  6508.410493    52.0   111.000000  2.107131
8       20210119  20220310  2  6508.419065   800.0   440.337387  4.196844
9       20210108  20220410  2  6508.426063    89.0    84.837408  1.842144
10      20200109  20220510  2  6507.647600   978.0   529.334996  4.601456
11      20210919  20220610  2  6508.505563  1566.0   823.337655  5.738772
12      20211214  20220612  2  6508.528918   152.0   500.000000  4.472136
13      20210812  20220620  2  6508.497936   668.0   374.337631  3.869561
14      20210909  20220630  2  6508.506350   489.0   284.837657  3.375427
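If keeping the original column order matters, an alternative to concat (a sketch under the same assumptions) is to write the imputed values back into a copy of the original frame:

# Sketch: overwrite only the imputed columns; column order stays unchanged.
res = df.copy()
res[fti] = imputed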
