Pandas replace all NaN values with data from a different dataframe
I'm pretty new to Pandas and am stuck on a problem: replacing NaN values with median values from a different dataframe. The median dataframe has a different shape, because I had to group the original df to get the medians.
My main dataframe df1 looks something like this:
permno yyyymm BookLeverage Cash RoE ShareIss1Y ShareIss5Y SP date industry_id STreversal Price Size ret
541 10006 197101 -1.907577 NaN 0.114616 0.000000 0.051689 1.197606 1971-01-29 37 -4.383562 -3.863358 -12.496377 0.043836
542 10006 197102 -1.907577 NaN 0.114616 0.000000 0.051689 1.220021 1971-02-26 37 0.577428 -3.844814 -12.477833 -0.005774
543 10006 197103 -1.907577 NaN 0.114616 0.000000 0.051689 1.118353 1971-03-31 37 -9.090909 -3.931826 -12.564844 0.090909
544 10006 197104 -1.907577 NaN 0.114616 0.000000 0.051689 NaN 1971-04-30 37 -16.176471 -4.081766 -12.714785 0.161765
545 10006 197105 -1.907577 NaN 0.114616 0.000000 0.051689 1.025366 1971-05-28 37 5.105485 -4.018633 -12.651651 -0.051055
Then I created a new dataframe df2 in which I grouped the former df by the yyyymm and industry_id columns and took the median for each time-industry panel.
The median df2 looks something like this:
permno BookLeverage Cash RoE ShareIss1Y \
yyyymm industry_id
197101 01 40957.5 -2.451327 NaN 0.015212 -0.306936
10 19254.0 -1.300565 NaN 0.123353 -0.002747
12 33081.5 -2.102402 NaN -0.001043 -0.255756
13 26470.0 -2.028418 NaN 0.116907 -0.005262
14 17830.0 -1.266574 NaN 0.110059 -0.000193
... ... ... ... ... ...
202112 80 78633.0 -3.037694 NaN 0.195342 NaN
82 52123.0 -3.093551 NaN 0.017580 NaN
83 13739.0 -2.802522 NaN 0.021025 NaN
87 78667.5 -3.103168 NaN 0.104524 NaN
97 91547.0 -3.054443 NaN 0.162610 NaN
ShareIss5Y SP STreversal Price Size \
yyyymm industry_id
197101 01 -7.591944 5.439985 -9.998244 -2.684046 -11.483201
10 -1.432833 0.517484 -4.504504 -3.367296 -11.826440
12 -20.622667 2.264890 -22.648810 -2.873900 -11.501783
13 -0.257821 0.752112 -5.429864 -3.607534 -12.362360
14 -0.223948 0.636665 -16.075773 -2.729726 -11.386150
... ... ... ... ... ...
202112 80 NaN NaN -10.960198 -4.539740 -16.024733
82 NaN NaN -1.664319 -2.740474 -13.882130
83 NaN NaN -2.383083 -4.835329 -15.843560
87 NaN NaN -5.109321 -4.585741 -15.844537
97 NaN NaN -1.535659 -4.487512 -16.339328
ret
yyyymm industry_id
197101 01 0.099982
10 0.045045
12 0.226488
13 0.054299
14 0.160758
... ...
202112 80 0.109602
82 0.016643
83 0.023831
87 0.051093
97 0.015357
What I'm now trying to achieve is to fill the NaN values in df1 with the corresponding values from df2. For example, the SP column in row 544 would get the value that df2 holds at yyyymm 197104 with industry_id 37.
I tried to loop over all rows and, inside that, over all columns and replace the NaN values, but this broke my dataframe:
def fill_nan_with_median(row):
    date = int(row['yyyymm'])
    industry = row['industry_id']
    for label, column in row.items():
        if column == np.nan:
            median = df_median.loc[(date, industry), label]
            df_1.loc[index, label] = median

for index, row in df_1.iterrows():
    fill_nan_with_median(row)
This is all done without data, so you may need to change something (hopefully not):
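# Median SP for each (yyyymm, industry_id) group, in a column named "median"
df_grouped_median = df1.groupby(['yyyymm', 'industry_id'], as_index=False).SP.median().rename(
    columns={"SP": "median"})
# Attach each row's group median (note: merge resets the row index)
df1 = df1.merge(df_grouped_median, on=['yyyymm', 'industry_id'], how='left')
# fillna returns a new Series, so assign the result back
df1['SP'] = df1['SP'].fillna(df1['median'])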
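If the goal is to fill several columns at once without building a separate median frame, one possible variation is groupby with transform. This is only a sketch on made-up toy data: the values, the `value_cols` list and the `group_medians` name are assumptions for illustration, not taken from the question.

```python
import numpy as np
import pandas as pd

# Toy data standing in for df1; the numbers are invented for illustration.
df1 = pd.DataFrame({
    "yyyymm": [197104, 197104, 197104, 197105, 197105],
    "industry_id": [37, 37, 37, 37, 37],
    "SP": [1.0, np.nan, 3.0, 4.0, np.nan],
    "Cash": [np.nan, 2.0, 4.0, np.nan, np.nan],
})

keys = ["yyyymm", "industry_id"]
value_cols = ["SP", "Cash"]  # hypothetical choice of columns to fill

# transform("median") returns a frame aligned to df1's index that holds
# each row's group median (NaNs are skipped when computing the median).
group_medians = df1.groupby(keys)[value_cols].transform("median")

# Fill only the missing cells; existing values are left untouched.
df1[value_cols] = df1[value_cols].fillna(group_medians)

print(df1)
```

A group whose values are all NaN (Cash in 197105 here) has no median, so those cells stay NaN. That matches the all-NaN Cash column visible in the question's df2.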
This answer takes a table lookup approach. For NaNs in the SP column it does a lookup into df2 for the median SP value. This answer also assumes that yyyymm and industry_id are strings and not numeric.
df1.apply(lambda x: x['SP'] if x['SP']==x['SP'] else df2.at[(x['yyyymm'],x['industry_id']),'SP'] , axis=1)
541 1.197606
542 1.220021
543 1.118353
544 0.636665
545 1.025366
Note that non-NaNs are detected by the weird-looking x['SP']==x['SP'], leveraging the fact that NaN != NaN.
Your df1 was used along with a df2 that I created:
SP
yyyymm industry_id
197104 01 5.439985
10 0.517484
12 2.264890
13 0.752112
37 0.636665
All that you need to do after that is assign the result back to the df1 frame:
df1.assign(SP=df1.apply(lambda x: x['SP'] if x['SP']==x['SP'] else df2.at[(x['yyyymm'],x['industry_id']),'SP'] , axis=1))
permno yyyymm BookLeverage Cash RoE ShareIss1Y ShareIss5Y \
541 10006 197101 -1.907577 NaN 0.114616 0.0 0.051689
542 10006 197102 -1.907577 NaN 0.114616 0.0 0.051689
543 10006 197103 -1.907577 NaN 0.114616 0.0 0.051689
544 10006 197104 -1.907577 NaN 0.114616 0.0 0.051689
545 10006 197105 -1.907577 NaN 0.114616 0.0 0.051689
SP date industry_id STreversal Price Size \
541 1.197606 1971-01-29 37 -4.383562 -3.863358 -12.496377
542 1.220021 1971-02-26 37 0.577428 -3.844814 -12.477833
543 1.118353 1971-03-31 37 -9.090909 -3.931826 -12.564844
544 0.636665 1971-04-30 37 -16.176471 -4.081766 -12.714785
545 1.025366 1971-05-28 37 5.105485 -4.018633 -12.651651
ret
541 0.043836
542 -0.005774
543 0.090909
544 0.161765
545 -0.051055
Or by:
df1['SP'] = df1.apply(lambda x: x['SP'] if x['SP']==x['SP'] else df2.at[(x['yyyymm'],x['industry_id']),'SP'] , axis=1)
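The row-wise apply raises a KeyError when a (yyyymm, industry_id) pair is missing from df2, and it handles one column at a time. A possible vectorized variation of the same lookup idea is sketched below. It assumes df2 has a unique MultiIndex and string keys as above; the data is a cut-down stand-in, and `row_keys`, `medians` and `filled` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Small stand-ins for df1 and df2; keys are strings, as the answer assumes.
df1 = pd.DataFrame(
    {
        "yyyymm": ["197103", "197104", "197105"],
        "industry_id": ["37", "37", "37"],
        "SP": [1.118353, np.nan, 1.025366],
    },
    index=[543, 544, 545],
)
df2 = pd.DataFrame(
    {"SP": [0.752112, 0.636665]},
    index=pd.MultiIndex.from_tuples(
        [("197104", "13"), ("197104", "37")], names=["yyyymm", "industry_id"]
    ),
)

# Look up one row of medians per df1 row; keys missing from df2 give NaN.
row_keys = pd.MultiIndex.from_frame(df1[["yyyymm", "industry_id"]])
medians = df2.reindex(row_keys)
medians.index = df1.index  # realign to df1's row labels

# Fill every column that df2 also has; other columns are untouched.
filled = df1.fillna(medians)

print(filled)
```

Because `fillna` aligns on both row labels and column names, the same call would cover every shared column when df2 carries the full set of medians.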