Treating NaN as zero in arithmetic operations?
Here's a simple example of the sort of thing I'm wrestling with:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: test = pd.DataFrame(np.random.randn(4,4),columns=list('ABCD'))
In [4]: for i in range(4):
   ...:     test.iloc[i,i] = np.nan
In [5]: test
Out[5]:
          A         B         C         D
0       NaN  0.136841 -0.854138 -1.890888
1 -1.261724       NaN  0.875647  1.312823
2  1.130999 -0.208402       NaN  0.256644
3 -0.158458 -0.305250  0.902756       NaN
Now, if I use sum to sum the rows, all the NaN values are treated as zeros:
In [6]: test['Sum'] = test.loc[:,'A':'D'].sum(axis=1)
In [7]: test
Out[7]:
          A         B         C         D       Sum
0       NaN  0.136841 -0.854138 -1.890888 -2.608185
1 -1.261724       NaN  0.875647  1.312823  0.926745
2  1.130999 -0.208402       NaN  0.256644  1.179241
3 -0.158458 -0.305250  0.902756       NaN  0.439048
But in my case, I may need to do a bit of work on the values first; for example, scaling them:
In [8]: test['Sum2'] = test.A + test.B/2 - test.C/3 + test.D
In [9]: test
Out[9]:
          A         B         C         D       Sum  Sum2
0       NaN  0.136841 -0.854138 -1.890888 -2.608185   NaN
1 -1.261724       NaN  0.875647  1.312823  0.926745   NaN
2  1.130999 -0.208402       NaN  0.256644  1.179241   NaN
3 -0.158458 -0.305250  0.902756       NaN  0.439048   NaN
As you see, the NaN values carry across into the arithmetic to produce NaN output, which is what you'd expect.
Now, I don't want to replace all NaN values in my dataframe with zeros: it is helpful to me to distinguish between zero and NaN. I could replace NaN with something else: I'm dealing with large volumes of student grades, and I need to distinguish between a grade of zero and a NaN, which at the moment I'm using to indicate that the particular assessment task was not attempted. (It takes the place of what would be a blank cell in a traditional spreadsheet.) But whatever I replace the NaN values with, it needs to be something that can be treated as zero in the operations I may perform. What are my options here?
Use the fillna function:
test['Sum2'] = test.A.fillna(0) + test.B.fillna(0)/2 - test.C.fillna(0)/3 + test.D.fillna(0)
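A small self-contained check of this approach, as a sketch. The frame below uses made-up values (not the question's data), with NaN standing for "not attempted". Since fillna(0) returns a new Series each time, the NaN markers in the original columns should survive:

```python
import numpy as np
import pandas as pd

# Illustrative data only (an assumption, not the question's values).
test = pd.DataFrame(
    {
        "A": [np.nan, 1.0, 2.0, 3.0],
        "B": [2.0, np.nan, 4.0, 6.0],
        "C": [3.0, 6.0, np.nan, 9.0],
        "D": [1.0, 1.0, 1.0, np.nan],
    }
)

# Each fillna(0) produces a temporary copy; the columns in test keep their NaN.
test["Sum2"] = (
    test.A.fillna(0) + test.B.fillna(0) / 2 - test.C.fillna(0) / 3 + test.D.fillna(0)
)

print(test)
```

The original columns A to D still contain one NaN each afterwards, so a zero grade and a missing grade remain distinguishable.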
If the dataframe is not huge you can try:
test["Sum"] = test.sum(axis=1)
test2 = test.fillna(0)
test["Sum2"] = test2.A + test2.B/2 - test2.C/3 + test2.D
del test2
It would be interesting to know whether there is a way to do the second sum in a single line.
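One possible single-line form, as a sketch: DataFrame.eval evaluates a string expression over the column names, so it can be chained after fillna(0). The frame below is made-up illustrative data, not the question's values, and this variant has not been timed against the others:

```python
import numpy as np
import pandas as pd

# Illustrative data only (an assumption, not the question's values).
test = pd.DataFrame(
    {
        "A": [np.nan, 1.0, 2.0, 3.0],
        "B": [2.0, np.nan, 4.0, 6.0],
        "C": [3.0, 6.0, np.nan, 9.0],
        "D": [1.0, 1.0, 1.0, np.nan],
    }
)

# fillna(0) makes a temporary zero-filled copy; eval computes the expression on it.
test["Sum2"] = test.fillna(0).eval("A + B/2 - C/3 + D")

print(test)
```

The temporary copy is discarded as soon as the line finishes, so no explicit del is needed.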
Update
If you have 1e5 rows or fewer, the method I suggested is slightly faster than the one suggested by kmcodes; with more rows, things change.
n = int(1e5)
test = pd.DataFrame(np.random.randn(n,4),columns=list('ABCD'))
for i in range(4):
    test.iloc[i,i] = np.nan
%%timeit
test2 = test.fillna(0)
test["Sum2"] = test2.A + test2.B/2 - test2.C/3 + test2.D
del test2
3.95 ms ± 51.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
test['Sum2'] = test.A.fillna(0) + test.B.fillna(0)/2 - test.C.fillna(0)/3 + test.D.fillna(0)
4.12 ms ± 16.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Update 2
In your case you can just do:
weights = [1, 1/2, -1/3, 1]
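# Select A..D explicitly so the four weights still line up if Sum/Sum2 columns already exist.
test["Sum2"] = test[list("ABCD")].fillna(0).mul(weights).sum(axis=1)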
Keep in mind that this seems to be consistently slower than the other two.
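The weighted sum is a matrix-vector product, so it can also be sketched with NumPy directly. This is an alternative swapped in for illustration, not something timed in this answer; the cols list and the sample values are assumptions:

```python
import numpy as np
import pandas as pd

# Illustrative data only (an assumption, not the question's values).
test = pd.DataFrame(
    {
        "A": [np.nan, 1.0, 2.0, 3.0],
        "B": [2.0, np.nan, 4.0, 6.0],
        "C": [3.0, 6.0, np.nan, 9.0],
        "D": [1.0, 1.0, 1.0, np.nan],
    }
)

cols = ["A", "B", "C", "D"]               # columns taking part in the sum
weights = np.array([1, 1 / 2, -1 / 3, 1])  # one weight per column, same order

# Zero-fill a temporary copy, drop to a plain array, and take the dot product.
test["Sum2"] = test[cols].fillna(0).to_numpy() @ weights

print(test)
```

Selecting cols explicitly keeps the weights aligned with the right columns even when the frame carries extra columns.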
You can also concat the scaled columns and take the sum, which gives you the features offered by sum(), i.e.:
test['Sum2'] = pd.concat([test.A, test.B/2, test.C/(-3), test.D], axis=1).sum(axis=1)
          A         B         C         D      Sum2
0       NaN  0.181923 -0.526074  1.084549  1.350869
1  0.999836       NaN -0.862583 -0.473933  0.813431
2  1.043463  0.252743       NaN -0.863199  0.306635
3 -0.047286  1.432500  0.100041       NaN  0.635616
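One sum() feature that may matter for the grades use case is min_count. As a sketch with made-up data (the third row, a student who attempted nothing, is an assumption added for illustration), min_count=1 leaves the total as NaN when a row has no valid value at all, instead of reporting 0:

```python
import numpy as np
import pandas as pd

# Illustrative data only; the last row has no attempted task at all.
test = pd.DataFrame(
    {
        "A": [np.nan, 1.0, np.nan],
        "B": [2.0, np.nan, np.nan],
        "C": [3.0, 6.0, np.nan],
        "D": [1.0, 1.0, np.nan],
    }
)

# Scale each column first, then let sum() skip the NaN entries row by row.
scaled = pd.concat([test.A, test.B / 2, test.C / (-3), test.D], axis=1)

# min_count=1: require at least one non-NaN value, otherwise the result is NaN.
test["Sum2"] = scaled.sum(axis=1, min_count=1)

print(test)
```

Without min_count, the all-NaN row would sum to 0 and become indistinguishable from a genuine total of zero.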