
Treating NaN as zero in arithmetic operations?

Here's a simple example of the sort of thing I'm wrestling with:

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: test = pd.DataFrame(np.random.randn(4,4),columns=list('ABCD'))
In [4]: for i in range(4):
  ....:    test.iloc[i,i] = np.nan

In [5]: test
Out[5]:
           A         B         C         D
0        NaN  0.136841 -0.854138 -1.890888
1  -1.261724       NaN  0.875647  1.312823
2   1.130999 -0.208402       NaN  0.256644
3  -0.158458 -0.305250  0.902756       NaN 

Now, if I use sum to sum the rows, all the NaN values are treated as zeros:

In [6]: test['Sum'] = test.loc[:,'A':'D'].sum(axis=1)

In [7]: test
Out[7]: 
          A         B         C         D       Sum
0       NaN  0.136841 -0.854138 -1.890888 -2.608185
1 -1.261724       NaN  0.875647  1.312823  0.926745
2  1.130999 -0.208402       NaN  0.256644  1.179241
3 -0.158458 -0.305250  0.902756       NaN  0.439048    

But in my case, I may need to do a bit of work on the values first; for example scaling them:

In [8]: test['Sum2'] = test.A + test.B/2 - test.C/3 + test.D

In [9]: test
Out[9]: 
          A         B         C         D       Sum  Sum2
0       NaN  0.136841 -0.854138 -1.890888 -2.608185   NaN
1 -1.261724       NaN  0.875647  1.312823  0.926745   NaN
2  1.130999 -0.208402       NaN  0.256644  1.179241   NaN
3 -0.158458 -0.305250  0.902756       NaN  0.439048   NaN

As you see, the NaN values carry across into the arithmetic to produce NaN output, which is what you'd expect.

Now, I don't want to replace all NaN values in my dataframe with zeros: it is helpful to me to distinguish between zero and NaN. I could replace NaN with something else: I'm dealing with large volumes of student grades, and I need to distinguish between a grade of zero and a NaN, which at the moment I'm using to indicate that the particular assessment task was not attempted. (It takes the place of what would be a blank cell in a traditional spreadsheet.) But whatever I replace the NaN values with, it needs to be something that can be treated as zero in the operations I may perform. What are my options here?

Use the fillna function:

test['Sum2'] = test.A.fillna(0) + test.B.fillna(0)/2 - test.C.fillna(0)/3 + test.D.fillna(0)
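
A related option, sketched here on a small made-up frame rather than the random data from the question, is to use the arithmetic methods `Series.add` and `Series.sub` with `fill_value=0`. The column values below are hypothetical and chosen only so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one NaN per row, mirroring the layout in the question.
test = pd.DataFrame(
    {
        "A": [np.nan, 1.0, 2.0, 3.0],
        "B": [2.0, np.nan, 4.0, 6.0],
        "C": [3.0, 6.0, np.nan, 9.0],
        "D": [1.0, 1.0, 1.0, np.nan],
    }
)

# fill_value=0 substitutes 0 for a NaN on one side of each operation.
# The original columns are not modified, so the NaN markers are preserved.
sum2 = (
    test.A.add(test.B / 2, fill_value=0)
    .sub(test.C / 3, fill_value=0)
    .add(test.D, fill_value=0)
)
print(sum2.tolist())
```

One caveat: if both operands are missing at the same position, that step still yields NaN, so this is not always identical to calling fillna(0) on every column.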

If the dataframe is not huge you can try:

test["Sum"] = test.sum(axis=1)
test2 = test.fillna(0)
test["Sum2"] = test2.A + test2.B/2 - test2.C/3 + test2.D
del test2

It would be interesting to know whether there is a way to do the second sum in a single line.
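
One possible single-line form, offered as a sketch and not benchmarked here, is to chain `fillna(0)` with `DataFrame.eval`, which evaluates an expression string against the column names. The frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one NaN per row.
test = pd.DataFrame(
    {
        "A": [np.nan, 1.0, 2.0, 3.0],
        "B": [2.0, np.nan, 4.0, 6.0],
        "C": [3.0, 6.0, np.nan, 9.0],
        "D": [1.0, 1.0, 1.0, np.nan],
    }
)

# fillna(0) returns a temporary copy, so the NaN values stay in `test` itself.
test["Sum2"] = test.fillna(0).eval("A + B/2 - C/3 + D")
print(test)
```

The temporary copy is still created, so memory use is the same as with test2; only the explicit variable and the del go away.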

Update

If you have 1e5 rows or fewer, the method I suggested is slightly faster than the one suggested by kmcodes; beyond that, things change.

n = int(1e5)
test = pd.DataFrame(np.random.randn(n,4),columns=list('ABCD'))
for i in range(4):
    test.iloc[i,i] = np.nan

%%timeit
test2 = test.fillna(0)
test["Sum2"] = test2.A + test2.B/2 - test2.C/3 + test2.D
del test2
3.95 ms ± 51.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
test['Sum2'] = test.A.fillna(0) + test.B.fillna(0)/2 - test.C.fillna(0)/3 + test.D.fillna(0)
4.12 ms ± 16.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Update 2

I found this

In your case you can just

weights = [1, 1/2, -1/3, 1]
test["Sum2"] = test.fillna(0).mul(weights).sum(axis=1)

Keep in mind that this seems to be consistently slower than the other two.
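
If the frame has extra columns (such as the Sum column from the question), a plain list of four weights no longer lines up with the columns. A sketch of one way around that, assuming the weights are kept in a Series keyed by column name (the data below is made up):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: four grade columns plus an unrelated extra column.
test = pd.DataFrame(
    {
        "A": [np.nan, 1.0, 2.0, 3.0],
        "B": [2.0, np.nan, 4.0, 6.0],
        "C": [3.0, 6.0, np.nan, 9.0],
        "D": [1.0, 1.0, 1.0, np.nan],
        "Sum": [6.0, 8.0, 7.0, 18.0],
    }
)

# Weights keyed by column label; mul() aligns them with the selected columns.
weights = pd.Series({"A": 1, "B": 1 / 2, "C": -1 / 3, "D": 1})

# sum(axis=1) skips NaN by default, so no fillna(0) is needed here.
test["Sum2"] = test[weights.index].mul(weights).sum(axis=1)
print(test)
```

Selecting `test[weights.index]` first keeps the Sum column out of the calculation, and the labels make it harder to apply a weight to the wrong column.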

You can also concat the scaled columns and take the sum, which gives you the features offered by sum(), i.e.

test['Sum2'] = pd.concat([test.A, test.B/2, test.C/(-3), test.D], axis=1).sum(axis=1)

          A         B         C         D      Sum2
0       NaN  0.181923 -0.526074  1.084549  1.350869
1  0.999836       NaN -0.862583 -0.473933  0.813431
2  1.043463  0.252743       NaN -0.863199  0.306635
3 -0.047286  1.432500  0.100041       NaN  0.635616
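
One of those features may matter for the grades use case: by default, a row where every task is missing sums to 0.0, which looks the same as a genuine zero. A sketch using the `min_count` parameter of sum(), on a made-up frame whose last row has no attempts at all:

```python
import numpy as np
import pandas as pd

# Hypothetical grades; the last row has no attempted tasks.
test = pd.DataFrame(
    {
        "A": [np.nan, 1.0, np.nan],
        "B": [2.0, np.nan, np.nan],
        "C": [3.0, 6.0, np.nan],
        "D": [1.0, 1.0, np.nan],
    }
)

terms = pd.concat([test.A, test.B / 2, test.C / (-3), test.D], axis=1)

# Default behaviour: NaN is skipped, and an all-NaN row sums to 0.0.
default_sum = terms.sum(axis=1)

# min_count=1 requires at least one non-NaN value, otherwise the result is NaN.
strict_sum = terms.sum(axis=1, min_count=1)

print(pd.DataFrame({"default": default_sum, "min_count=1": strict_sum}))
```

Whether an all-missing row should count as zero or stay NaN depends on how the totals are used, so treat this as an option, not a recommendation.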

