Here's a simple example of the sort of thing I'm wrestling with:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: test = pd.DataFrame(np.random.randn(4,4),columns=list('ABCD'))
In [4]: for i in range(4):
   ...:     test.iloc[i,i] = np.nan
In [5]: test
Out[5]:
A B C D
0 NaN 0.136841 -0.854138 -1.890888
1 -1.261724 NaN 0.875647 1.312823
2 1.130999 -0.208402 NaN 0.256644
3 -0.158458 -0.305250 0.902756 NaN
Now, if I use sum to sum the rows, all the NaN values are treated as zeros:
In [6]: test['Sum'] = test.loc[:,'A':'D'].sum(axis=1)
In [7]: test
Out[7]:
A B C D Sum
0 NaN 0.136841 -0.854138 -1.890888 -2.608185
1 -1.261724 NaN 0.875647 1.312823 0.926745
2 1.130999 -0.208402 NaN 0.256644 1.179241
3 -0.158458 -0.305250 0.902756 NaN 0.439048
But in my case, I may need to do a bit of work on the values first; for example scaling them:
In [8]: test['Sum2'] = test.A + test.B/2 - test.C/3 + test.D
In [9]: test
Out[9]:
A B C D Sum Sum2
0 NaN 0.136841 -0.854138 -1.890888 -2.608185 NaN
1 -1.261724 NaN 0.875647 1.312823 0.926745 NaN
2 1.130999 -0.208402 NaN 0.256644 1.179241 NaN
3 -0.158458 -0.305250 0.902756 NaN 0.439048 NaN
As you see, the NaN values carry across into the arithmetic to produce NaN output, which is what you'd expect.
Now, I don't want to replace all NaN values in my dataframe with zeros: it is helpful to me to distinguish between zero and NaN. I could replace NaN with something else. I'm dealing with large volumes of student grades, and I need to distinguish between a grade of zero and a NaN, which at the moment I'm using to indicate that the particular assessment task was not attempted. (It takes the place of what would be a blank cell in a traditional spreadsheet.) But whatever I replace the NaN values with, it needs to be something that can be treated as zero in the operations I may perform. What are my options here?
Use the fillna function:
test['Sum2'] = test.A.fillna(0) + test.B.fillna(0)/2 - test.C.fillna(0)/3 + test.D.fillna(0)
If the dataframe is not huge you can try:
test["Sum"] = test.sum(axis=1)
test2 = test.fillna(0)
test["Sum2"] = test2.A + test2.B/2 - test2.C/3 + test2.D
del test2
It would be interesting to know whether the second sum can be done in one line only.
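One possible one-liner, offered as a sketch rather than a definitive answer: DataFrame.eval evaluates an expression string against the column names, so it can be chained directly onto fillna(0) without a temporary variable. The setup below just rebuilds the example frame from the question.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question: NaN on the diagonal
test = pd.DataFrame(np.random.randn(4, 4), columns=list("ABCD"))
for i in range(4):
    test.iloc[i, i] = np.nan

# fillna(0) returns a copy, so the NaN markers in `test` itself are preserved;
# eval() computes the expression on that copy and returns a Series
test["Sum2"] = test.fillna(0).eval("A + B / 2 - C / 3 + D")
```

The original columns still contain NaN afterwards, so "not attempted" stays distinguishable from a grade of zero.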
Update
If you have 1e5 rows or fewer, the method I suggested is slightly faster than the one suggested by kmcodes; beyond that, things change.
n = int(1e5)
test = pd.DataFrame(np.random.randn(n,4),columns=list('ABCD'))
for i in range(4):
    test.iloc[i,i] = np.nan
%%timeit
test2 = test.fillna(0)
test["Sum2"] = test2.A + test2.B/2 - test2.C/3 + test2.D
del test2
3.95 ms ± 51.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
test['Sum2'] = test.A.fillna(0) + test.B.fillna(0)/2 - test.C.fillna(0)/3 + test.D.fillna(0)
4.12 ms ± 16.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Update 2
I found this. In your case you can simply do:
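weights = [1, 1/2, -1/3, 1]
# select only columns A..D so the four weights line up even if Sum/Sum2 already exist
test["Sum2"] = test[list('ABCD')].fillna(0).mul(weights).sum(axis=1)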
Keep in mind that this seems to be consistently slower than the other two.
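Since the question is about keeping "not attempted" distinct from zero, here is a possible variation on the weighted sum, as a sketch with made-up grades (the `grades` frame and its values are hypothetical). Because sum() skips NaN by default, fillna(0) is not strictly needed here, and the `min_count=1` argument makes a row with no valid values at all come out as NaN instead of 0.

```python
import numpy as np
import pandas as pd

# Hypothetical grades: the last student attempted nothing at all
grades = pd.DataFrame(
    {
        "A": [np.nan, 1.0, np.nan],
        "B": [2.0, np.nan, np.nan],
        "C": [3.0, 3.0, np.nan],
        "D": [4.0, 0.0, np.nan],
    }
)
weights = [1, 1 / 2, -1 / 3, 1]

# mul() applies one weight per column; NaN stays NaN after scaling
scaled = grades.mul(weights)

# sum() skips NaN by default; min_count=1 yields NaN when a row has no valid value
grades["Sum2"] = scaled.sum(axis=1, min_count=1)
```

With this data the first row sums to 4.0, the second to 0.0 (a genuine zero), and the third stays NaN because nothing was attempted.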
You can also concat the scaled columns and take the sum, which gives you the features offered by sum(), i.e.:
test['Sum2'] = pd.concat([test.A, test.B/2, test.C/(-3), test.D], axis=1).sum(axis=1)
A B C D Sum2
0 NaN 0.181923 -0.526074 1.084549 1.350869
1 0.999836 NaN -0.862583 -0.473933 0.813431
2 1.043463 0.252743 NaN -0.863199 0.306635
3 -0.047286 1.432500 0.100041 NaN 0.635616
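As a quick sanity check, the sketch below compares the fillna, weights, and concat approaches on the same frame. The seeded generator and the variable names (`by_fillna`, `by_weights`, `by_concat`) are illustrative assumptions, not part of the original answers.

```python
import numpy as np
import pandas as pd

# Seeded generator so the comparison is reproducible
rng = np.random.default_rng(0)
test = pd.DataFrame(rng.standard_normal((4, 4)), columns=list("ABCD"))
for i in range(4):
    test.iloc[i, i] = np.nan

# Approach 1: fill NaN with zero, then ordinary column arithmetic
filled = test.fillna(0)
by_fillna = filled.A + filled.B / 2 - filled.C / 3 + filled.D

# Approach 2: scale by weights, let sum() skip the NaN values
by_weights = test.mul([1, 1 / 2, -1 / 3, 1]).sum(axis=1)

# Approach 3: concat the scaled columns, then sum across them
by_concat = pd.concat([test.A, test.B / 2, test.C / (-3), test.D], axis=1).sum(axis=1)

# All three should agree up to floating-point rounding
assert np.allclose(by_fillna, by_weights)
assert np.allclose(by_fillna, by_concat)
```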