简体   繁体   中英

Pandas: Sum of the Max 3 Column Values in Each Row

Sample data:

        0        1         2        3       4  Sum_max_3
0  591949  2575703  22479693  2202865  499835   27258261
1    2705    11426    339913     5438    1016     356777
2      18      119      4162       18       0       4299
3     264     1213     14999      246     116      16476
4       0       35      1292       10       0       1337
5       0        0      1442        0       0       1442
6       0       28      5596       20       0       5644
7       0       10       102       56       0        168
8      33        0      1224       17       0       1274
9      39      198      9505       62      35       9765

I want to get the sum of the 3 columns with the largest values in the row. These are different columns for every row (Sum_max_3).

I have many many columns so I need to do this automatically for all of them.

You should use numpy.partition to avoid a full sort. This will reduce the time complexity of finding the three maximum values from O(nlogn) to O(n) , which will make a massive difference on large DataFrames:

np.sum(np.partition(df.values, -3)[:, -3:], 1)

df.assign(max3=np.sum(np.partition(df.values, -3)[:, -3:], 1))

        0        1         2        3       4      max3
0  591949  2575703  22479693  2202865  499835  27258261
1    2705    11426    339913     5438    1016    356777
2      18      119      4162       18       0      4299
3     264     1213     14999      246     116     16476
4       0       35      1292       10       0      1337
5       0        0      1442        0       0      1442
6       0       28      5596       20       0      5644
7       0       10       102       56       0       168
8      33        0      1224       17       0      1274
9      39      198      9505       62      35      9765

Timings

In [411]: df = pd.DataFrame(np.random.rand(5000, 5000))

In [412]: %timeit np.sum(np.sort(df.values)[:,-3:],axis=1)
1.69 s ± 92.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [413]: %timeit np.sum(np.partition(df.values, -3)[:, -3:], 1)
364 ms ± 23.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Use the underlying np array (using df.values ), sort each row using np.sort , and sum the last 3 columns (which will be the 3 maximum values for each row):

>>> df
        0        1         2        3       4
0  591949  2575703  22479693  2202865  499835
1    2705    11426    339913     5438    1016
2      18      119      4162       18       0
3     264     1213     14999      246     116
4       0       35      1292       10       0
5       0        0      1442        0       0
6       0       28      5596       20       0
7       0       10       102       56       0
8      33        0      1224       17       0
9      39      198      9505       62      35


>>> df['Sum_max_3'] = np.sum(np.sort(df.values)[:,-3:],axis=1)

>>> df
        0        1         2        3       4  Sum_max_3
0  591949  2575703  22479693  2202865  499835   27258261
1    2705    11426    339913     5438    1016     356777
2      18      119      4162       18       0       4299
3     264     1213     14999      246     116      16476
4       0       35      1292       10       0       1337
5       0        0      1442        0       0       1442
6       0       28      5596       20       0       5644
7       0       10       102       56       0        168
8      33        0      1224       17       0       1274
9      39      198      9505       62      35       9765

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM