简体   繁体   中英

Efficient way to compute difference of all rows in dask dataframe

I am looking for efficient way to compute difference of all rows in dask dataframe(preferred) Or any efficient way in pandas df.I have huge file with millions of rows, it is taking like forever to compute this.. Below is example:

d = {'A': [5, 9, 15, 0, 2], 'B': [7, 6, 3, 1, 4], 'C': [8, 11, 2, 5, 3]}
df = pd.DataFrame(data=d, index=['i1', 'i2', 'i3', 'i4', 'i5'])
print(df)
tmp = pd.DataFrame(columns=['A', 'B', 'C'])
row_pointer = 1
for index, row in df.iterrows():
    for i, r in df.iloc[row_pointer:].iterrows():
        tmp.loc['(' + index + ' - ' + i + ')'] = df.loc[index] - df.loc[i]
    row_pointer += 1
print(tmp)

Output:

 A  B   C
i1   5  7   8
i2   9  6  11
i3  15  3   2
i4   0  1   5
i5   2  4   3
         A   B   C
(i1 - i2)   -4   1  -3
(i1 - i3)  -10   4   6
(i1 - i4)    5   6   3
(i1 - i5)    3   3   5
(i2 - i3)   -6   3   9
(i2 - i4)    9   5   6
(i2 - i5)    7   2   8
(i3 - i4)   15   2  -3
(i3 - i5)   13  -1  -1
(i4 - i5)   -2  -3   2

Use broadcasting to do all of the subtractions. Then subset after with < to get all unique combinations.

import pandas as pd
import numpy as np

arr = df.to_numpy()
res = pd.DataFrame(np.vstack(arr[:, None]-arr), 
                   index=pd.MultiIndex.from_product([df.index, df.index]))
res = res[res.index.get_level_values(0) < res.index.get_level_values(1)]

print(res)

        0  1  2
i1 i2  -4  1 -3
   i3 -10  4  6
   i4   5  6  3
   i5   3  3  5
i2 i3  -6  3  9
   i4   9  5  6
   i5   7  2  8
i3 i4  15  2 -3
   i5  13 -1 -1
i4 i5  -2 -3  2

Likely not feasible with millions of rows. Alternatively remove one level of the loop:

from itertools import chain 

arr = df.to_numpy()
data = [(arr[i, None]-arr)[i+1:] for i in range(np.shape(arr)[0])]

idx = pd.MultiIndex.from_product([df.index, df.index])
idx = idx[idx.get_level_values(0) < idx.get_level_values(1)]

res = pd.DataFrame(chain.from_iterable(data), index=idx)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM