How to efficiently find matching elements in two dataframes, with the minimal time interval?

Question

Suppose I have two dataframes, w (writes) and r (reads):

w:

time                 address
2018-01-01 00:00:00  8
2018-01-01 01:00:00  2
2018-01-01 02:00:00  5
2018-01-01 03:00:00  3
2018-01-01 04:00:00  4
2018-01-01 04:30:00  5
2018-01-01 06:00:00  5

r:

time                 address
2018-01-01 00:00:00  3
2018-01-01 01:00:00  1
2018-01-01 03:00:00  6
2018-01-01 04:00:00  3
2018-01-01 05:00:00  5

The time column is parsed with pd.to_datetime, format = '%Y-%m-%d %H:%M:%S'.

For each read, I want the time interval (in seconds) between the read and the most recent write to the same address (the write must come before the read). If there is no such write, assign -1.
For this example, I want to get [-1, -1, -1, 3600, 1800].
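
For anyone who wants to reproduce this, the two sample frames can be rebuilt roughly as follows. This is only a sketch of the data shown above; the names fmt and expected are illustrative and not part of my real code.

```python
import pandas as pd

fmt = '%Y-%m-%d %H:%M:%S'  # illustrative name for the format string

# Write log: one row per write, with the address written to.
w = pd.DataFrame({
    'time': pd.to_datetime([
        '2018-01-01 00:00:00', '2018-01-01 01:00:00', '2018-01-01 02:00:00',
        '2018-01-01 03:00:00', '2018-01-01 04:00:00', '2018-01-01 04:30:00',
        '2018-01-01 06:00:00',
    ], format=fmt),
    'address': [8, 2, 5, 3, 4, 5, 5],
})

# Read log: one row per read, with the address read from.
r = pd.DataFrame({
    'time': pd.to_datetime([
        '2018-01-01 00:00:00', '2018-01-01 01:00:00', '2018-01-01 03:00:00',
        '2018-01-01 04:00:00', '2018-01-01 05:00:00',
    ], format=fmt),
    'address': [3, 1, 6, 3, 5],
})

# Desired result, one value per row of r (illustrative name).
expected = [-1, -1, -1, 3600, 1800]
```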

For each read, I scan w backwards to find the matching write, but this is rather slow. Is there a more efficient way to do this? Or should I use a data structure other than a pandas dataframe?

My code is as below:

def time_calcu(w, r):
    # Columns are accessed by position: 0 is time, 1 is address.
    time_deltas = []
    for i in range(len(r)):
        # Scan the writes from latest to earliest.
        for j in range(len(w) - 1, -1, -1):
            if r.iloc[i, 1] == w.iloc[j, 1] and r.iloc[i, 0] > w.iloc[j, 0]:
                t0_t1 = (r.iloc[i, 0] - w.iloc[j, 0]).total_seconds()
                time_deltas.append(t0_t1)
                break
        else:
            # No earlier write to this address was found.
            time_deltas.append(-1)

    return time_deltas

Answer 1

Rename the time columns so both can be told apart after the merge

r = r.rename(columns={'time': 'read'})
w = w.rename(columns={'time': 'write'})

Use merge_asof (both frames must be sorted by their time columns)

m = pd.merge_asof(r, w, left_on='read', right_on='write', by='address')
m.read.sub(m.write).dt.total_seconds().fillna(-1)

0      -1.0
1      -1.0
2      -1.0
3    3600.0
4    1800.0
dtype: float64
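
One edge case may be worth checking. By default merge_asof also matches a write whose timestamp equals the read's, which gives 0 seconds, whereas the question asks for a write strictly before the read. Below is a minimal sketch of a wrapper that passes allow_exact_matches=False for that case. The function name read_write_intervals and its strict parameter are hypothetical, and it assumes both frames have the 'time' and 'address' columns from the question.

```python
import pandas as pd

def read_write_intervals(w, r, strict=True):
    """Seconds from each read back to the latest write of the same address, or -1.

    Results are returned in time-sorted order of the reads.
    """
    # merge_asof requires both frames to be sorted by their time keys.
    left = r.rename(columns={'time': 'read'}).sort_values('read')
    right = w.rename(columns={'time': 'write'}).sort_values('write')
    m = pd.merge_asof(
        left, right,
        left_on='read', right_on='write', by='address',
        allow_exact_matches=not strict,  # strict=True skips writes at the same instant
    )
    # Unmatched reads have write == NaT, which becomes NaN and then -1.
    return (m['read'] - m['write']).dt.total_seconds().fillna(-1).tolist()

# Small made-up sample: the first read happens at the same instant as a write.
w_demo = pd.DataFrame({
    'time': pd.to_datetime(['2018-01-01 00:00:00', '2018-01-01 01:00:00']),
    'address': [3, 5],
})
r_demo = pd.DataFrame({
    'time': pd.to_datetime(['2018-01-01 00:00:00', '2018-01-01 02:00:00',
                            '2018-01-01 03:00:00']),
    'address': [3, 5, 7],
})

strict_result = read_write_intervals(w_demo, r_demo)               # write must be earlier
loose_result = read_write_intervals(w_demo, r_demo, strict=False)  # equal times allowed
```

With strict=True the first read finds no earlier write and gets -1; with strict=False it matches the simultaneous write and gets 0. On the data in the question both settings should agree, since no matching read and write share a timestamp.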

1 answers

solution1
1 2019-07-06 03:58:32
