pandas - merge rows based on column meeting a condition

Question

I'm new to pandas and I don't know the best way to do this.

I have two files which I've placed in two different dataframes:

>> frame1.head()
Out[64]:

    Date and Time           Sample  Unnamed: 2
0   05/18/2017 08:38:37:490 163.7   NaN
1   05/18/2017 08:39:37:490 164.5   NaN
2   05/18/2017 08:40:37:490 148.7   NaN
3   05/18/2017 08:41:37:490 111.2   NaN
4   05/18/2017 08:42:37:490 83.6    NaN


>>frame2.head()
Out[66]:
    Date and Time           Sample  Unnamed: 2
0   05/18/2017 08:38:38:490 7.5     NaN
1   05/18/2017 08:39:38:490 7.5     NaN
2   05/18/2017 08:40:38:490 7.5     NaN
3   05/18/2017 08:41:38:490 7.5     NaN
4   05/18/2017 08:42:38:490 7.5     NaN

I need to "merge" any row from frame 1, with any row in frame 2, that are within one second of each other.

For example, this row from frame 1:

0   05/18/2017 08:38:37:490 163.7   NaN

is within one second of this row from frame 2:

0   05/18/2017 08:38:38:490 7.5 NaN

So when they are "merged", the output should look like this:

0   05/18/2017 08:38:37:490 163.7 7.5 NaN NaN

In other words, one row has its time replaced by the other's, and all of the remaining columns are just appended.

The closest I've come up with is to do something like:

    d3 = pd.merge(frame1, frame2, on='Date and Time (MM/DD/YYYY HH:MM:SS:sss)', how='outer')

>>d3.head()
    Date and Time           Sample_x    Unnamed: 2_x    Sample_y    Unnamed: 2_y
0   05/18/2017 08:38:37:490 163.7   NaN NaN NaN
1   05/18/2017 08:39:37:490 164.5   NaN NaN NaN
2   05/18/2017 08:40:37:490 148.7   NaN NaN NaN
3   05/18/2017 08:41:37:490 111.2   NaN NaN NaN
4   05/18/2017 08:42:37:490 83.6    NaN NaN NaN

But that isn't a conditional merge. I need to merge rows if they are within one second of each other, not only when the times are exactly the same.

I know I can compare the times with something like:

import datetime

def compare_time(current, temp, sec=1):
    # True when the two datetimes are at most `sec` seconds apart
    return abs(current - temp) <= datetime.timedelta(seconds=sec)

then use .apply() or something... but I have no idea how to piece all this together
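Before any time comparison can work, the timestamp strings have to become real datetimes. The following is a minimal sketch, assuming the column is named "Date and Time" as in the printed frames and that the values always follow the MM/DD/YYYY HH:MM:SS:sss layout shown above. The two small frames are hypothetical stand-ins built from the first rows of the question's data.

```python
import pandas as pd

# Hypothetical stand-ins for frame1 / frame2, using values copied from the question.
frame1 = pd.DataFrame({
    "Date and Time": ["05/18/2017 08:38:37:490", "05/18/2017 08:39:37:490"],
    "Sample": [163.7, 164.5],
})
frame2 = pd.DataFrame({
    "Date and Time": ["05/18/2017 08:38:38:490", "05/18/2017 08:39:38:490"],
    "Sample": [7.5, 7.5],
})

# The milliseconds follow a ':' rather than a '.', so pass an explicit format.
fmt = "%m/%d/%Y %H:%M:%S:%f"
for frame in (frame1, frame2):
    frame["Date and Time"] = pd.to_datetime(frame["Date and Time"], format=fmt)

# Vectorised version of compare_time: compares rows at the same position only.
gap = (frame1["Date and Time"] - frame2["Date and Time"]).abs()
within = gap <= pd.Timedelta(seconds=1)
print(within.tolist())
```

This only compares rows that happen to sit at the same position in both frames, so it is a check rather than a merge. Once the column is a datetime, it can be used as the key for pd.merge_asof.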

EDIT: it looks like pd.merge_asof does a good job, but I also need to retain the lines that aren't matched / merged in the final frame as well

EDIT 2:

df1 = pd.DataFrame({ 'datetime':pd.date_range('1-1-2017', periods= 4,freq='s'),
                     'sample':  np.arange(4)+100 })
df2 = pd.DataFrame({ 'datetime':pd.date_range('1-1-2017', periods=4,freq='300ms'),
                     'sample':  np.arange(4) })

# DataFrame.append was removed in pandas 2.0, so pd.concat is used here instead
blah = pd.concat([pd.merge_asof( df2, df1, on='datetime', tolerance=pd.Timedelta('1s') ),
                  df1.rename(columns={'sample':'sample_x'})]).drop_duplicates('sample_x')
blah

returns:

    datetime    sample_x    sample_y
0   2017-01-01 00:00:00.000 0   100.0
1   2017-01-01 00:00:00.300 1   100.0
2   2017-01-01 00:00:00.600 2   100.0
3   2017-01-01 00:00:00.900 3   100.0
0   2017-01-01 00:00:00.000 100 NaN
1   2017-01-01 00:00:01.000 101 NaN
2   2017-01-01 00:00:02.000 102 NaN
3   2017-01-01 00:00:03.000 103 NaN

Notice it's retaining the original row indexes (zero is listed twice).
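One possible way to avoid the repeated index labels is to rebuild the index after combining the frames. This is a sketch using the same sample frames as EDIT 2; it uses pd.concat in place of .append (which no longer exists in pandas 2.0 and later), and the variable names merged, extra, and combined are only illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"datetime": pd.date_range("1-1-2017", periods=4, freq="s"),
                    "sample": np.arange(4) + 100})
df2 = pd.DataFrame({"datetime": pd.date_range("1-1-2017", periods=4, freq="300ms"),
                    "sample": np.arange(4)})

merged = pd.merge_asof(df2, df1, on="datetime", tolerance=pd.Timedelta("1s"))
extra = df1.rename(columns={"sample": "sample_x"})

# Stack the two frames, drop repeated sample_x values, then renumber the rows.
combined = (pd.concat([merged, extra])
              .drop_duplicates("sample_x")
              .reset_index(drop=True))
print(combined)
```

reset_index(drop=True) discards the old labels instead of keeping them as a new column, so the result is numbered 0 through 7.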

Answer 1

You can use merge_asof as @Wen suggests, but be sure to specify the optional tolerance value. Also consider setting the optional direction of the match, which can be 'backward' (default), 'nearest', or 'forward'.

pd.merge_asof( df1, df2, on='datetime', tolerance=pd.Timedelta('1s') )
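To illustrate why direction matters, here is a small sketch with made-up data in which every right-hand timestamp falls slightly after its left-hand partner, as in the question's frames. The frames left and right and the column name reading are hypothetical.

```python
import pandas as pd

# Hypothetical data: each right-hand row is 0.4 s later than a left-hand row.
left = pd.DataFrame({
    "datetime": pd.to_datetime(["2017-01-01 00:00:00.0", "2017-01-01 00:00:10.0"]),
    "sample": [100, 101],
})
right = pd.DataFrame({
    "datetime": pd.to_datetime(["2017-01-01 00:00:00.4", "2017-01-01 00:00:10.4"]),
    "reading": [7, 8],
})

tol = pd.Timedelta("1s")

# Default direction='backward' only looks at right-hand keys at or before the left key.
backward = pd.merge_asof(left, right, on="datetime", tolerance=tol)

# direction='nearest' looks both ways and picks the closest key within the tolerance.
nearest = pd.merge_asof(left, right, on="datetime", tolerance=tol, direction="nearest")

print(backward)
print(nearest)
```

With the default direction, neither left-hand row finds a partner here, because the only earlier right-hand key is more than one second away. With 'nearest', both rows are matched.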

Here's a longer explanation with sample data (Note I'm just creating new sample data since I can only see the first few rows of your actual data):

df1 = pd.DataFrame({ 'datetime':pd.date_range('1-1-2017', periods= 4,freq='s'),
                     'sample':  np.arange(4)+100 })
df2 = pd.DataFrame({ 'datetime':pd.date_range('1-1-2017', periods=4,freq='300ms'),
                     'sample':  np.arange(4) })

df1
Out[208]: 
             datetime  sample
0 2017-01-01 00:00:00     100
1 2017-01-01 00:00:01     101
2 2017-01-01 00:00:02     102
3 2017-01-01 00:00:03     103

df2
Out[209]: 
                 datetime  sample
0 2017-01-01 00:00:00.000       0
1 2017-01-01 00:00:00.300       1
2 2017-01-01 00:00:00.600       2
3 2017-01-01 00:00:00.900       3

pd.merge_asof( df1, df2, on='datetime', tolerance=pd.Timedelta('1s') )
Out[210]: 
             datetime  sample_x  sample_y
0 2017-01-01 00:00:00       100       0.0
1 2017-01-01 00:00:01       101       3.0
2 2017-01-01 00:00:02       102       NaN
3 2017-01-01 00:00:03       103       NaN

Note that merge_asof does a left join so you can get a different answer by changing the order of df1 & df2:

pd.merge_asof( df2, df1, on='datetime', tolerance=pd.Timedelta('1s') )
Out[218]: 
                 datetime  sample_x  sample_y
0 2017-01-01 00:00:00.000         0       100
1 2017-01-01 00:00:00.300         1       100
2 2017-01-01 00:00:00.600         2       100
3 2017-01-01 00:00:00.900         3       100

Edit to add: the docs say merge_asof does a left join by design but it seems to differ from a true left join in that it excludes rows in the left dataframe that don't match. To fix that you could do something like this:

# DataFrame.append was removed in pandas 2.0, so pd.concat is used here instead
pd.concat([pd.merge_asof( df1, df2, on='datetime', tolerance=pd.Timedelta('1s') ),
           df1.rename(columns={'sample':'sample_x'})]).drop_duplicates('sample_x')
Out[236]: 
             datetime  sample_x  sample_y
0 2017-01-01 00:00:00       100       0.0
1 2017-01-01 00:00:01       101       3.0
2 2017-01-01 00:00:02       102       NaN
3 2017-01-01 00:00:03       103       NaN

Note that you may need to adjust drop_duplicates based on whether or not you have a unique index and/or unique columns.
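If the goal is to keep the unmatched rows from both frames, one possible approach is to tag the right-hand rows before merging and then add back any that were never used. This is only a sketch under that assumption; the helper column right_id and the names right, matched, leftover, and result are made up for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"datetime": pd.date_range("1-1-2017", periods=4, freq="s"),
                    "sample": np.arange(4) + 100})
df2 = pd.DataFrame({"datetime": pd.date_range("1-1-2017", periods=4, freq="300ms"),
                    "sample": np.arange(4)})

# Tag each right-hand row so it is possible to tell afterwards which ones matched.
right = df2.assign(right_id=np.arange(len(df2)))

matched = pd.merge_asof(df1, right, on="datetime", tolerance=pd.Timedelta("1s"))

# Right-hand rows whose tag never shows up in the merged result.
used = matched["right_id"].dropna()
leftover = right[~right["right_id"].isin(used)].rename(columns={"sample": "sample_y"})

# Combine, remove the helper column, and put everything in time order.
result = (pd.concat([matched, leftover], ignore_index=True)
            .drop(columns="right_id")
            .sort_values("datetime")
            .reset_index(drop=True))
print(result)
```

Rows that came only from df2 have NaN in sample_x, and rows from df1 with no partner have NaN in sample_y, which is similar in spirit to an outer join on an approximate key.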

solution1
1 ACCEPTED 2017-09-05 17:14:34
