Pandas: How to combine two dataframes by closest index match?

I've got two dataframes, df1 and df2, with indexes of the same type but with few, if any, identical values. The indexes may also contain duplicates. Columns A and B each consist of internally unique values. All indexes and columns are ordered, but not in the same direction: df1.index is descending and df1['A'] is ascending; df2.index is ascending and df2['B'] is descending.

df1: (numbers to the left are the unnamed indexes of the dataframes)

            A
80 -13.545215
76 -12.270691
73 -11.274724
65  -8.280187
38  -7.965972
13  -7.788130
10  -6.690969
6   -5.273063

df2:

            B
8  -13.827641
10 -12.283885
14 -11.459951
62 -11.067622
64 -10.745988
87 -10.661594
95  -9.816053
97  -7.740810

I'd like to combine the dataframes so that each value in df2['B'] is paired with the value of df1['A'] at the nearest corresponding index in df1, giving the desired output:

            B         A
8  -13.827641 -6.690969
10 -12.283885 -6.690969
14 -11.459951 -7.965972
62 -11.067622 -8.280187
64 -10.745988 -8.280187
87 -10.661594  NaN
95  -9.816053  NaN
97  -7.740810  NaN

If the df1 index closest to a df2 index in absolute terms is lower than that df2 index, then the next higher df1 index is the correct match. If a df2 index has no df1 index at or above it, then NaN is the correct match.
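
Put differently, each df2 index should be matched to the smallest df1 index that is greater than or equal to it. A minimal sketch of that lookup on the sample index values, using np.searchsorted (the names df1_idx, df2_idx, pos and matches are only illustrative and not part of the code further down):

```python
import numpy as np

# Index values copied from the sample frames above.
df1_idx = np.array([6, 10, 13, 38, 65, 73, 76, 80])  # df1.index, sorted ascending
df2_idx = np.array([8, 10, 14, 62, 64, 87, 95, 97])  # df2.index

# side="left" gives the position of the first df1 index >= each df2 index,
# so an exact match (10 -> 10) maps to itself.
pos = np.searchsorted(df1_idx, df2_idx, side="left")

# A position equal to len(df1_idx) means no df1 index is high enough: no match.
matches = [int(df1_idx[p]) if p < len(df1_idx) else None for p in pos]

print(list(zip(df2_idx.tolist(), matches)))
# [(8, 10), (10, 10), (14, 38), (62, 65), (64, 65), (87, None), (95, None), (97, None)]
```

The None entries correspond to the NaN rows in the desired output.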

So far, I've used pd.merge() and fillna() to do the necessary analyses. But some may find it "unnatural" to run analyses on interpolated / synthetic data. Anyway, this is how I've been doing it:

Partial code sample for pd.merge() and dropna():

# outer merge
df3 = pd.merge(df1,df2, how = 'outer', left_index = True, right_index = True)
#df4 = df3.interpolate(method = 'linear')[1:]
df4 = df3.interpolate(method = 'linear').dropna()
df4

Output:

            A          B
8   -5.982016 -13.827641
10  -6.690969 -12.283885
13  -7.788130 -11.871918
14  -7.877051 -11.459951
38  -7.965972 -11.263787
62  -8.070710 -11.067622
64  -8.175448 -10.745988
65  -8.280187 -10.729109
73 -11.274724 -10.712230
76 -12.270691 -10.695352
80 -13.545215 -10.678473
87 -13.545215 -10.661594
95 -13.545215  -9.816053
97 -13.545215  -7.740810

Plot:

[image: line plot of df4, columns A and B]

Complete data and code sample

#imports
import numpy as np
import pandas as pd

# Some sample data
np.random.seed(1)
df1_index = sorted(np.random.randint(1,101,8), reverse = True)
df1info = {'A':sorted((np.random.normal(10, 2, 8))*-1)}

df2_index = sorted(np.random.randint(1,101,8))
df2info = {'B':sorted(np.random.normal(10, 2, 8)*-1)}

# Two dataframes
df1 = pd.DataFrame(df1info, index = df1_index)
df2 = pd.DataFrame(df2info, index = df2_index)

# outer merge
df3 = pd.merge(df1,df2, how = 'outer', left_index = True, right_index = True)

# interpolate missing values
df4 = df3.interpolate(method = 'linear').dropna()

# plot
df4.plot()

Thank you for any suggestions!

Edit 1: Duplicate scenario 1:

If df2.index has an exact match in df1.index, and df1.index has a duplicate, then the correct match is the lowest df1.index. I hope that makes sense. If it turns out to be nonsensical for some reason, I'm open to other suggestions!

Not "Pythonic", but an O(n) solution

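# Sort both index lists in place, ascending
df2_index.sort()
df1_index.sort()

a = 0  # position in df1_index
b = 0  # position in df2_index
mapping = [[], []]  # [matched df2 index values, matched df1['A'] values]
while b < len(df2_index) and a < len(df1_index):
    if df1_index[a] == df2_index[b]:
        # exact match: record it and advance both pointers
        mapping[0].append(df2_index[b])
        mapping[1].append(df1.loc[df1_index[a], "A"])
        b += 1
        a += 1
    elif df1_index[a] > df2_index[b]:
        # first df1 index above this df2 index: record it, keep a for the next b
        mapping[0].append(df2_index[b])
        mapping[1].append(df1.loc[df1_index[a], "A"])
        b += 1
    else:
        # df1 index is too low: try the next one
        a += 1

# Unmatched df2 rows get NaN in column A through the outer merge
df = pd.DataFrame({'A': mapping[1]}, index = mapping[0])
df2.merge(df, left_index=True, right_index=True, how='outer')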

Output

            B         A
8  -13.827641 -6.690969
10 -12.283885 -6.690969
14 -11.459951 -7.965972
62 -11.067622 -8.280187
64 -10.745988 -8.280187
87 -10.661594       NaN
95  -9.816053       NaN
97  -7.740810       NaN
  • Both index lists are sorted in ascending order.
  • b points into df2's index (B) and a points into df1's index (A).
  • At any point, for a given b we look for the next a whose index is at least as large, and save it once found.
  • If a == b, the match is exact; we record it and advance both pointers.
  • If a > b, we have found the match for b, so we advance b. We don't advance a because it may also be the match for the next b.
  • If a < b, we advance a, because the lists are sorted and the match for b must lie after the current a.
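
The same forward match can probably also be written with pd.merge_asof, which is not used in the answer above; this is offered only as an alternative sketch. It assumes both indexes are sorted ascending first (merge_asof requires sorted keys) and rebuilds the sample frames from the printed values so that it runs on its own. How duplicated df1 index values should be resolved is not handled here.

```python
import pandas as pd

# Sample frames rebuilt from the values printed in the question.
df1 = pd.DataFrame(
    {"A": [-13.545215, -12.270691, -11.274724, -8.280187,
           -7.965972, -7.788130, -6.690969, -5.273063]},
    index=[80, 76, 73, 65, 38, 13, 10, 6],
)
df2 = pd.DataFrame(
    {"B": [-13.827641, -12.283885, -11.459951, -11.067622,
           -10.745988, -10.661594, -9.816053, -7.740810]},
    index=[8, 10, 14, 62, 64, 87, 95, 97],
)

# direction="forward" pairs each df2 index with the first df1 index that is
# greater than or equal to it; rows with no such df1 index get NaN in A.
result = pd.merge_asof(
    df2.sort_index(),
    df1.sort_index(),
    left_index=True,
    right_index=True,
    direction="forward",
)
print(result)
```

On the sample data this should give the same B / A pairing as the loop above, with NaN in A for indexes 87, 95 and 97.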
