Merge two dataframes based on nearest matches between pairs of column values

I am trying to merge two dataframes based on matches between pairs of column values. However, the column values do not match exactly from one dataframe to the next. The pairs are coordinates in the Swiss coordinate system, but measured from a slightly different reference point in each df.

This Stack Overflow thread, "How to find the distance between 2 points in 2 different dataframes in pandas?", seems to be a related query, but unfortunately I don't fully understand the response.

Example for my data:

import pandas as pd

df1 = pd.DataFrame({'Ecode': [2669827.294, 2669634.483, 2669766.266, 2669960.683],
                    'Ncode': [1261034.528, 1262412.587, 1261209.646, 1262550.374],
                    'shape': ['square', 'square', 'triangle', 'circle']})

df1
     Ecode            Ncode          shape
0   2669827.294     1261034.528     square
1   2669634.483     1262412.587     square
2   2669766.266     1261209.646     triangle
3   2669960.683     1262550.374     circle


df2 = pd.DataFrame({'CoorE': [2669636, 2669765, 2669827, 2669961],
                    'CoorN': [1262413, 1261211, 1261032, 1262550],
                    'color': ['purple', 'blue', 'blue', 'yellow']})

df2
     CoorE       CoorN      color
0   2669636     1262413     purple
1   2669765     1261211     blue
2   2669827     1261032     blue
3   2669961     1262550     yellow

Each set of coordinates has data attached that I would like to compare (e.g. 'shape' and 'color'). My desired outcome matches the column pairs on the closest match:

     CoorE       CoorN      color   shape
0   2669636     1262413     purple  square
1   2669765     1261211     blue    triangle
2   2669827     1261032     blue    square
3   2669961     1262550     yellow  circle

Is there a way to do this? I have tried to use merge_asof but realized it can't key on two variables. I have also seen threads computing this based on latitude and longitude. I can write a function that treats CoorE/CoorN and Ecode/Ncode as x/y coordinates and calculates the distance between a pair of coordinates (there is probably a better way, but I am new to this):

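import math

def calculateDistance(x1, y1, x2, y2):
    # Euclidean distance between the points (x1, y1) and (x2, y2)
    dist = math.sqrt((x2 - x1)**2 + (y2 - y1)**2)
    return dist

# first row of df1 against the third row of df2
print(calculateDistance(2669827.294, 1261034.528, 2669827, 1261032))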

or something like this, but I can't figure out how to use this kind of function to compare and match coordinate pairs from two separate dataframes based on least distance. The real data set also has about 3 million entries, and I'm wondering what the least memory-intensive way to do this would be.

To use libraries to calculate distances, the coordinates need to be in a unified system. From a quick search, I believe you are using epsg:21781.

  1. First, standardise the co-ordinate system using pyproj.
  2. Do a Cartesian product of colors and shapes.
  3. Calculate the distance between each pair using geopy.
  4. You can now select the resulting rows that you want. For the purpose of the example, I've taken the nearest row when grouping by color and shape.

import pandas as pd
import pyproj, geopy.distance
df1 = pd.DataFrame({'Ecode': [2669827.294, 2669634.483, 2669766.266, 2669960.683],
                    'Ncode': [1261034.528, 1262412.587, 1261209.646, 1262550.374],
                    'shape': ['square', 'square', 'triangle', 'circle']})
df2 = pd.DataFrame({'CoorE': [2669636, 2669765, 2669827, 2669961],
                    'CoorN': [1262413, 1261211, 1261032, 1262550],
                    'color': ['purple', 'blue', 'blue', 'yellow']})


# assuming this co-ord system https://epsg.io/21781 then mapping to https://epsg.io/4326
sc = pyproj.Proj("epsg:21781")
dc = pyproj.Proj("epsg:4326")

df1 = df1.assign(
    shape_gps=lambda x: x.apply(lambda r: pyproj.transform(sc, dc, r["Ecode"], r["Ncode"]), axis=1)
)
df2 = df2.assign(
    color_gps=lambda x: x.apply(lambda r: pyproj.transform(sc, dc, r["CoorE"], r["CoorN"]), axis=1)
)

(df1
     .assign(foo=1)
     .merge(df2.assign(foo=1), on="foo")
     .assign(distance=lambda x: x.apply(lambda r: 
                                        geopy.distance.geodesic(r["color_gps"], r["shape_gps"]).km, axis=1))
     .sort_values("distance")
     .groupby(["color", "shape"]).agg({"distance": "first", "CoorE": "first", "CoorN": "first"})
)
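A minimal sketch of going from the Cartesian product to the one-row-per-df2-point table the question asks for. It rests on assumptions that are not part of the answer above: both frames are in the same projected coordinate system with metre units, so plain Euclidean distance on the raw columns is used instead of converting to latitude/longitude, and pandas is at least 1.2 so that `how="cross"` is available. The helper column `df2_row` is a name chosen here for illustration. (The sample eastings and northings have the magnitudes typical of LV95, EPSG:2056, rather than LV03, EPSG:21781, which may be worth checking before any reprojection.)

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Ecode': [2669827.294, 2669634.483, 2669766.266, 2669960.683],
                    'Ncode': [1261034.528, 1262412.587, 1261209.646, 1262550.374],
                    'shape': ['square', 'square', 'triangle', 'circle']})
df2 = pd.DataFrame({'CoorE': [2669636, 2669765, 2669827, 2669961],
                    'CoorN': [1262413, 1261211, 1261032, 1262550],
                    'color': ['purple', 'blue', 'blue', 'yellow']})

# Keep df2's row label as a column (hypothetical name "df2_row") so each
# df2 point can be identified after the cross join.
left = df2.rename_axis("df2_row").reset_index()

# Cartesian product of the two frames (how="cross" needs pandas >= 1.2).
pairs = left.merge(df1, how="cross")

# Straight-line distance in the units of the input coordinates (assumed metres).
pairs["distance"] = np.hypot(pairs["CoorE"] - pairs["Ecode"],
                             pairs["CoorN"] - pairs["Ncode"])

# For every df2 row, keep the pair with the smallest distance.
closest = pairs.loc[pairs.groupby("df2_row")["distance"].idxmin()]
result = closest[["CoorE", "CoorN", "color", "shape"]].reset_index(drop=True)
print(result)
```

The cross join materialises len(df1) × len(df2) rows, so this form is only practical for small inputs such as the sample; it would not fit in memory for the roughly 3 million rows mentioned in the question.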

Updated for nearest merge

If you pick a reference point and calculate each point's distance from it, you can merge on that distance and get what you want.

import pandas as pd
import pyproj, geopy.distance
df1 = pd.DataFrame({'Ecode': [2669827.294, 2669634.483, 2669766.266, 2669960.683],
                    'Ncode': [1261034.528, 1262412.587, 1261209.646, 1262550.374],
                    'shape': ['square', 'square', 'triangle', 'circle']})
df2 = pd.DataFrame({'CoorE': [2669636, 2669765, 2669827, 2669961],
                    'CoorN': [1262413, 1261211, 1261032, 1262550],
                    'color': ['purple', 'blue', 'blue', 'yellow']})


# assuming this co-ord system https://epsg.io/21781 then mapping to https://epsg.io/4326
sc = pyproj.Proj("epsg:21781")
dc = pyproj.Proj("epsg:4326")
# pick a reference point for use in distance calcs
refpoint = pyproj.transform(sc, dc, df1.loc[0, "Ecode"], df1.loc[0, "Ncode"])

df1 = df1.assign(
    shape_gps=lambda x: x.apply(lambda r: pyproj.transform(sc, dc, r["Ecode"], r["Ncode"]), axis=1),
    distance=lambda x: x.apply(lambda r: geopy.distance.geodesic(refpoint, r["shape_gps"]).km, axis=1),
).sort_values("distance")
df2 = df2.assign(
    color_gps=lambda x: x.apply(lambda r: pyproj.transform(sc, dc, r["CoorE"], r["CoorN"]), axis=1),
    distance=lambda x: x.apply(lambda r: geopy.distance.geodesic(refpoint, r["color_gps"]).km, axis=1),
).sort_values("distance")

# no cleanup of columns but this works
pd.merge_asof(df1, df2, on="distance", direction="nearest")
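As an alternative sketch for larger inputs, a spatial index can look up the nearest neighbour in both coordinates at once. This is a swapped-in technique, not the approach of the answer above: distance to a single reference point reduces two coordinates to one number, so two points lying in different directions but equally far from the reference look identical to `merge_asof`. The example below uses `scipy.spatial.cKDTree` and assumes both frames are in the same projected system with metre units; `max_dist` is a hypothetical tolerance introduced here for illustration.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

df1 = pd.DataFrame({'Ecode': [2669827.294, 2669634.483, 2669766.266, 2669960.683],
                    'Ncode': [1261034.528, 1262412.587, 1261209.646, 1262550.374],
                    'shape': ['square', 'square', 'triangle', 'circle']})
df2 = pd.DataFrame({'CoorE': [2669636, 2669765, 2669827, 2669961],
                    'CoorN': [1262413, 1261211, 1261032, 1262550],
                    'color': ['purple', 'blue', 'blue', 'yellow']})

# Build the index once over df1's points.
tree = cKDTree(df1[["Ecode", "Ncode"]].to_numpy())

# For every df2 point: Euclidean distance to, and position of, the nearest df1 point.
dist, pos = tree.query(df2[["CoorE", "CoorN"]].to_numpy(), k=1)

# Attach the matched shape (and the match distance) to df2 by position.
result = df2.assign(shape=df1["shape"].to_numpy()[pos], distance=dist)

# Hypothetical tolerance: discard matches farther apart than max_dist
# (same units as the input coordinates, assumed metres).
max_dist = 5.0
result = result[result["distance"] <= max_dist]
print(result)
```

Only the two coordinate arrays, the tree, and one distance/position pair per df2 row are held in memory, so no n × m table of candidate pairs is built.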
