Merge two dataframes based on nearest matches between pairs of column values
I am trying to merge two dataframes based on matches between pairs of column values. However, the column values are not exactly the same from one dataframe to the next. The pairs are coordinates in the Swiss coordinate system, but measured from a slightly different reference point in each df.
The Stack Overflow thread "How to find the distance between 2 points in 2 different dataframes in pandas?" seems to be a related query, but unfortunately I don't fully understand the response.
Example for my data:
df1 = pd.DataFrame({'Ecode': [2669827.294, 2669634.483, 2669766.266, 2669960.683],
'Ncode': [1261034.528, 1262412.587, 1261209.646, 1262550.374],
'shape': ['square', 'square', 'triangle', 'circle']})
df1
Ecode Ncode shape
0 2669827.294 1261034.528 square
1 2669634.483 1262412.587 square
2 2669766.266 1261209.646 triangle
3 2669960.683 1262550.374 circle
df2 = pd.DataFrame({'CoorE': [2669636, 2669765, 2669827, 2669961],
'CoorN': [1262413, 1261211, 1261032, 1262550],
'color': ['purple', 'blue', 'blue', 'yellow']})
df2
CoorE CoorN color
0 2669636 1262413 purple
1 2669765 1261211 blue
2 2669827 1261032 blue
3 2669961 1262550 yellow
I have data located with both sets of coordinates that I would like to compare (e.g. 'shape' and 'color'). My desired outcome matches the column pairs on the closest match:
CoorE CoorN color shape
0 2669636 1262413 purple square
1 2669765 1261211 blue triangle
2 2669827 1261032 blue square
3 2669961 1262550 yellow circle
Is there a way to do this? I have tried to use merge_asof but realized it can't key on two variables. I have also seen threads computing this based on latitude and longitude. I can write a function that treats CoorE/CoorN and Ecode/Ncode as x/y coordinates and calculates the distance between a pair of coordinates (there is probably a better way, but I am new to this):
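import math

def calculateDistance(x1, y1, x2, y2):
    # Euclidean distance between the points (x1, y1) and (x2, y2)
    dist = math.sqrt((x2 - x1)**2 + (y2 - y1)**2)
    return dist

# e.g. first df1 point against third df2 point
print(calculateDistance(2669827.294, 1261034.528, 2669827, 1261032))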
or something like this, but I can't figure out how to use this kind of function to compare and match coordinate pairs from two separate dataframes based on least distance. The real data set also has about 3 million entries, and I'm wondering what the least memory-intensive way to do this would be.
To use libraries to calculate distances, the coordinates need to be in a unified system. From a Google search, I believe you are using epsg:21781.
pyproj to standardize the coordinate system
geopy to calculate the distances between these
import pandas as pd
import pyproj, geopy.distance
df1 = pd.DataFrame({'Ecode': [2669827.294, 2669634.483, 2669766.266, 2669960.683],
'Ncode': [1261034.528, 1262412.587, 1261209.646, 1262550.374],
'shape': ['square', 'square', 'triangle', 'circle']})
df2 = pd.DataFrame({'CoorE': [2669636, 2669765, 2669827, 2669961],
'CoorN': [1262413, 1261211, 1261032, 1262550],
'color': ['purple', 'blue', 'blue', 'yellow']})
# assuming this co-ord system https://epsg.io/21781 then mapping to https://epsg.io/4326
sc = pyproj.Proj("epsg:21781")
dc = pyproj.Proj("epsg:4326")
df1 = df1.assign(
shape_gps=lambda x: x.apply(lambda r: pyproj.transform(sc, dc, r["Ecode"], r["Ncode"]), axis=1)
)
df2 = df2.assign(
color_gps=lambda x: x.apply(lambda r: pyproj.transform(sc, dc, r["CoorE"], r["CoorN"]), axis=1)
)
(df1
.assign(foo=1)
.merge(df2.assign(foo=1), on="foo")
.assign(distance=lambda x: x.apply(lambda r:
geopy.distance.geodesic(r["color_gps"], r["shape_gps"]).km, axis=1))
.sort_values("distance")
.groupby(["color","shape"]).agg({"distance":"first","CoorE":"first","CoorN":"first"})
)
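The cross join above produces one row per (df1, df2) pair, so a final step is still needed to keep only the closest df1 row for each df2 row. The following is a minimal sketch of that step, not the answer's own method. It assumes both frames are already in the same projected, metre-based system, so plain Euclidean distance is used instead of a geodesic one. The column names `key`, `df2_row` and `dist` are introduced here purely for illustration. Because it materializes len(df1) × len(df2) rows, it is only practical for small inputs.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Ecode': [2669827.294, 2669634.483, 2669766.266, 2669960.683],
                    'Ncode': [1261034.528, 1262412.587, 1261209.646, 1262550.374],
                    'shape': ['square', 'square', 'triangle', 'circle']})
df2 = pd.DataFrame({'CoorE': [2669636, 2669765, 2669827, 2669961],
                    'CoorN': [1262413, 1261211, 1261032, 1262550],
                    'color': ['purple', 'blue', 'blue', 'yellow']})

# Cross join: every df2 row paired with every df1 row.
# df2_row (hypothetical helper) remembers which df2 row each pair came from.
pairs = (df2.assign(key=1, df2_row=np.arange(len(df2)))
            .merge(df1.assign(key=1), on="key")
            .drop(columns="key"))

# Planar distance in the units of the coordinate system (assumed to be metres).
pairs["dist"] = np.hypot(pairs["CoorE"] - pairs["Ecode"],
                         pairs["CoorN"] - pairs["Ncode"])

# For each df2 row, keep the pair with the smallest distance.
nearest = pairs.loc[pairs.groupby("df2_row")["dist"].idxmin()]
result = nearest[["CoorE", "CoorN", "color", "shape"]].reset_index(drop=True)
print(result)
```

On the sample data this should reproduce the desired table from the question, with one shape per df2 row.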
If you pick a reference point to calculate distances, you get what you want.
import pandas as pd
import pyproj, geopy.distance
df1 = pd.DataFrame({'Ecode': [2669827.294, 2669634.483, 2669766.266, 2669960.683],
'Ncode': [1261034.528, 1262412.587, 1261209.646, 1262550.374],
'shape': ['square', 'square', 'triangle', 'circle']})
df2 = pd.DataFrame({'CoorE': [2669636, 2669765, 2669827, 2669961],
'CoorN': [1262413, 1261211, 1261032, 1262550],
'color': ['purple', 'blue', 'blue', 'yellow']})
# assuming this co-ord system https://epsg.io/21781 then mapping to https://epsg.io/4326
sc = pyproj.Proj("epsg:21781")
dc = pyproj.Proj("epsg:4326")
# pick a reference point for use in distance calcs
refpoint = pyproj.transform(sc, dc, df1.loc[0, "Ecode"], df1.loc[0, "Ncode"])
df1 = df1.assign(
shape_gps=lambda x: x.apply(lambda r: pyproj.transform(sc, dc, r["Ecode"], r["Ncode"]), axis=1),
distance=lambda x: x.apply(lambda r: geopy.distance.geodesic(refpoint, r["shape_gps"]).km, axis=1),
).sort_values("distance")
df2 = df2.assign(
color_gps=lambda x: x.apply(lambda r: pyproj.transform(sc, dc, r["CoorE"], r["CoorN"]), axis=1),
distance=lambda x: x.apply(lambda r: geopy.distance.geodesic(refpoint, r["color_gps"]).km, axis=1),
).sort_values("distance")
# no cleanup of columns but this works
pd.merge_asof(df1, df2, on="distance", direction="nearest")
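For the roughly 3 million rows mentioned in the question, a spatial index may be a lighter alternative, since it avoids building a pairwise table. The sketch below uses `scipy.spatial.cKDTree`, which is a different technique from the reference-point merge above. It assumes both frames share one projected, metre-based system, so the E/N values can be compared directly without reprojection. The seven-digit values in the sample look more like Swiss LV95 (EPSG:2056) than LV03 (EPSG:21781), but that is a guess from the numbers alone. Matching on 2D position also sidesteps a limitation of reducing each point to a single distance from a reference point, where two different locations that happen to be equally far from the reference cannot be told apart. The output column name `dist` is introduced here for illustration.

```python
import pandas as pd
from scipy.spatial import cKDTree

df1 = pd.DataFrame({'Ecode': [2669827.294, 2669634.483, 2669766.266, 2669960.683],
                    'Ncode': [1261034.528, 1262412.587, 1261209.646, 1262550.374],
                    'shape': ['square', 'square', 'triangle', 'circle']})
df2 = pd.DataFrame({'CoorE': [2669636, 2669765, 2669827, 2669961],
                    'CoorN': [1262413, 1261211, 1261032, 1262550],
                    'color': ['purple', 'blue', 'blue', 'yellow']})

# Build the tree once on the df1 coordinates (planar, assumed metres).
tree = cKDTree(df1[["Ecode", "Ncode"]].to_numpy())

# For every df2 point, find the distance to, and position of, its nearest df1 point.
dist, pos = tree.query(df2[["CoorE", "CoorN"]].to_numpy(), k=1)

# Attach the matched df1 attribute and the match distance to df2.
matched = df2.assign(shape=df1["shape"].to_numpy()[pos], dist=dist)
print(matched)
```

Each df2 row gets exactly one nearest df1 row, and the same df1 row can be chosen more than once. Inspecting the `dist` column is one way to spot matches that are too far apart to be trusted; `query` also accepts a `distance_upper_bound` argument that could serve the same purpose.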