Merge dataframes by closest coordinates

Imagine we have two dataframes with coordinates ['X','Y']:

df1:

 X            Y          House №
2531        2016           175
2219        2196           11
2901        3426           201
6901        4431           46
7891        1126           89

df2:

 X            Y      Delivery office №
2534        2019            O1
6911        4421            O2
2901        3426            O3
7894.5      1120            O4 

My idea is to merge them and get:

df3:

 X            Y          House №    Delivery office №
2531        2016           175            O1
2219        2196           11             NA
2901        3426           201            O3
6901        4431           46             O2
7891        1126           89             O4

So we want to perform a 'fuzzy' merge within a threshold (this parameter should be supplied by the user). You can see that house number 11 did not get a delivery office number because it is located too far away from all of the offices in df2.

So I need each row from df2 to 'find' its closest row in df1 and add its 'Cost' value to it. The usual built-in pd.merge does not work here, and neither do the custom packages that implement fuzzy matching, because they work on string values using Levenshtein distance and so on.
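To make the limitation concrete, here is a minimal sketch of an exact merge on both coordinates. The data is retyped from the tables above; the column names `House` and `Office` are shortened stand-ins chosen for this illustration, not the original headers.

```python
import pandas as pd

# Data retyped from the question's tables; 'House' and 'Office' are
# illustrative column names. X is written as float in both frames so the
# merge keys share a dtype.
df1 = pd.DataFrame({'X': [2531.0, 2219.0, 2901.0, 6901.0, 7891.0],
                    'Y': [2016, 2196, 3426, 4431, 1126],
                    'House': [175, 11, 201, 46, 89]})
df2 = pd.DataFrame({'X': [2534.0, 6911.0, 2901.0, 7894.5],
                    'Y': [2019, 4421, 3426, 1120],
                    'Office': ['O1', 'O2', 'O3', 'O4']})

# An exact left merge pairs only rows whose X and Y are identical,
# so offices that are merely nearby are not picked up.
exact = pd.merge(df1, df2, on=['X', 'Y'], how='left')
print(exact)
```

Only the house at (2901, 3426) shares its exact coordinates with an office, so it is the only row that receives a match.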

There is no silver bullet, but one way to do this is to turn the Y values into categories using pd.cut. This method places the values into different bins. You need to tune the number of bins manually, for example setting it to 20.
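As a quick illustration of what the binning does, the sketch below runs pd.cut on the Y values of df1 and also asks for the bin edges. The variable names `y`, `codes` and `edges` are chosen here for illustration only.

```python
import pandas as pd

# Y values of df1 from the question.
y = pd.Series([2016, 2196, 3426, 4431, 1126])

# labels=False returns the integer bin index instead of an Interval label;
# retbins=True additionally returns the bin edges that pandas computed.
codes, edges = pd.cut(y, 20, labels=False, retbins=True)

print(codes.tolist())  # one bin index per Y value
print(len(edges))      # 20 bins are delimited by 21 edges
```

With an integer bin count the edges are derived from the minimum and maximum of the input, so the bin width, and therefore the effective matching tolerance, depends on the data range as well as on the number of bins.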

Load the data:

import pandas as pd

df1 = pd.DataFrame({'X':[2531, 2219, 2901, 6901, 7891], 'Y':[2016, 2196, 3426, 4431, 1126], 'House':['A', 'B', 'J', 'A', 'A']})

df2 = pd.DataFrame({'X':[2534, 6911, 2901, 7894.5], 'Y':[2019, 4421, 3426, 1120], 'Cost':[1200, 3100, 800, 600]})

Make new categories:

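# Derive one set of bin edges from both Y columns so the bin numbers are comparable across the two frames.

_, edges = pd.cut(pd.concat([df1['Y'], df2['Y']]), 20, retbins=True)

df1['Y2'] = pd.cut(df1['Y'], edges, labels=False)

df2['Y2'] = pd.cut(df2['Y'], edges, labels=False)

# Rows that fall into the same Y bin are paired; the overlapping X and Y columns get _x/_y suffixes.

df3 = pd.merge(df1, df2, on=['Y2'], how='left')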
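Note that binning only looks at Y, and two nearby points can still land on opposite sides of a bin boundary. As a different technique from the pd.cut approach above, here is a minimal sketch of a nearest-neighbour merge with a user-supplied distance threshold, which is what the question asks for. The function name `merge_nearest`, its parameters, and the `Office` column name are hypothetical, introduced only for this example; it assumes Euclidean distance and a non-empty right-hand frame.

```python
import numpy as np
import pandas as pd

def merge_nearest(left, right, threshold, on=('X', 'Y')):
    """Hypothetical helper: left-join each row of `left` to its nearest row
    in `right` (Euclidean distance over the `on` columns), keeping the match
    only when that distance is at most `threshold`."""
    cols = list(on)
    a = left[cols].to_numpy(dtype=float)
    b = right[cols].to_numpy(dtype=float)
    # Pairwise distance matrix of shape (len(left), len(right)).
    dist = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2))
    nearest = dist.argmin(axis=1)
    within = dist[np.arange(len(a)), nearest] <= threshold
    # Use -1 as a "no match" label; reindexing on a missing label yields NaN.
    idx = np.where(within, nearest, -1)
    matched = right.drop(columns=cols).reset_index(drop=True).reindex(idx)
    out = left.copy()
    for col in matched.columns:
        out[col] = matched[col].to_numpy()
    return out

# Data retyped from the question; 'House' and 'Office' are illustrative names.
houses = pd.DataFrame({'X': [2531, 2219, 2901, 6901, 7891],
                       'Y': [2016, 2196, 3426, 4431, 1126],
                       'House': [175, 11, 201, 46, 89]})
offices = pd.DataFrame({'X': [2534, 6911, 2901, 7894.5],
                        'Y': [2019, 4421, 3426, 1120],
                        'Office': ['O1', 'O2', 'O3', 'O4']})

result = merge_nearest(houses, offices, threshold=20)
print(result)
```

With a threshold of 20, house 11 is left without an office because its nearest office is several hundred units away, while the other four houses are matched. The full distance matrix grows with len(left) × len(right), so for large inputs a spatial index such as scipy.spatial.cKDTree would likely scale better; that variant is not shown here.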
