
How to find the closest matching rows between two dataframes that have no direct join columns?

For each set of coordinates in a pyspark dataframe, I need to find the closest set of coordinates in another dataframe.

I have one pyspark dataframe with coordinate data like so (dataframe a):

    +------------------+-------------------+
    |      latitude_deg|      longitude_deg|
    +------------------+-------------------+
    |    40.07080078125| -74.93360137939453|
    |         38.704022|        -101.473911|
    |       59.94919968|     -151.695999146|
    | 34.86479949951172| -86.77030181884766|
    |           35.6087|         -91.254898|
    |        34.9428028|        -97.8180194|
    +------------------+-------------------+

And another like so (dataframe b; only a few rows are shown for clarity):

    +-----+------------------+-------------------+
    |ident|      latitude_deg|      longitude_deg|
    +-----+------------------+-------------------+
    |  00A|    30.07080078125| -24.93360137939453|
    | 00AA|         56.704022|        -120.473911|
    | 00AK|       18.94919968|     -109.695999146|
    | 00AL| 76.86479949951172| -67.77030181884766|
    | 00AR|           10.6087|         -87.254898|
    | 00AS|        23.9428028|        -10.8180194|
    +-----+------------------+-------------------+

Is it possible to somehow merge the dataframes so that, for each row in dataframe a, the result has the closest ident from dataframe b:

    +------------------+-------------------+-------------+
    |      latitude_deg|      longitude_deg|closest_ident|
    +------------------+-------------------+-------------+
    |    40.07080078125| -74.93360137939453|          12A|
    |         38.704022|        -101.473911|         14BC|
    |       59.94919968|     -151.695999146|         278A|
    | 34.86479949951172| -86.77030181884766|         56GH|
    |           35.6087|         -91.254898|         09HJ|
    |        34.9428028|        -97.8180194|         09BV|
    +------------------+-------------------+-------------+

What I have tried so far:

I have a pyspark UDF defined that calculates the haversine distance between two pairs of coordinates:

    udf_get_distance = F.udf(get_distance)
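
For reference, since get_distance itself is not shown here, a minimal haversine implementation it might wrap could look like the following sketch (returning kilometres; the actual function may differ):

    import math

    def get_distance(lat_a, lon_a, lat_b, lon_b):
        # Haversine distance in kilometres between two (lat, lon) points.
        # Sketch only -- the real get_distance is not shown in the question.
        r = 6371.0  # mean Earth radius, km
        phi_a = math.radians(float(lat_a))
        phi_b = math.radians(float(lat_b))
        d_phi = math.radians(float(lat_b) - float(lat_a))
        d_lam = math.radians(float(lon_b) - float(lon_a))
        h = (math.sin(d_phi / 2) ** 2
             + math.cos(phi_a) * math.cos(phi_b) * math.sin(d_lam / 2) ** 2)
        return 2 * r * math.asin(math.sqrt(h))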

It works like this:

    df = df.withColumn("ABS_DISTANCE", udf_get_distance(
        df.latitude_deg_a, df.longitude_deg_a,
        df.latitude_deg_b, df.longitude_deg_b))

I'd appreciate any kind of help. Thanks so much.

You need to do a crossJoin first. Something like this:

    joined_df = source_df1.crossJoin(source_df2)
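
One thing the snippet glosses over: both dataframes have columns named latitude_deg and longitude_deg, so a plain crossJoin gives ambiguous column names. A sketch of the rename step (assuming your dataframes are the source_df1/source_df2 above) so the joined columns carry the _a/_b suffixes used below:

    # Rename before joining so every column is unambiguous afterwards
    source_df1 = (source_df1
        .withColumnRenamed("latitude_deg", "latitude_deg_a")
        .withColumnRenamed("longitude_deg", "longitude_deg_a"))
    source_df2 = (source_df2
        .withColumnRenamed("latitude_deg", "latitude_deg_b")
        .withColumnRenamed("longitude_deg", "longitude_deg_b"))

    joined_df = source_df1.crossJoin(source_df2)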

Then you can call your UDF as you mentioned, generate a row number over each group ordered by distance, and keep only the closest row. Keep in mind that a cross join produces one row for every pair, so it can get expensive when both dataframes are large.

    from pyspark.sql.functions import row_number
    from pyspark.sql.window import Window

    rwindow = Window.partitionBy("latitude_deg_a", "longitude_deg_a").orderBy("ABS_DISTANCE")

    udf_result_df = (joined_df
        .withColumn("ABS_DISTANCE", udf_get_distance(
            joined_df.latitude_deg_a, joined_df.longitude_deg_a,
            joined_df.latitude_deg_b, joined_df.longitude_deg_b))
        .withColumn("rownum", row_number().over(rwindow))
        .filter("rownum = 1"))

Note: add a return type to your UDF (for example DoubleType()); F.udf defaults to StringType, and ordering a string column sorts the distances lexicographically rather than numerically.
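
Putting it together, a complete sketch under the assumptions above (joined_df built from the renamed dataframes, and get_distance as sketched in the question; the names are placeholders from this thread, not a tested API):

    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType
    from pyspark.sql.window import Window

    # Declaring the return type keeps ABS_DISTANCE numeric, so orderBy
    # sorts by distance rather than lexicographically
    udf_get_distance = F.udf(get_distance, DoubleType())

    # One group per row of dataframe a, ordered by distance to each b row
    rwindow = (Window.partitionBy("latitude_deg_a", "longitude_deg_a")
                     .orderBy("ABS_DISTANCE"))

    result = (joined_df
        .withColumn("ABS_DISTANCE", udf_get_distance(
            "latitude_deg_a", "longitude_deg_a",
            "latitude_deg_b", "longitude_deg_b"))
        .withColumn("rownum", F.row_number().over(rwindow))
        .filter("rownum = 1")
        .selectExpr("latitude_deg_a AS latitude_deg",
                    "longitude_deg_a AS longitude_deg",
                    "ident AS closest_ident"))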
