
Pyspark dataframes left join with conditions (spatial join)

I use PySpark and I have created (from txt files) two DataFrames:

import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
import pandas as pd
sc = spark.sparkContext
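
A minimal sketch of how two such DataFrames might be loaded; the file names, separator, and header below are assumptions, not the actual ones:

# Hypothetical file names and layout; adjust to the real txt format.
df1 = spark.read.csv("points1.txt", header=True, inferSchema=True)
df2 = spark.read.csv("points2.txt", header=True, inferSchema=True)
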
+---+--------------------+------------------+-------------------+
| id|                name|               lat|                lon|
+---+--------------------+------------------+-------------------+
|  1|                 ...|               ...|                ...|
+---+--------------------+------------------+-------------------+

+---+--------------------+------------------+-------------------+
| id|                name|               lat|                lon|
+---+--------------------+------------------+-------------------+
|  1|                 ...|               ...|                ...|
+---+--------------------+------------------+-------------------+

What I want, using Spark techniques, is to get every pair of items from the two DataFrames whose Euclidean distance is below a certain value (let's say 0.5). Like:

record1, record2

or any similar form; the exact output format is not the point.

Any help will be appreciated, thank you.

Since Spark does not include any built-in provisions for geospatial computations, you need a user-defined function that computes the geospatial distance between two points, for example by using the haversine formula:

from math import radians, cos, sin, asin, sqrt
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

@udf(returnType=FloatType())
def haversine(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points, in kilometers.
    R = 6372.8  # Earth radius in kilometers

    dLat = radians(lat2 - lat1)
    dLon = radians(lon2 - lon1)
    lat1 = radians(lat1)
    lat2 = radians(lat2)

    a = sin(dLat/2)**2 + cos(lat1)*cos(lat2)*sin(dLon/2)**2
    c = 2*asin(sqrt(a))

    return R * c
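
To sanity-check the UDF before wiring it into a join, you can evaluate it on a single literal pair; the coordinates below are the classic Nashville-to-Los Angeles test pair for the haversine formula:

from pyspark.sql.functions import lit

spark.range(1).select(
    haversine(lit(36.12), lit(-86.67), lit(33.94), lit(-118.40)).alias("km")
).show()
# expect roughly 2887.26 (km) with R = 6372.8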

Then you simply perform a cross join conditioned on the result of calling haversine() (here with a 100 km threshold):

df1.join(df2, haversine(df1.lat, df1.lon, df2.lat, df2.lon) < 100, 'cross') \
   .select(df1.name, df2.name)

You need a cross join, since Spark cannot push the Python UDF into the join itself; the predicate can only be applied after the Cartesian product has been formed. That's expensive, but it is something that PySpark users have to live with.
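
If the UDF overhead matters, the same formula can be expressed entirely with Spark's built-in column functions, so the predicate is evaluated in the JVM rather than in a Python worker (the helper name haversine_native is mine; the join is still a Cartesian comparison):

from pyspark.sql import functions as F

def haversine_native(lat1, lon1, lat2, lon2):
    # Same haversine formula as the UDF above, built from Spark column
    # expressions so no rows have to be shipped to a Python worker.
    d_lat = F.radians(lat2 - lat1)
    d_lon = F.radians(lon2 - lon1)
    a = (F.sin(d_lat / 2) ** 2
         + F.cos(F.radians(lat1)) * F.cos(F.radians(lat2)) * F.sin(d_lon / 2) ** 2)
    return 6372.8 * 2 * F.asin(F.sqrt(a))

df1.join(df2, haversine_native(df1.lat, df1.lon, df2.lat, df2.lon) < 100, 'cross') \
   .select(df1.name.alias("name1"), df2.name.alias("name2"))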

Here is an example:

>>> df1.show()
+---------+-------------------+--------------------+
|      lat|                lon|                name|
+---------+-------------------+--------------------+
|37.776181|-122.41341399999999|AAE SSFF European...|
|38.959716|        -119.945595|Ambassador Motor ...|
| 37.66169|        -121.887367|Alameda County Fa...|
+---------+-------------------+--------------------+
>>> df2.show()
+------------------+-------------------+-------------------+
|               lat|                lon|               name|
+------------------+-------------------+-------------------+
|       34.19198813|-118.93756299999998|Daphnes Greek Cafe1|
|         37.755557|-122.25036084651899|Daphnes Greek Cafe2|
|38.423435999999995|         -121.41361|       Laguna Pizza|
+------------------+-------------------+-------------------+
>>> df1.join(df2, haversine(df1.lat, df1.lon, df2.lat, df2.lon) < 100, 'cross') \
       .select(df1.name.alias("name1"), df2.name.alias("name2")).show()
+--------------------+-------------------+
|               name1|              name2|
+--------------------+-------------------+
|AAE SSFF European...|Daphnes Greek Cafe2|
|Alameda County Fa...|Daphnes Greek Cafe2|
|Alameda County Fa...|       Laguna Pizza|
+--------------------+-------------------+
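
If the Cartesian comparison becomes the bottleneck at scale, a common workaround is to bucket points into a coarse grid and equi-join on the cell key, so that only nearby candidates reach the exact distance test. A sketch under the assumptions noted in the comments (cell, with_cell, cx and cy are names I introduce):

from pyspark.sql import functions as F

# Cell side in degrees; it must be at least as large as the search radius
# expressed in degrees at the latitudes involved. 2.0 is a deliberately
# conservative assumption for a ~100 km radius; tune it for your data.
cell = 2.0

def with_cell(df):
    # Tag each point with the grid cell that contains it.
    return (df.withColumn("cx", F.floor(df.lon / cell))
              .withColumn("cy", F.floor(df.lat / cell)))

a = with_cell(df1).alias("a")

# Duplicate each right-hand point into its own cell plus the 8 neighbours,
# so any pair within one cell size of each other shares a (cx, cy) key.
b = (with_cell(df2)
     .withColumn("cx", F.explode(F.array(F.col("cx") - 1, F.col("cx"), F.col("cx") + 1)))
     .withColumn("cy", F.explode(F.array(F.col("cy") - 1, F.col("cy"), F.col("cy") + 1)))
     .alias("b"))

# Equi-join on the cell key, then keep only pairs that pass the exact test.
pairs = (a.join(b, ["cx", "cy"])
          .filter(haversine(F.col("a.lat"), F.col("a.lon"),
                            F.col("b.lat"), F.col("b.lon")) < 100)
          .select(F.col("a.name").alias("name1"), F.col("b.name").alias("name2")))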
