Pyspark dataframes left join with conditions (spatial join)
I am using pyspark and have created two dataframes (from txt files):
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
import pandas as pd
sc = spark.sparkContext
+---+--------------------+------------------+-------------------+
| id| name| lat| lon|
+---+--------------------+------------------+-------------------+
| 1|.
.
.
+---+-------------------+------------------+-------------------+
| id| name| lat| lon|
+---+-------------------+------------------+-------------------+
| 1||
.
.
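What I want is to use Spark to get every pair of items across the two dataframes whose Euclidean distance is below some value (say "0.5"). Like:
record1, record2
Or any form like this; the exact format is not the issue.
Any help would be appreciated, thanks.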
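Since Spark does not include any provisions for geospatial computations, you need a user-defined function that computes the geospatial distance between two points, for example by using the haversine formula (from here):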
from math import radians, cos, sin, asin, sqrt
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

@udf(returnType=FloatType())
def haversine(lat1, lon1, lat2, lon2):
    # Great-circle distance in km between two (lat, lon) points given in degrees
    R = 6372.8  # Earth radius in km
    dLat = radians(lat2 - lat1)
    dLon = radians(lon2 - lon1)
    lat1 = radians(lat1)
    lat2 = radians(lat2)
    a = sin(dLat/2)**2 + cos(lat1)*cos(lat2)*sin(dLon/2)**2
    c = 2*asin(sqrt(a))
    return R * c
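To sanity-check the formula without a Spark session, the same arithmetic can be run as a plain Python function. The sketch below is only an illustration and not part of the original answer: the names `haversine_km` and `EARTH_RADIUS_KM` are hypothetical, and the radius is the same 6372.8 km constant used in the UDF.

```python
from math import radians, cos, sin, asin, sqrt

EARTH_RADIUS_KM = 6372.8  # assumed: same constant as in the UDF

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    d_lat = radians(lat2 - lat1)
    d_lon = radians(lon2 - lon1)
    a = (sin(d_lat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(d_lon / 2) ** 2)
    return EARTH_RADIUS_KM * 2 * asin(sqrt(a))

# One degree of latitude along a meridian is R * pi / 180, roughly 111.2 km.
one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)
print(round(one_degree, 1))  # 111.2
```

Because the function returns kilometers, any threshold compared against it is also in kilometers, not in degrees of latitude or longitude.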
Then you simply perform a cross join conditioned on the result of calling haversine():
df1.join(df2, haversine(df1.lat, df1.lon, df2.lat, df2.lon) < 100, 'cross') \
    .select(df1.name, df2.name)
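You need a cross join because Spark cannot embed the Python UDF in the join itself. That is expensive, but it is something PySpark users have to live with.
Here is an example: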
>>> df1.show()
+---------+-------------------+--------------------+
| lat| lon| name|
+---------+-------------------+--------------------+
|37.776181|-122.41341399999999|AAE SSFF European...|
|38.959716| -119.945595|Ambassador Motor ...|
| 37.66169| -121.887367|Alameda County Fa...|
+---------+-------------------+--------------------+
>>> df2.show()
+------------------+-------------------+-------------------+
| lat| lon| name|
+------------------+-------------------+-------------------+
| 34.19198813|-118.93756299999998|Daphnes Greek Cafe1|
| 37.755557|-122.25036084651899|Daphnes Greek Cafe2|
|38.423435999999995| -121.41361| Laguna Pizza|
+------------------+-------------------+-------------------+
>>> df1.join(df2, haversine(df1.lat, df1.lon, df2.lat, df2.lon) < 100, 'cross') \
...     .select(df1.name.alias("name1"), df2.name.alias("name2")).show()
+--------------------+-------------------+
| name1| name2|
+--------------------+-------------------+
|AAE SSFF European...|Daphnes Greek Cafe2|
|Alameda County Fa...|Daphnes Greek Cafe2|
|Alameda County Fa...| Laguna Pizza|
+--------------------+-------------------+
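The result above can be cross-checked without Spark. The following is a minimal sketch under stated assumptions: it copies the six example rows into plain Python lists (names kept truncated exactly as `show()` prints them), enumerates every pair with `itertools.product`, which mirrors what the cross join does, and keeps the pairs closer than 100 km. The helpers `haversine_km` and `close_pairs` are hypothetical names introduced only for this illustration.

```python
from itertools import product
from math import radians, cos, sin, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (Earth radius 6372.8 km, inputs in degrees)."""
    d_lat = radians(lat2 - lat1)
    d_lon = radians(lon2 - lon1)
    a = (sin(d_lat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(d_lon / 2) ** 2)
    return 6372.8 * 2 * asin(sqrt(a))

# Rows copied from the df1 / df2 example output: (lat, lon, name)
rows1 = [
    (37.776181, -122.41341399999999, "AAE SSFF European..."),
    (38.959716, -119.945595, "Ambassador Motor ..."),
    (37.66169, -121.887367, "Alameda County Fa..."),
]
rows2 = [
    (34.19198813, -118.93756299999998, "Daphnes Greek Cafe1"),
    (37.755557, -122.25036084651899, "Daphnes Greek Cafe2"),
    (38.423435999999995, -121.41361, "Laguna Pizza"),
]

def close_pairs(left, right, max_km):
    """Return (name1, name2) for every pair of rows closer than max_km."""
    return [
        (name1, name2)
        for (lat1, lon1, name1), (lat2, lon2, name2) in product(left, right)
        if haversine_km(lat1, lon1, lat2, lon2) < max_km
    ]

for name1, name2 in close_pairs(rows1, rows2, 100):
    print(name1, "|", name2)
```

Like the cross join, this compares all len(rows1) * len(rows2) pairs, so it is only practical as a check on small samples.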