
alternative of pyspark inner join to compare two dataframes in pyspark

I have two dataframes in PySpark. As shown below, df1 holds the entire set of lat/long readings coming from a sensor. The second dataframe, df2, is a subset of the first in which the lat/long values were rounded to 2 decimal places and duplicates were removed, keeping only the unique lat_long data points.

df1:

+-----------------+---------+-----+--------------------+----------+------------+
|              UID|    label|value|            datetime|  latitude|   longitude|
+-----------------+---------+-----+--------------------+----------+------------+
|    1B0545GD6546Y|     evnt| 3644|2020-06-08T23:32:...|40.1172005|-105.0823546|
|    1B0545GD6FG67|     evnt| 3644|2020-06-08T23:32:...|40.1172201|-105.0821007|
|    15GD6546YFG67|     evnt| 3644|2020-06-08T23:32:...|40.1172396|-105.0818468|
|    1BGD6546YFG67|     evnt| 3644|2020-06-08T23:32:...|40.1172613|-105.0815929|
|    1BGD6546YFG67|     evnt| 3644|2020-06-08T23:32:...|40.1172808|-105.0813368|
|    1B054546YFG67|     evnt| 3644|2020-06-08T23:32:...|40.1173003|-105.0810742|
|    1B056546YFG67|     evnt| 3644|2020-06-08T23:32:...| 40.117322|-105.0808073|

df2:

+-------+--------+----------------+--------------+                              
|new_lat|new_long|        lat_long|    State_name|
+-------+--------+----------------+--------------+
|  40.13|  -105.1|[40.13, -105.1] |      Colorado|
|  40.15| -105.11|[40.15, -105.11]|      Colorado|
|  40.12| -105.07|[40.12, -105.07]|      Colorado|
|  40.13| -104.99|[40.13, -104.99]|      Colorado|
|  40.15| -105.09|[40.15, -105.09]|      Colorado|
|  40.15| -105.13|[40.15, -105.13]|      Colorado|
|  40.12| -104.94|[40.12, -104.94]|      Colorado|

So df2 has far fewer rows than df1. In df2 I applied a UDF to compute the State_name.

Now I want to populate the state name in df1. Since df2's lat_long values are rounded to 2 decimal places, I match them using a threshold, with a join operation like the one below.

threshold = 0.01

df4 = df1.join(df2)\
        .filter(df2.new_lat - threshold < df1.latitude)\
        .filter(df1.latitude < df2.new_lat + threshold)

Is there any other, more efficient way to achieve the same result? Because the join operation performs a Cartesian product, it takes a long time and a huge number of tasks.

Consider that my df1 will have 1000 billion records.

Any help would be highly appreciated.

Whenever you join a big DataFrame with a smaller DataFrame, you should always try to perform a broadcast join.

If df2 is small enough to be broadcasted, then df1.join(broadcast(df2)) will be way more performant.
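As a side note, Spark can also broadcast a join side automatically: any table smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by default) is broadcast without the explicit hint. A minimal config sketch, assuming an active SparkSession named spark:

```python
# Spark auto-broadcasts any join side smaller than this threshold
# (default 10485760 bytes = 10 MB). Raising it lets a somewhat larger
# df2 still be broadcast automatically; the explicit broadcast() hint
# bypasses the size check entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))
```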

The second argument to the join() method should be the join condition.

from pyspark.sql.functions import broadcast, lit

def approx_equal(col1, col2, threshold):
    # Column expression: True when the two values differ by less than the threshold
    return abs(col1 - col2) < threshold

threshold = lit(0.01)

# Broadcast the small df2 and join on the approximate-match condition.
# Note: PySpark column expressions are combined with &, not &&.
df4 = df1.join(
    broadcast(df2),
    approx_equal(df2.new_lat, df1.latitude, threshold)
    & approx_equal(df2.new_long, df1.longitude, threshold),
)
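Since abs(col1 - col2) < threshold is equally valid on plain Python floats, the tolerance logic can be sanity-checked without Spark. A standalone sketch using sample values from the df1 and df2 tables above:

```python
def approx_equal(a, b, threshold):
    # True when the two values differ by less than the threshold
    return abs(a - b) < threshold

# Rounded df2 values vs. the raw sensor reading in df1's first row
print(approx_equal(40.12, 40.1172005, 0.01))      # latitudes differ by ~0.0028 -> True
print(approx_equal(-105.07, -105.0823546, 0.01))  # longitudes differ by ~0.0124 -> False
```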

EDIT: I added the approx_equal function to quinn, so your code can be more concise:

import quinn as Q
from pyspark.sql.functions import broadcast, lit

threshold = lit(0.01)

df4 = df1.join(
    broadcast(df2),
    Q.approx_equal(df2.new_lat, df1.latitude, threshold)
    & Q.approx_equal(df2.new_long, df1.longitude, threshold),
)
