
Alternative to PySpark inner join to compare two dataframes in PySpark

I have two dataframes in PySpark. As shown below, df1 holds the entire lat/long stream coming from a sensor. The second dataframe, df2, is a subset of the first, where the lat/long values were rounded to 2 decimal places and duplicates were removed to keep only the unique lat_long data points.

df1:

+-----------------+---------+-----+--------------------+----------+------------+
|              UID|    label|value|            datetime|  latitude|   longitude|
+-----------------+---------+-----+--------------------+----------+------------+
|    1B0545GD6546Y|evnt     | 3644|2020-06-08T23:32:...|40.1172005|-105.0823546|
|    1B0545GD6FG67|evnt     | 3644|2020-06-08T23:32:...|40.1172201|-105.0821007|
|    15GD6546YFG67|evnt     | 3644|2020-06-08T23:32:...|40.1172396|-105.0818468|
|    1BGD6546YFG67|evnt     | 3644|2020-06-08T23:32:...|40.1172613|-105.0815929|
|    1BGD6546YFG67|evnt     | 3644|2020-06-08T23:32:...|40.1172808|-105.0813368|
|    1B054546YFG67|evnt     | 3644|2020-06-08T23:32:...|40.1173003|-105.0810742|
|    1B056546YFG67|evnt     | 3644|2020-06-08T23:32:...| 40.117322|-105.0808073|
+-----------------+---------+-----+--------------------+----------+------------+

df2:

+-------+--------+----------------+--------------+                              
|new_lat|new_long|        lat_long|    State_name|
+-------+--------+----------------+--------------+
|  40.13|  -105.1|[40.13, -105.1] |      Colorado|
|  40.15| -105.11|[40.15, -105.11]|      Colorado|
|  40.12| -105.07|[40.12, -105.07]|      Colorado|
|  40.13| -104.99|[40.13, -104.99]|      Colorado|
|  40.15| -105.09|[40.15, -105.09]|      Colorado|
|  40.15| -105.13|[40.15, -105.13]|      Colorado|
|  40.12| -104.94|[40.12, -104.94]|      Colorado|
+-------+--------+----------------+--------------+

So df2 has far fewer rows than the first one. On df2 I applied a UDF to calculate the state name.
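For context, here is a minimal sketch of how df2 could have been built from df1, assuming a round/dropDuplicates pipeline and a placeholder state-lookup UDF (the actual UDF logic is not shown in the question):

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# placeholder UDF; the real one would reverse-geocode the rounded point to a state name
get_state_name = F.udf(lambda lat, lon: "Colorado", StringType())

df2 = (df1
    .withColumn("new_lat", F.round("latitude", 2))
    .withColumn("new_long", F.round("longitude", 2))
    .dropDuplicates(["new_lat", "new_long"])
    .withColumn("lat_long", F.array("new_lat", "new_long"))
    .withColumn("State_name", get_state_name("new_lat", "new_long"))
    .select("new_lat", "new_long", "lat_long", "State_name"))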

Now I want to populate the state name in df1. Since df2's lat_long values are rounded to 2 decimal places, I match them against df1 using a threshold, with a join operation like the one below.

threshold = 0.01

df4 = df1.join(df2)\
        .filter(df2.new_lat - threshold < df1.latitude)\
        .filter(df1.latitude < df2.new_lat + threshold)

Is there a more efficient way to achieve this? The join operation is doing a Cartesian product, which takes a long time and spawns a huge number of tasks.

Consider that my df1 can have 1000 billion records.

Any help would be highly appreciated.

Whenever you join a big DataFrame with a smaller DataFrame, you should always try to perform a broadcast join.

If df2 is small enough to be broadcasted, then df1.join(broadcast(df2)) will be way more performant.

The second argument to the join() method should be the join condition.

from pyspark.sql.functions import broadcast, lit

def approx_equal(col1, col2, threshold):
    return abs(col1 - col2) < threshold

threshold = lit(0.01)

df4 = df1.join(
    broadcast(df2),
    approx_equal(df2.new_lat, df1.latitude, threshold) & approx_equal(df2.new_long, df1.longitude, threshold)
)
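Because the join condition is a range comparison rather than an equality, Spark should plan this as a broadcast nested loop join instead of a full Cartesian product. A quick way to check (a sketch, assuming the df4 defined above) is to inspect the physical plan:

df4.explain()
# expect to see BroadcastNestedLoopJoin / BroadcastExchange in the physical plan
# rather than CartesianProduct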

EDIT: I added the approx_equal function to quinn, so your code can be more concise:

import quinn as Q

threshold = lit(0.01)

df4 = df1.join(
    broadcast(df2),
    Q.approx_equal(df2.new_lat, df1.latitude, threshold) & Q.approx_equal(df2.new_long, df1.longitude, threshold)
)
