Join two dataframes based on a condition using Spark / Java

I have 3 dataframes in Spark: dataframe1, dataframe2 and dataframe3.

I want to join dataframe1 with another dataframe based on a condition.

I use the following code:

Dataset <Row> df= dataframe1.filter(when(col("diffDate").lt(3888),dataframe1.join(dataframe2,
            dataframe2.col("id_device").equalTo(dataframe1.col("id_device")).
            and(dataframe2.col("id_vehicule").equalTo(dataframe1.col("id_vehicule"))).
            and(dataframe2.col("tracking_time").lt(dataframe1.col("tracking_time")))).orderBy(dataframe2.col("tracking_time").desc())).
                   otherwise(dataframe1.join(dataframe3,
                   dataframe3.col("id_device").equalTo(dataframe1.col("id_device")).
                           and(dataframe3.col("id_vehicule").equalTo(dataframe1.col("id_vehicule"))).
                           and(dataframe3.col("tracking_time").lt(dataframe1.col("tracking_time")))).orderBy(dataframe3.col("tracking_time").desc())));

But I get this exception:

Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Dataset

EDIT

Input dataframes:

dataframe1

+-----------+-------------+-------------+-------------+
|diffDate   |id_device    |id_vehicule  |tracking_time|
+-----------+-------------+-------------+-------------+
|222        |1            |5            |2020-05-30   |
|4700       |8            |9            |2019-03-01   |
+-----------+-------------+-------------+-------------+

dataframe2

+-----------+-------------+-------------+-------------+
|id_device  |id_vehicule  |tracking_time|longitude    |
+-----------+-------------+-------------+-------------+
|1          |5            |2020-05-12   |33.21111     |
|8          |9            |2019-03-01   |20.2222      |
+-----------+-------------+-------------+-------------+

dataframe3

+-----------+-------------+-------------+-------------+
|id_device  |id_vehicule  |tracking_time|latitude     |
+-----------+-------------+-------------+-------------+
|1          |5            |2020-05-12   |40.333       |
|8          |9            |2019-02-28   |2.00000      |
+-----------+-------------+-------------+-------------+

Expected result when diffDate < 3888:

+-----------+-------------+-------------+-------------+-----------+-------------+-------------+-------------+
|diffDate   |id_device    |id_vehicule  |tracking_time|id_device  |id_vehicule  |tracking_time|longitude    |
+-----------+-------------+-------------+-------------+-----------+-------------+-------------+-------------+
|222        |1            |5            |2020-05-30   |1          |5            |2020-05-12   |33.21111     |
+-----------+-------------+-------------+-------------+-----------+-------------+-------------+-------------+

Expected result when diffDate > 3888:

+-----------+-------------+-------------+-------------+-----------+-------------+-------------+-------------+
|diffDate   |id_device    |id_vehicule  |tracking_time|id_device  |id_vehicule  |tracking_time|latitude     |
+-----------+-------------+-------------+-------------+-----------+-------------+-------------+-------------+
|4700       |8            |9            |2019-03-01   |8          |9            |2019-02-28   |2.00000      |
+-----------+-------------+-------------+-------------+-----------+-------------+-------------+-------------+

I need your help.

Thank you.

I think you need to revisit your code.

You are trying to execute a join for each row of dataframe1 (depending on the condition), which I think is either an incorrect or a misunderstood requirement.

The when(condition, then).otherwise(else) function is evaluated for each row of the underlying dataframe and is generally used to compute a column value based on a condition. Its then and otherwise clauses only accept existing columns of the dataframe or literals of primitive/complex types. You can't put a dataframe, or any operation that returns a dataframe, there; that is why Spark reports "Unsupported literal type class org.apache.spark.sql.Dataset".

Maybe your requirement is to join dataframe1 with dataframe2 for the rows where col("diffDate").lt(3888). To achieve this, you can do the following:

dataframe1.join(dataframe2,
                dataframe2.col("id_device").equalTo(dataframe1.col("id_device")).
                        and(dataframe2.col("id_vehicule").equalTo(dataframe1.col("id_vehicule"))).
                        and(dataframe2.col("tracking_time").lt(dataframe1.col("tracking_time"))).
                        and(dataframe1.col("diffDate").lt(3888))
                )
                        .orderBy(dataframe2.col("tracking_time").desc())

Edit-1


        // Rows with diffDate < 3888: longitude comes from dataframe2, latitude stays null.
        dataframe1.as("a").join(dataframe2.as("b"),
                dataframe2.col("id_device").equalTo(dataframe1.col("id_device")).
                        and(dataframe2.col("id_vehicule").equalTo(dataframe1.col("id_vehicule"))).
                        and(dataframe2.col("tracking_time").lt(dataframe1.col("tracking_time"))).
                        and(dataframe1.col("diffDate").lt(3888))
        ).selectExpr("a.*", "b.longitude", "null as latitude")
                .unionByName(
                        // Rows with diffDate >= 3888: latitude comes from dataframe3, longitude stays null.
                        dataframe1.as("a").join(dataframe3.as("c"),
                                dataframe3.col("id_device").equalTo(dataframe1.col("id_device")).
                                        and(dataframe3.col("id_vehicule").equalTo(dataframe1.col("id_vehicule"))).
                                        and(dataframe3.col("tracking_time").lt(dataframe1.col("tracking_time"))).
                                        and(dataframe1.col("diffDate").geq(3888))
                        ).selectExpr("a.*", "c.latitude", "null as longitude")
                )
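
To check the split-and-union logic against the sample data without a Spark cluster, the same rules can be modelled in plain Java. This is only an illustrative sketch, not Spark code: the class and field names (ConditionalJoinModel, Position, Reading, Joined) are hypothetical, and nested loops stand in for the two joins and the union.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of the two conditional joins; all names are hypothetical, not Spark API.
class ConditionalJoinModel {

    // One row of dataframe1.
    static final class Position {
        final int diffDate, idDevice, idVehicule;
        final LocalDate trackingTime;
        Position(int diffDate, int idDevice, int idVehicule, String trackingTime) {
            this.diffDate = diffDate;
            this.idDevice = idDevice;
            this.idVehicule = idVehicule;
            this.trackingTime = LocalDate.parse(trackingTime);
        }
    }

    // One row of dataframe2 (value = longitude) or dataframe3 (value = latitude).
    static final class Reading {
        final int idDevice, idVehicule;
        final LocalDate trackingTime;
        final double value;
        Reading(int idDevice, int idVehicule, String trackingTime, double value) {
            this.idDevice = idDevice;
            this.idVehicule = idVehicule;
            this.trackingTime = LocalDate.parse(trackingTime);
            this.value = value;
        }
    }

    // One output row: the dataframe1 row plus the coordinate its branch supplies.
    static final class Joined {
        final Position left;
        final Double longitude, latitude; // null when the branch does not supply it
        Joined(Position left, Double longitude, Double latitude) {
            this.left = left;
            this.longitude = longitude;
            this.latitude = latitude;
        }
    }

    // Same join condition as in the answer: equal ids and a strictly earlier tracking_time.
    static boolean matches(Position p, Reading r) {
        return p.idDevice == r.idDevice
                && p.idVehicule == r.idVehicule
                && r.trackingTime.isBefore(p.trackingTime);
    }

    static List<Joined> conditionalJoin(List<Position> df1, List<Reading> df2,
                                        List<Reading> df3, int threshold) {
        List<Joined> out = new ArrayList<>();
        // Branch 1: rows with diffDate < threshold are joined with dataframe2.
        for (Position p : df1) {
            if (p.diffDate >= threshold) continue;
            for (Reading r : df2) {
                if (matches(p, r)) out.add(new Joined(p, r.value, null));
            }
        }
        // Branch 2: rows with diffDate >= threshold are joined with dataframe3.
        for (Position p : df1) {
            if (p.diffDate < threshold) continue;
            for (Reading r : df3) {
                if (matches(p, r)) out.add(new Joined(p, null, r.value));
            }
        }
        return out;
    }

    // Sample rows copied from the question's input tables.
    static List<Joined> sample() {
        List<Position> df1 = Arrays.asList(
                new Position(222, 1, 5, "2020-05-30"),
                new Position(4700, 8, 9, "2019-03-01"));
        List<Reading> df2 = Arrays.asList(
                new Reading(1, 5, "2020-05-12", 33.21111),
                new Reading(8, 9, "2019-03-01", 20.2222));
        List<Reading> df3 = Arrays.asList(
                new Reading(1, 5, "2020-05-12", 40.333),
                new Reading(8, 9, "2019-02-28", 2.0));
        return conditionalJoin(df1, df2, df3, 3888);
    }

    public static void main(String[] args) {
        for (Joined j : sample()) {
            System.out.println(j.left.diffDate + " -> longitude=" + j.longitude
                    + ", latitude=" + j.latitude);
        }
    }
}
```

On the sample data this yields two rows: the diffDate 222 row picks up the longitude from dataframe2, and the diffDate 4700 row picks up the latitude from dataframe3, which matches the expected results in the question. Note that, like the Spark snippet, the model returns every earlier match rather than only the most recent one.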
