Correct join of DataFrame in Spark?
I am new to the Spark framework and need some help!
Assume that the first DataFrame (df1) stores the times at which users access a call center.
+---------+-------------------+
|USER_NAME| REQUEST_DATE|
+---------+-------------------+
| Mark|2018-02-20 00:00:00|
| Alex|2018-03-01 00:00:00|
| Bob|2018-03-01 00:00:00|
| Mark|2018-07-01 00:00:00|
| Kate|2018-07-01 00:00:00|
+---------+-------------------+
The second DataFrame stores information about whether a person is a member of the organization. OUT means that the user has left the organization; IN means that the user has joined it. START_DATE and END_DATE mark the beginning and end of the corresponding process. (I sketch this reading in code right after the table below.)
For example, you can see that Alex left the organization at 2018-01-01 00:00:00, and this process ended at 2018-02-01 00:00:00. Note that a single user can join and leave the organization at different times, as Mark does.
+---------+---------------------+---------------------+--------+
|NAME | START_DATE | END_DATE | STATUS |
+---------+---------------------+---------------------+--------+
| Alex| 2018-01-01 00:00:00 | 2018-02-01 00:00:00 | OUT |
| Bob| 2018-02-01 00:00:00 | 2018-02-05 00:00:00 | IN |
| Mark| 2018-02-01 00:00:00 | 2018-03-01 00:00:00 | IN |
| Mark| 2018-05-01 00:00:00 | 2018-08-01 00:00:00 | OUT |
| Meggy| 2018-02-01 00:00:00 | 2018-02-01 00:00:00 | OUT |
+---------+---------------------+---------------------+--------+
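To pin down how I read a single history record, here is the same idea as plain Scala values (a sketch only; the MembershipEvent type and its field names are mine, invented for illustration):

case class MembershipEvent(
  name:   String,
  start:  String, // START_DATE: when the IN/OUT process began
  end:    String, // END_DATE: when the IN/OUT process finished
  status: String  // "IN" = joined the organization, "OUT" = left it
)

// Alex began leaving on 2018-01-01; the departure completed on 2018-02-01.
val alexLeft =
  MembershipEvent("Alex", "2018-01-01 00:00:00", "2018-02-01 00:00:00", "OUT")

// Mark joined and later left, so he has two records at different times.
val markHistory = Seq(
  MembershipEvent("Mark", "2018-02-01 00:00:00", "2018-03-01 00:00:00", "IN"),
  MembershipEvent("Mark", "2018-05-01 00:00:00", "2018-08-01 00:00:00", "OUT")
)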
In the end, I'm trying to get a DataFrame like the one below. It must contain all records from the first DataFrame, plus a column indicating whether the person is a member of the organization at the time of the request (REQUEST_DATE). A plain-Scala sketch of the rule I have in mind follows the table.
+---------+-------------------+----------------+
|USER_NAME| REQUEST_DATE| USER_STATUS |
+---------+-------------------+----------------+
| Mark|2018-02-20 00:00:00| Our user |
| Alex|2018-03-01 00:00:00| Not our user |
| Bob|2018-03-01 00:00:00| Our user |
| Mark|2018-07-01 00:00:00| Not our user |
| Kate|2018-07-01 00:00:00| No Information |
+---------+-------------------+----------------+
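If I read my expected output correctly, the rule is: for each request, take the history record with the greatest START_DATE that is not after REQUEST_DATE, and map its STATUS (IN gives "Our user", OUT gives "Not our user", no record at all gives "No Information"). As a plain, non-Spark sketch over in-memory data (the Process type and statusAt are illustrative names, not part of my real code):

case class Process(start: String, end: String, status: String)

// "yyyy-MM-dd HH:mm:ss" strings compare correctly as plain strings,
// so lexicographic comparison stands in for date comparison here.
def statusAt(requestDate: String, history: Seq[Process]): String = {
  val latest = history
    .filter(_.start <= requestDate) // only processes that have started
    .sortBy(_.start)
    .lastOption                     // the most recent one wins
  latest.map(_.status) match {
    case Some("IN")  => "Our user"
    case Some("OUT") => "Not our user"
    case _           => "No Information"
  }
}

// Mark on 2018-07-01: his OUT record (started 2018-05-01) is the latest,
// so the result is "Not our user", matching the table above.
statusAt("2018-07-01 00:00:00", Seq(
  Process("2018-02-01 00:00:00", "2018-03-01 00:00:00", "IN"),
  Process("2018-05-01 00:00:00", "2018-08-01 00:00:00", "OUT")
))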
CODE:
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions._
import spark.implicits._ // from the SparkSession; already in scope in spark-shell

val df1: DataFrame = Seq(
  ("Mark", "2018-02-20 00:00:00"),
  ("Alex", "2018-03-01 00:00:00"),
  ("Bob",  "2018-03-01 00:00:00"),
  ("Mark", "2018-07-01 00:00:00"),
  ("Kate", "2018-07-01 00:00:00")
).toDF("USER_NAME", "REQUEST_DATE")
df1.show()

val df2: DataFrame = Seq(
  ("Alex",  "2018-01-01 00:00:00", "2018-02-01 00:00:00", "OUT"),
  ("Bob",   "2018-02-01 00:00:00", "2018-02-05 00:00:00", "IN"),
  ("Mark",  "2018-02-01 00:00:00", "2018-03-01 00:00:00", "IN"),
  ("Mark",  "2018-05-01 00:00:00", "2018-08-01 00:00:00", "OUT"),
  ("Meggy", "2018-02-01 00:00:00", "2018-02-01 00:00:00", "OUT")
).toDF("NAME", "START_DATE", "END_DATE", "STATUS")
df2.show()

case class UserAndRequest(
  USER_NAME:    String,
  REQUEST_DATE: java.sql.Timestamp,
  START_DATE:   java.sql.Timestamp,
  END_DATE:     java.sql.Timestamp,
  STATUS:       String,
  REQUEST_ID:   Long
)

// Tag each request with a unique id, left-join the membership history by
// name, and cast the string columns to timestamps so that .as[] succeeds
// (Spark refuses to up-cast a string column to a date/timestamp field).
val joined: Dataset[UserAndRequest] = df1
  .withColumn("REQUEST_ID", monotonically_increasing_id())
  .withColumn("REQUEST_DATE", $"REQUEST_DATE".cast("timestamp"))
  .join(
    df2.withColumn("START_DATE", $"START_DATE".cast("timestamp"))
       .withColumn("END_DATE", $"END_DATE".cast("timestamp")),
    $"USER_NAME" === $"NAME", "left")
  .as[UserAndRequest]

// For each request, keep the latest process that started on or before the
// request date. Users with no history (e.g. Kate) form single-row groups,
// so the reducer is never called on their null dates.
def startedBeforeRequest(r: UserAndRequest): Boolean =
  r.START_DATE != null && !r.START_DATE.after(r.REQUEST_DATE)

val lastRowByRequestId = joined
  .groupByKey(_.REQUEST_ID)
  .reduceGroups { (x, y) =>
    if (!startedBeforeRequest(y)) x
    else if (!startedBeforeRequest(x)) y
    else if (x.START_DATE.after(y.START_DATE)) x else y
  }
  .map(_._2)

def logic(status: String): String = {
  if (status == "IN") "Our user"
  else if (status == "OUT") "Not our user"
  else "No Information" // also covers the null STATUS of unmatched users
}

val logicUDF = udf(logic _)

// The UDF must look at STATUS (not REQUEST_DATE) to classify the user.
val finalDF = lastRowByRequestId
  .withColumn("USER_STATUS", logicUDF($"STATUS"))
  .select($"USER_NAME", $"REQUEST_DATE", $"USER_STATUS")
which yields: