Correct join of DataFrame in Spark?
I am new to the Spark framework and need some help!
Assume that the first DataFrame (df1) stores the times at which users access a call center.
+---------+-------------------+
|USER_NAME| REQUEST_DATE|
+---------+-------------------+
| Mark|2018-02-20 00:00:00|
| Alex|2018-03-01 00:00:00|
| Bob|2018-03-01 00:00:00|
| Mark|2018-07-01 00:00:00|
| Kate|2018-07-01 00:00:00|
+---------+-------------------+
The second DataFrame stores information about whether a person is a member of the organization. OUT means that the user has left the organization; IN means that the user has joined it. START_DATE and END_DATE mark the beginning and end of the corresponding process. (I sketch this reading in code right after the table below.)
For example, you can see that Alex left the organization at 2018-01-01 00:00:00, and this process ended at 2018-02-01 00:00:00. Note that a single user can join and leave the organization at different times, as Mark does.
+---------+---------------------+---------------------+--------+
|NAME | START_DATE | END_DATE | STATUS |
+---------+---------------------+---------------------+--------+
| Alex| 2018-01-01 00:00:00 | 2018-02-01 00:00:00 | OUT |
| Bob| 2018-02-01 00:00:00 | 2018-02-05 00:00:00 | IN |
| Mark| 2018-02-01 00:00:00 | 2018-03-01 00:00:00 | IN |
| Mark| 2018-05-01 00:00:00 | 2018-08-01 00:00:00 | OUT |
| Meggy| 2018-02-01 00:00:00 | 2018-02-01 00:00:00 | OUT |
+---------+---------------------+---------------------+--------+
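To pin down how I read a single history record, here is the same idea as plain Scala values (a sketch only; the MembershipEvent type and its field names are mine, invented for illustration):

case class MembershipEvent(
  name:   String,
  start:  String, // START_DATE: when the IN/OUT process began
  end:    String, // END_DATE: when the IN/OUT process finished
  status: String  // "IN" = joined the organization, "OUT" = left it
)

// Alex began leaving on 2018-01-01; the departure completed on 2018-02-01.
val alexLeft =
  MembershipEvent("Alex", "2018-01-01 00:00:00", "2018-02-01 00:00:00", "OUT")

// Mark joined and later left, so he has two records at different times.
val markHistory = Seq(
  MembershipEvent("Mark", "2018-02-01 00:00:00", "2018-03-01 00:00:00", "IN"),
  MembershipEvent("Mark", "2018-05-01 00:00:00", "2018-08-01 00:00:00", "OUT")
)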
In the end, I'm trying to get a DataFrame like the one below. It must contain all records from the first DataFrame, plus a column indicating whether the person is a member of the organization at the time of the request (REQUEST_DATE). A plain-Scala sketch of the rule I have in mind follows the table.
+---------+-------------------+----------------+
|USER_NAME| REQUEST_DATE| USER_STATUS |
+---------+-------------------+----------------+
| Mark|2018-02-20 00:00:00| Our user |
| Alex|2018-03-01 00:00:00| Not our user |
| Bob|2018-03-01 00:00:00| Our user |
| Mark|2018-07-01 00:00:00| Not our user |
| Kate|2018-07-01 00:00:00| No Information |
+---------+-------------------+----------------+
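If I read my expected output correctly, the rule is: for each request, take the history record with the greatest START_DATE that is not after REQUEST_DATE, and map its STATUS (IN gives "Our user", OUT gives "Not our user", no record at all gives "No Information"). As a plain, non-Spark sketch over in-memory data (the Process type and statusAt are illustrative names, not part of my real code):

case class Process(start: String, end: String, status: String)

// "yyyy-MM-dd HH:mm:ss" strings compare correctly as plain strings,
// so lexicographic comparison stands in for date comparison here.
def statusAt(requestDate: String, history: Seq[Process]): String = {
  val latest = history
    .filter(_.start <= requestDate) // only processes that have started
    .sortBy(_.start)
    .lastOption                     // the most recent one wins
  latest.map(_.status) match {
    case Some("IN")  => "Our user"
    case Some("OUT") => "Not our user"
    case _           => "No Information"
  }
}

// Mark on 2018-07-01: his OUT record (started 2018-05-01) is the latest,
// so the result is "Not our user", matching the table above.
statusAt("2018-07-01 00:00:00", Seq(
  Process("2018-02-01 00:00:00", "2018-03-01 00:00:00", "IN"),
  Process("2018-05-01 00:00:00", "2018-08-01 00:00:00", "OUT")
))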
CODE:
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions._
import spark.implicits._ // from the SparkSession; already in scope in spark-shell

val df1: DataFrame = Seq(
  ("Mark", "2018-02-20 00:00:00"),
  ("Alex", "2018-03-01 00:00:00"),
  ("Bob",  "2018-03-01 00:00:00"),
  ("Mark", "2018-07-01 00:00:00"),
  ("Kate", "2018-07-01 00:00:00")
).toDF("USER_NAME", "REQUEST_DATE")
df1.show()

val df2: DataFrame = Seq(
  ("Alex",  "2018-01-01 00:00:00", "2018-02-01 00:00:00", "OUT"),
  ("Bob",   "2018-02-01 00:00:00", "2018-02-05 00:00:00", "IN"),
  ("Mark",  "2018-02-01 00:00:00", "2018-03-01 00:00:00", "IN"),
  ("Mark",  "2018-05-01 00:00:00", "2018-08-01 00:00:00", "OUT"),
  ("Meggy", "2018-02-01 00:00:00", "2018-02-01 00:00:00", "OUT")
).toDF("NAME", "START_DATE", "END_DATE", "STATUS")
df2.show()

case class UserAndRequest(
  USER_NAME:    String,
  REQUEST_DATE: java.sql.Timestamp,
  START_DATE:   java.sql.Timestamp,
  END_DATE:     java.sql.Timestamp,
  STATUS:       String,
  REQUEST_ID:   Long
)

// Tag each request with a unique id, left-join the membership history by
// name, and cast the string columns to timestamps so that .as[] succeeds
// (Spark refuses to up-cast a string column to a date/timestamp field).
val joined: Dataset[UserAndRequest] = df1
  .withColumn("REQUEST_ID", monotonically_increasing_id())
  .withColumn("REQUEST_DATE", $"REQUEST_DATE".cast("timestamp"))
  .join(
    df2.withColumn("START_DATE", $"START_DATE".cast("timestamp"))
       .withColumn("END_DATE", $"END_DATE".cast("timestamp")),
    $"USER_NAME" === $"NAME", "left")
  .as[UserAndRequest]

// For each request, keep the latest process that started on or before the
// request date. Users with no history (e.g. Kate) form single-row groups,
// so the reducer is never called on their null dates.
def startedBeforeRequest(r: UserAndRequest): Boolean =
  r.START_DATE != null && !r.START_DATE.after(r.REQUEST_DATE)

val lastRowByRequestId = joined
  .groupByKey(_.REQUEST_ID)
  .reduceGroups { (x, y) =>
    if (!startedBeforeRequest(y)) x
    else if (!startedBeforeRequest(x)) y
    else if (x.START_DATE.after(y.START_DATE)) x else y
  }
  .map(_._2)

def logic(status: String): String = {
  if (status == "IN") "Our user"
  else if (status == "OUT") "Not our user"
  else "No Information" // also covers the null STATUS of unmatched users
}

val logicUDF = udf(logic _)

// The UDF must look at STATUS (not REQUEST_DATE) to classify the user.
val finalDF = lastRowByRequestId
  .withColumn("USER_STATUS", logicUDF($"STATUS"))
  .select($"USER_NAME", $"REQUEST_DATE", $"USER_STATUS")
which yields: