Spark Scala Timestamp comparison
I have to write code in Spark Scala in which I have a dataframe with the values below.
| PIN | REPORT_DATE | JOB_NO | ISSUED_AT_TIME | COMMITED_AT_TIME |
|---|---|---|---|---|
| CLAAB90 | 2020-12-17 00:00:00 | TEST1 | 2020-12-17 09:12:41 | 2020-12-17 11:10:12 |
| CLAAB90 | 2020-12-17 00:00:00 | TEST2 | 2020-12-17 11:10:08 | 2020-12-17 13:10:05 |
| CLAAB90 | 2020-12-17 00:00:00 | TEST3 | 2020-12-17 13:10:15 | 2020-12-17 15:10:15 |
| CLAAC07 | 2020-12-17 00:00:00 | TEST4 | 2020-12-17 11:00:00 | 2020-12-17 12:10:00 |
| CLAAC07 | 2020-12-17 00:00:00 | TEST8 | 2020-12-17 12:10:05 | 2020-12-17 12:15:00 |
| CLAAB91 | 2020-12-17 00:00:00 | TEST5 | 2020-12-17 10:12:41 | 2020-12-17 11:10:12 |
| CLAAB91 | 2020-12-17 00:00:00 | TEST6 | 2020-12-17 11:10:08 | 2020-12-17 13:10:05 |
| CLAAB91 | 2020-12-17 00:00:00 | TEST7 | 2020-12-17 13:10:00 | 2020-12-17 15:10:15 |
Here the columns ISSUED_AT_TIME and COMMITED_AT_TIME are timestamp values. I need to compare the COMMITED_AT_TIME of row 1 with the ISSUED_AT_TIME of row 2 and see which is greater.
I have tried to do this by declaring two variables and a list and then applying an if condition:
var V_AWI_TIME_GLOBAL = ""
var V_COM_TIME_GLOBAL = ""
var list2_1 = df2.select("ISSUED_AT_TIME").map(r => r.getString(0)).collect.toList
if ((list2_1(j) >= V_AWI_TIME_GLOBAL) && (list2_1(j) > V_COM_TIME_GLOBAL))
where the value of each of these variables is as below:
list2_1(j) = 2020-12-17 11:10:12  // list2_1 is the list of ISSUED_AT_TIME values collected from the dataframe above
V_AWI_TIME_GLOBAL = 2020-12-17 09:12:41
V_COM_TIME_GLOBAL = 2020-12-17 11:10:12
Apparently it is treating each of these columns as a string. If I do not apply getString(0) when creating the list from the dataframe, it creates a list of type Any.
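As a side note, the string comparison is not necessarily wrong here: because "yyyy-MM-dd HH:mm:ss" is zero-padded and fixed-width, lexicographic order matches chronological order. Still, parsing into `java.sql.Timestamp` (the JVM type Spark uses for timestamp columns) gives a type-correct comparison. A minimal Spark-free sketch, using two sample values from the question:

```scala
import java.sql.Timestamp

// Sample values taken from the question.
val issued   = "2020-12-17 11:10:12"  // an ISSUED_AT_TIME value
val commited = "2020-12-17 09:12:41"  // a COMMITED_AT_TIME value

// String comparison works for this fixed-width, zero-padded format...
val stringSaysLater = issued > commited

// ...but parsing to java.sql.Timestamp is the safer, type-aware comparison.
val tsSaysLater = Timestamp.valueOf(issued).after(Timestamp.valueOf(commited))

// Both agree: issued is later than commited.
```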
As I go through your question, I still didn't get the exact expected output. Here is what I understand, and it might help:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, when}

newDF.withColumn("ISSUED_AT_TIME_LAG",
    lag($"COMMITED_AT_TIME", 1).over(Window.partitionBy($"PIN").orderBy($"JOB_NO")))
  .withColumn("RESULT",
    when($"ISSUED_AT_TIME_LAG" > $"COMMITED_AT_TIME", $"ISSUED_AT_TIME_LAG")
      .otherwise($"COMMITED_AT_TIME"))
  .drop($"ISSUED_AT_TIME_LAG")
  .show(false)
Output:
+-------+-------------------+------+-------------------+-------------------+-------------------+
|PIN |REPORT_DATE |JOB_NO|ISSUED_AT_TIME |COMMITED_AT_TIME |RESULT |
+-------+-------------------+------+-------------------+-------------------+-------------------+
|CLAAC07|2020-12-17 00:00:00|TEST4 |2020-12-17 11:00:00|2020-12-17 12:10:00|2020-12-17 12:10:00|
|CLAAC07|2020-12-17 00:00:00|TEST8 |2020-12-17 12:10:05|2020-12-17 12:15:00|2020-12-17 12:15:00|
|CLAAB91|2020-12-17 00:00:00|TEST5 |2020-12-17 10:12:41|2020-12-17 11:10:12|2020-12-17 11:10:12|
|CLAAB91|2020-12-17 00:00:00|TEST6 |2020-12-17 11:10:08|2020-12-17 13:10:05|2020-12-17 13:10:05|
|CLAAB91|2020-12-17 00:00:00|TEST7 |2020-12-17 13:10:00|2020-12-17 15:10:15|2020-12-17 15:10:15|
|CLAAB90|2020-12-17 00:00:00|TEST1 |2020-12-17 09:12:41|2020-12-17 11:10:12|2020-12-17 11:10:12|
|CLAAB90|2020-12-17 00:00:00|TEST2 |2020-12-17 11:10:08|2020-12-17 13:10:05|2020-12-17 13:10:05|
|CLAAB90|2020-12-17 00:00:00|TEST3 |2020-12-17 13:10:15|2020-12-17 15:10:15|2020-12-17 15:10:15|
+-------+-------------------+------+-------------------+-------------------+-------------------+
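To make the window logic above concrete, here is a Spark-free sketch (the case class and sample rows are illustrative, using values from the question) that reproduces what `Window.partitionBy("PIN").orderBy("JOB_NO")` plus `lag(1)` pairs up, via plain `groupBy`, `sortBy`, and `sliding(2)`:

```scala
// Spark-free sketch: within each PIN, pair each row with its predecessor
// (what lag(1) over the window does) and compare the previous row's
// COMMITED_AT_TIME with the current row's ISSUED_AT_TIME.
case class Job(pin: String, jobNo: String, issuedAt: String, commitedAt: String)

val rows = List(
  Job("CLAAB90", "TEST1", "2020-12-17 09:12:41", "2020-12-17 11:10:12"),
  Job("CLAAB90", "TEST2", "2020-12-17 11:10:08", "2020-12-17 13:10:05"),
  Job("CLAAB90", "TEST3", "2020-12-17 13:10:15", "2020-12-17 15:10:15")
)

// For each consecutive pair within a PIN: did the next job get issued before
// the previous one was committed (i.e. an overlap)?
val overlaps = rows
  .groupBy(_.pin)
  .values
  .flatMap { group =>
    group.sortBy(_.jobNo).sliding(2).collect {
      case List(prev, cur) => (prev.jobNo, cur.jobNo, cur.issuedAt < prev.commitedAt)
    }
  }
  .toList

// TEST1 -> TEST2 overlaps (11:10:08 < 11:10:12); TEST2 -> TEST3 does not.
```

The string comparison inside `collect` is valid here only because the timestamps all share the zero-padded "yyyy-MM-dd HH:mm:ss" format, where lexicographic order matches chronological order.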
I tried to compare the timestamp columns by casting them to the long datatype and then taking the difference. Probably not the most efficient method, but it worked in my case.
//Create a view from the dataframe
df2.createOrReplaceGlobalTempView("RnView_row_1")
//Create a dataframe followed by a view with 1st Row
var sql = ("select PIN as pin, REPORT_DATE as report_date, JOB_NO as JOB_NO_FIRST, " +
  "JOB_NO as JOB_NO_LAST, ISSUED_AT_TIME as FINAL_AWI, COMMITED_AT_TIME as FINAL_COM, " +
  "1 as OVERLAP_ROWCOUNT_DIVISOR from global_temp.RnView_row_1 where row_number=%s").format(j)
var temp_df_11=spark.sql(sql)
temp_df_11.createOrReplaceGlobalTempView("row_11")
//Create a dataframe followed by a view with 2nd Row
sql = ("select PIN as pin, REPORT_DATE as report_date, JOB_NO as JOB_NO_FIRST, " +
  "JOB_NO as JOB_NO_LAST, ISSUED_AT_TIME as FINAL_AWI, COMMITED_AT_TIME as FINAL_COM, " +
  "1 as OVERLAP_ROWCOUNT_DIVISOR from global_temp.RnView_row_1 where row_number=%s").format(j + 1)
var temp_df_12=spark.sql(sql)
temp_df_12.createOrReplaceGlobalTempView("row_12")
//Create a dataframe by joining the two views to achieve a single row with the
//required values from row 1 and row 2 (the views already alias JOB_NO as
//JOB_NO_FIRST / JOB_NO_LAST, so select those names directly)
sql = ("select row_11.pin, row_11.report_date, row_11.JOB_NO_FIRST, " +
  "row_12.JOB_NO_LAST, row_12.FINAL_AWI as ROW_2_FINAL_AWI, row_12.FINAL_COM as ROW_2_FINAL_COM, " +
  "row_11.FINAL_COM as ROW_1_FINAL_COM, row_11.FINAL_AWI as ROW_1_FINAL_AWI, " +
  "1 as OVERLAP_ROWCOUNT_DIVISOR from global_temp.row_11 join global_temp.row_12 " +
  "on row_11.pin = row_12.pin")
var temp_df_13=spark.sql(sql)
temp_df_13.createOrReplaceGlobalTempView("Final_view_1")
//Creating a dataframe where we cast the timestamps to long (epoch seconds)
//and take the differences
import org.apache.spark.sql.functions.{col, to_timestamp}
import org.apache.spark.sql.types.LongType

var temp_df_1 = temp_df_13
  .withColumn("ROW_2_FINAL_AWI", to_timestamp(col("ROW_2_FINAL_AWI")))
  .withColumn("ROW_1_FINAL_COM", to_timestamp(col("ROW_1_FINAL_COM")))
  .withColumn("ROW_1_FINAL_AWI", to_timestamp(col("ROW_1_FINAL_AWI")))
  .withColumn("ROW_2_FINAL_COM", to_timestamp(col("ROW_2_FINAL_COM")))
  .withColumn("Diff_In_row2AWI_and_row1_COM",
    col("ROW_2_FINAL_AWI").cast(LongType) - col("ROW_1_FINAL_COM").cast(LongType))
  .withColumn("Diff_In_row2AWI_and_row1_AWI",
    col("ROW_2_FINAL_AWI").cast(LongType) - col("ROW_1_FINAL_AWI").cast(LongType))
  .withColumn("Diff_In_row1COM_and_row2_COM",
    col("ROW_1_FINAL_COM").cast(LongType) - col("ROW_2_FINAL_COM").cast(LongType))
//then pull the difference out of the single-row dataframe and compare it
//against 0 to see which timestamp is greater
var Diff_In_row1COM_and_row2_COM_1 =
  temp_df_1.select("Diff_In_row1COM_and_row2_COM").first.getLong(0)
if (Diff_In_row1COM_and_row2_COM_1 >= 0)
{
  println("hello")
}
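The cast-to-long trick works because casting a Spark timestamp to `LongType` yields epoch seconds, so the differences are plain second counts. A Spark-free check of the same arithmetic with `java.time`, using the CLAAB90 values from the question:

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

// Parse "yyyy-MM-dd HH:mm:ss" strings and compute a - b in whole seconds,
// mirroring cast(LongType) subtraction on Spark timestamp columns.
val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
def secondsBetween(a: String, b: String): Long =
  ChronoUnit.SECONDS.between(LocalDateTime.parse(b, fmt), LocalDateTime.parse(a, fmt))

val diffAwi2Com1 = secondsBetween("2020-12-17 11:10:08", "2020-12-17 11:10:12") // -4
val diffAwi2Awi1 = secondsBetween("2020-12-17 11:10:08", "2020-12-17 09:12:41") // 7047
val diffCom1Com2 = secondsBetween("2020-12-17 11:10:12", "2020-12-17 13:10:05") // -7193
```

These are exactly the three difference columns shown for the CLAAB90 row, so a positive value means the first timestamp is the later one.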
The output with the differences after converting to Long is:
| pin | report_date | JOB_NO_FIRST | JOB_NO_LAST | ROW_2_FINAL_AWI | ROW_2_FINAL_COM | ROW_1_FINAL_COM | ROW_1_FINAL_AWI | OVERLAP_ROWCOUNT_DIVISOR | Diff_In_row2AWI_and_row1_COM | Diff_In_row2AWI_and_row1_AWI | Diff_In_row1COM_and_row2_COM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CLAAB90 | 2020-12-17 00:00:00 | TEST1 | TEST2 | 2020-12-17 11:10:08 | 2020-12-17 13:10:05 | 2020-12-17 11:10:12 | 2020-12-17 09:12:41 | 1 | -4 | 7047 | -7193 |