Spark Scala Timestamp comparison

I have to write code in Spark Scala that works on a DataFrame with the values below.

+-------+-------------------+------+-------------------+-------------------+
|PIN    |REPORT_DATE        |JOB_NO|ISSUED_AT_TIME     |COMMITED_AT_TIME   |
+-------+-------------------+------+-------------------+-------------------+
|CLAAB90|2020-12-17 00:00:00|TEST1 |2020-12-17 09:12:41|2020-12-17 11:10:12|
|CLAAB90|2020-12-17 00:00:00|TEST2 |2020-12-17 11:10:08|2020-12-17 13:10:05|
|CLAAB90|2020-12-17 00:00:00|TEST3 |2020-12-17 13:10:15|2020-12-17 15:10:15|
|CLAAC07|2020-12-17 00:00:00|TEST4 |2020-12-17 11:00:00|2020-12-17 12:10:00|
|CLAAC07|2020-12-17 00:00:00|TEST8 |2020-12-17 12:10:05|2020-12-17 12:15:00|
|CLAAB91|2020-12-17 00:00:00|TEST5 |2020-12-17 10:12:41|2020-12-17 11:10:12|
|CLAAB91|2020-12-17 00:00:00|TEST6 |2020-12-17 11:10:08|2020-12-17 13:10:05|
|CLAAB91|2020-12-17 00:00:00|TEST7 |2020-12-17 13:10:00|2020-12-17 15:10:15|
+-------+-------------------+------+-------------------+-------------------+

Here the columns ISSUED_AT_TIME and COMMITED_AT_TIME are timestamp values. I need to compare the COMMITED_AT_TIME of row 1 with the ISSUED_AT_TIME of row 2 and see which is greater.

I have tried to do this by declaring two variables and a list and then applying an if condition:

var V_AWI_TIME_GLOBAL=""
var V_COM_TIME_GLOBAL=""
var list2_1 = df2.select("ISSUED_AT_TIME").map(r => r.getString(0)).collect.toList

if((list2_1(j) >= V_AWI_TIME_GLOBAL) & (list2_1(j) > V_COM_TIME_GLOBAL))

where the value of each of these variables is as below.

list2_1(j)=2020-12-17 11:10:12  //The List2_1 is a list converted from above dataframe with ISSUED_AT_TIME 
V_AWI_TIME_GLOBAL=2020-12-17 09:12:41
V_COM_TIME_GLOBAL=2020-12-17 11:10:12

Apparently it is treating each of these columns as a string. If I do not apply getString(0) when creating the list from the DataFrame, it creates a list of type Any.
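One way to avoid comparing strings is to work with java.sql.Timestamp values, which is what Spark's Row.getTimestamp(0) returns for a timestamp column (in place of getString(0)). The sketch below is plain Scala without Spark and uses the three values quoted in the question; the object name and the passesCheck helper are made up for illustration. For zero-padded "yyyy-MM-dd HH:mm:ss" strings the lexicographic order happens to match the chronological order, so the typed version mainly makes the intent explicit and stays correct if the format ever changes.

```scala
import java.sql.Timestamp

object TypedTimestampCompare {
  // Hypothetical helper mirroring the question's condition:
  // candidate >= issued AND candidate > committed.
  def passesCheck(candidate: Timestamp, issued: Timestamp, committed: Timestamp): Boolean =
    !candidate.before(issued) && candidate.after(committed)

  def main(args: Array[String]): Unit = {
    // Values taken from the question.
    val candidate = Timestamp.valueOf("2020-12-17 11:10:12") // list2_1(j)
    val issued    = Timestamp.valueOf("2020-12-17 09:12:41") // V_AWI_TIME_GLOBAL
    val committed = Timestamp.valueOf("2020-12-17 11:10:12") // V_COM_TIME_GLOBAL

    // candidate equals committed, so the strict "greater than" part fails.
    println(passesCheck(candidate, issued, committed)) // prints false
  }
}
```

With these inputs the condition is false because the candidate is equal to, not later than, V_COM_TIME_GLOBAL.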

Having read your question, I am still not sure what the exact expected output is. Here is what I understood, which might help:

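import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, when}
import spark.implicits._ // for the $"..." syntax; assumes a SparkSession named spark

// ISSUED_AT_TIME_LAG holds the previous row's COMMITED_AT_TIME within each PIN (rows ordered by JOB_NO)
newDF.withColumn("ISSUED_AT_TIME_LAG", 
  lag($"COMMITED_AT_TIME", 1).over(Window.partitionBy($"PIN").orderBy($"JOB_NO")))
    .withColumn("RESULT",
      when($"ISSUED_AT_TIME_LAG" > $"COMMITED_AT_TIME", $"ISSUED_AT_TIME_LAG").otherwise($"COMMITED_AT_TIME")
    ).drop($"ISSUED_AT_TIME_LAG")
    .show(false)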

Output:

+-------+-------------------+------+-------------------+-------------------+-------------------+
|PIN    |REPORT_DATE        |JOB_NO|ISSUED_AT_TIME     |COMMITED_AT_TIME   |RESULT             |
+-------+-------------------+------+-------------------+-------------------+-------------------+
|CLAAC07|2020-12-17 00:00:00|TEST4 |2020-12-17 11:00:00|2020-12-17 12:10:00|2020-12-17 12:10:00|
|CLAAC07|2020-12-17 00:00:00|TEST8 |2020-12-17 12:10:05|2020-12-17 12:15:00|2020-12-17 12:15:00|
|CLAAB91|2020-12-17 00:00:00|TEST5 |2020-12-17 10:12:41|2020-12-17 11:10:12|2020-12-17 11:10:12|
|CLAAB91|2020-12-17 00:00:00|TEST6 |2020-12-17 11:10:08|2020-12-17 13:10:05|2020-12-17 13:10:05|
|CLAAB91|2020-12-17 00:00:00|TEST7 |2020-12-17 13:10:00|2020-12-17 15:10:15|2020-12-17 15:10:15|
|CLAAB90|2020-12-17 00:00:00|TEST1 |2020-12-17 09:12:41|2020-12-17 11:10:12|2020-12-17 11:10:12|
|CLAAB90|2020-12-17 00:00:00|TEST2 |2020-12-17 11:10:08|2020-12-17 13:10:05|2020-12-17 13:10:05|
|CLAAB90|2020-12-17 00:00:00|TEST3 |2020-12-17 13:10:15|2020-12-17 15:10:15|2020-12-17 15:10:15|
+-------+-------------------+------+-------------------+-------------------+-------------------+
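If the goal is instead to compare the previous row's COMMITED_AT_TIME with the current row's ISSUED_AT_TIME, as the question describes, the same lag technique can be pointed at that pair of columns. The following is only a sketch under assumptions: the object name, the helper names, and the PREV_COMMITED_AT_TIME and OVERLAPS_PREVIOUS columns are invented for illustration, rows are ordered by ISSUED_AT_TIME within each PIN, and only the CLAAB90 rows from the question are used as sample data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, to_timestamp}

object PrevCommitVsIssue {
  // Hypothetical helper: adds the previous job's commit time (per PIN) and a flag
  // that is true when that commit time is later than the current job's issue time.
  def withOverlapFlag(df: DataFrame): DataFrame = {
    val byPin = Window.partitionBy("PIN").orderBy("ISSUED_AT_TIME")
    df.withColumn("PREV_COMMITED_AT_TIME", lag(col("COMMITED_AT_TIME"), 1).over(byPin))
      .withColumn("OVERLAPS_PREVIOUS", col("PREV_COMMITED_AT_TIME") > col("ISSUED_AT_TIME"))
  }

  // Sample rows copied from the question; the strings are converted to real timestamps.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("CLAAB90", "TEST1", "2020-12-17 09:12:41", "2020-12-17 11:10:12"),
      ("CLAAB90", "TEST2", "2020-12-17 11:10:08", "2020-12-17 13:10:05"),
      ("CLAAB90", "TEST3", "2020-12-17 13:10:15", "2020-12-17 15:10:15")
    ).toDF("PIN", "JOB_NO", "ISSUED_AT_TIME", "COMMITED_AT_TIME")
      .withColumn("ISSUED_AT_TIME", to_timestamp(col("ISSUED_AT_TIME")))
      .withColumn("COMMITED_AT_TIME", to_timestamp(col("COMMITED_AT_TIME")))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("prev-commit-vs-issue")
      .getOrCreate()
    withOverlapFlag(sample(spark)).orderBy("PIN", "ISSUED_AT_TIME").show(false)
    spark.stop()
  }
}
```

The first row of each PIN has no predecessor, so both added columns are null there. For the sample rows the flag is true for TEST2 (the previous commit at 11:10:12 is later than the issue at 11:10:08) and false for TEST3.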

I compared the timestamp columns by casting them to the long data type and then taking the difference. It is probably not the most efficient method, but it worked in my case.

    import org.apache.spark.sql.functions.{col, to_timestamp}
    import org.apache.spark.sql.types.LongType

    //Create a view from the dataframe (df2 is expected to have a row_number column)
    df2.createOrReplaceGlobalTempView("RnView_row_1")
    //Create a dataframe followed by a view with 1st Row
    var sql = """select PIN as pin, REPORT_DATE as report_date, JOB_NO as JOB_NO_FIRST,
      JOB_NO as JOB_NO_LAST, ISSUED_AT_TIME as FINAL_AWI, COMMITED_AT_TIME as FINAL_COM,
      1 as OVERLAP_ROWCOUNT_DIVISOR
      from global_temp.RnView_row_1 where row_number=%s""".format(j)
    val temp_df_11 = spark.sql(sql)
    temp_df_11.createOrReplaceGlobalTempView("row_11")
    //Create a dataframe followed by a view with 2nd Row
    sql = """select PIN as pin, REPORT_DATE as report_date, JOB_NO as JOB_NO_FIRST,
      JOB_NO as JOB_NO_LAST, ISSUED_AT_TIME as FINAL_AWI, COMMITED_AT_TIME as FINAL_COM,
      1 as OVERLAP_ROWCOUNT_DIVISOR
      from global_temp.RnView_row_1 where row_number=%s""".format(j + 1)
    val temp_df_12 = spark.sql(sql)
    temp_df_12.createOrReplaceGlobalTempView("row_12")
    //Create a dataframe by joining the two views to achieve a single row with required
    //values from row 1 and row 2 (the views expose JOB_NO as JOB_NO_FIRST / JOB_NO_LAST)
    sql = """select row_11.pin, row_11.report_date, row_11.JOB_NO_FIRST as JOB_NO_FIRST,
      row_12.JOB_NO_LAST as JOB_NO_LAST, row_12.FINAL_AWI as ROW_2_FINAL_AWI,
      row_12.FINAL_COM as ROW_2_FINAL_COM, row_11.FINAL_COM as ROW_1_FINAL_COM,
      row_11.FINAL_AWI as ROW_1_FINAL_AWI, 1 as OVERLAP_ROWCOUNT_DIVISOR
      from global_temp.row_11 as row_11
      join global_temp.row_12 as row_12 on row_11.pin=row_12.pin"""
    val temp_df_13 = spark.sql(sql)
    temp_df_13.createOrReplaceGlobalTempView("Final_view_1")

    //Creating a dataframe where we cast the timestamp to long (epoch seconds) and take the difference
    val temp_df_1 = temp_df_13
      .withColumn("ROW_2_FINAL_AWI", to_timestamp(col("ROW_2_FINAL_AWI")))
      .withColumn("ROW_1_FINAL_COM", to_timestamp(col("ROW_1_FINAL_COM")))
      .withColumn("ROW_1_FINAL_AWI", to_timestamp(col("ROW_1_FINAL_AWI")))
      .withColumn("ROW_2_FINAL_COM", to_timestamp(col("ROW_2_FINAL_COM")))
      .withColumn("Diff_In_row2AWI_and_row1_COM",
        col("ROW_2_FINAL_AWI").cast(LongType) - col("ROW_1_FINAL_COM").cast(LongType))
      .withColumn("Diff_In_row2AWI_and_row1_AWI",
        col("ROW_2_FINAL_AWI").cast(LongType) - col("ROW_1_FINAL_AWI").cast(LongType))
      .withColumn("Diff_In_row1COM_and_row2_COM",
        col("ROW_1_FINAL_COM").cast(LongType) - col("ROW_2_FINAL_COM").cast(LongType))

    //Read the difference from the single joined row (assumed step: the original snippet
    //used this variable without showing how it was obtained)
    val Diff_In_row1COM_and_row2_COM_1 =
      temp_df_1.select("Diff_In_row1COM_and_row2_COM").first().getLong(0)

    //then compare if the difference is greater than or equal to 0 to see which is greater
    if (Diff_In_row1COM_and_row2_COM_1 >= 0) {
      println("hello")
    }

The output with the differences after converting to Long is:

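+-------+-------------------+------------+-----------+-------------------+-------------------+-------------------+-------------------+------------------------+----------------------------+----------------------------+----------------------------+
|pin    |report_date        |JOB_NO_FIRST|JOB_NO_LAST|ROW_2_FINAL_AWI    |ROW_2_FINAL_COM    |ROW_1_FINAL_COM    |ROW_1_FINAL_AWI    |OVERLAP_ROWCOUNT_DIVISOR|Diff_In_row2AWI_and_row1_COM|Diff_In_row2AWI_and_row1_AWI|Diff_In_row1COM_and_row2_COM|
+-------+-------------------+------------+-----------+-------------------+-------------------+-------------------+-------------------+------------------------+----------------------------+----------------------------+----------------------------+
|CLAAB90|2020-12-17 00:00:00|TEST1       |TEST2      |2020-12-17 11:10:08|2020-12-17 13:10:05|2020-12-17 11:10:12|2020-12-17 09:12:41|1                       |-4                          |7047                        |-7193                       |
+-------+-------------------+------------+-----------+-------------------+-------------------+-------------------+-------------------+------------------------+----------------------------+----------------------------+----------------------------+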
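The three differences are in seconds, since casting a Spark timestamp to long yields epoch seconds. As a sanity check, the same arithmetic can be reproduced in plain Scala with java.time. This is a small sketch, not part of the original answer; the object name and the diffSeconds helper are made up for illustration.

```scala
import java.time.{Duration, LocalDateTime}
import java.time.format.DateTimeFormatter

object SecondsBetween {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Hypothetical helper: returns (to - from) in seconds; negative when `to` is earlier.
  def diffSeconds(to: String, from: String): Long =
    Duration.between(LocalDateTime.parse(from, fmt), LocalDateTime.parse(to, fmt)).getSeconds

  def main(args: Array[String]): Unit = {
    // Values of the joined row (TEST1 as row 1, TEST2 as row 2).
    val row1Awi = "2020-12-17 09:12:41"
    val row1Com = "2020-12-17 11:10:12"
    val row2Awi = "2020-12-17 11:10:08"
    val row2Com = "2020-12-17 13:10:05"

    println(diffSeconds(row2Awi, row1Com)) // prints -4
    println(diffSeconds(row2Awi, row1Awi)) // prints 7047
    println(diffSeconds(row1Com, row2Com)) // prints -7193
  }
}
```

The negative first value means that TEST2 was issued 4 seconds before TEST1 was committed, so for this pair the COMMITED_AT_TIME of row 1 is the greater of the two timestamps.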
