
Create two rows based on a date time column in Spark using Scala

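I have a DF with the columns session_start and session_stop. I need to create an additional row whenever the start and the stop fall on different dates. For example, given this df: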

session_start session_stop
01-05-2021 23:11:40 02-05-2021 02:13:25

the new output df should split it into two rows, like this:

session_start session_stop
01-05-2021 23:11:40 01-05-2021 23:59:59
02-05-2021 00:00:00 02-05-2021 02:13:25

All the other columns should stay the same in both rows.

You can use a flatMap operation on the DF.

The function you pass to flatMap would produce either one or two rows for each input row.
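A minimal sketch of what such a function could look like. The case class `Session` and the function name `splitAtMidnight` are hypothetical, the `user` field is only a stand-in for "all the other columns", and the dates from the question are assumed to be in dd-MM-yyyy format. The sketch is written in spark-shell / notebook style and runs flatMap on a plain Scala Seq so that it works without a Spark session.

```scala
import java.sql.Timestamp

// Hypothetical row type; `user` stands in for any other columns to carry over.
case class Session(user: String, session_start: Timestamp, session_stop: Timestamp)

// Returns one row if the session stays within a single calendar day,
// otherwise two rows split at midnight.
def splitAtMidnight(s: Session): Seq[Session] = {
  val startDate = s.session_start.toLocalDateTime.toLocalDate
  val stopDate  = s.session_stop.toLocalDateTime.toLocalDate
  if (startDate.isBefore(stopDate)) {
    // Last second of the start day and first second of the stop day.
    val endOfStartDay  = Timestamp.valueOf(startDate.atTime(23, 59, 59))
    val startOfStopDay = Timestamp.valueOf(stopDate.atStartOfDay)
    Seq(
      s.copy(session_stop = endOfStartDay),
      s.copy(session_start = startOfStopDay)
    )
  } else {
    Seq(s)
  }
}

// The example from the question, assuming 01-05-2021 means 1 May 2021.
val sessions = Seq(
  Session("u1", Timestamp.valueOf("2021-05-01 23:11:40"), Timestamp.valueOf("2021-05-02 02:13:25"))
)
val split = sessions.flatMap(splitAtMidnight)
```

With Spark, the same function should be usable as `df.as[Session].flatMap(splitAtMidnight)` after `import spark.implicits._`, provided the column names and types of the DF match the case class. Note that a session spanning more than two calendar days yields only its first and last pieces; the days in between are not covered.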

I did it without the flatMap function. I created a UDF, generateOverlappedSessionsFromTimestampRanges, to do the transformation and used it as follows:

// UDF
import java.sql.Timestamp
import java.time.temporal.ChronoUnit
import org.apache.spark.sql.functions.udf

// Returns one (start, stop) pair when the session stays within a single day,
// or two pairs split at midnight when it ends on a later day.
val generateOverlappedSessionsFromTimestampRanges = udf { (localStartTimestamp: Timestamp, localEndTimestamp: Timestamp) =>
    val localStartLdt = localStartTimestamp.toLocalDateTime
    val localEndLdt = localEndTimestamp.toLocalDateTime

    if (localStartLdt.toLocalDate.until(localEndLdt.toLocalDate, ChronoUnit.DAYS) > 0) {
      // Last second of the start day and first second of the end day.
      val newLocalEndLdt = localStartLdt.toLocalDate.atTime(23, 59, 59)
      val newLocalStartLdt = localEndLdt.toLocalDate.atStartOfDay
      List(
        (localStartTimestamp, Timestamp.valueOf(newLocalEndLdt)),
        (Timestamp.valueOf(newLocalStartLdt), localEndTimestamp)
      )
    } else {
      List((localStartTimestamp, localEndTimestamp))
    }
  }
// Unit test case for the above UDF
import java.sql.Timestamp
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for toDF; assumes a SparkSession named spark, as in a notebook or spark-shell

val timestamps: Seq[(Timestamp, Timestamp)] = Seq(
  (Timestamp.valueOf("2020-02-10 22:07:25.000"),
   Timestamp.valueOf("2020-02-11 02:07:25.000"))
)
val timestampsDf = timestamps.toDF("local_session_start_timestamp", "local_session_stop_timestamp")
val output = timestampsDf
        .withColumn("to_be_explode", generateOverlappedSessionsFromTimestampRanges(
          timestampsDf("local_session_start_timestamp"),
          timestampsDf("local_session_stop_timestamp")))
        .withColumn("exploded_session_time", explode(col("to_be_explode")))
        .withColumn("new_local_session_start", col("exploded_session_time._1"))
        .withColumn("new_local_session_stop", col("exploded_session_time._2"))
        .drop("to_be_explode", "exploded_session_time")
display(output) // Databricks notebook helper; use output.show(false) elsewhere

// Applying the UDF to the DataFrame from the question; session_start and
// session_stop are overwritten, all other columns are carried over unchanged.
df.withColumn("to_be_explode", generateOverlappedSessionsFromTimestampRanges(df("session_start"), df("session_stop")))
        .withColumn("exploded_session_time", explode(col("to_be_explode")))
        .withColumn("session_start", col("exploded_session_time._1"))
        .withColumn("session_stop", col("exploded_session_time._2"))
        .drop("to_be_explode", "exploded_session_time")

