
Spark 2.0 Timestamp Difference in Milliseconds using Scala

I am using Spark 2.0 and looking for a way to achieve the following in Scala:

I need the timestamp difference, in milliseconds, between two DataFrame column values.

Value_1 = 06/13/2017 16:44:20.044
Value_2 = 06/13/2017 16:44:21.067

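Both columns have the data type timestamp.

Note: applying the function unix_timestamp(Column s) to both values and subtracting works, but only at second resolution, so it does not give the required milliseconds.

The final query would look like this: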

SELECT timestamp_diff(Value_2, Value_1) FROM table1

This should return the following output:

1023 milliseconds

where timestamp_diff is the function that calculates the difference in milliseconds.

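One approach is to use Unix epoch time, i.e. the number of milliseconds since January 1, 1970. Below is an example using a UDF that takes two timestamps and returns the difference between them in milliseconds.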

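import java.sql.Timestamp
import org.apache.spark.sql.functions.{col, udf}

// The first argument is the later timestamp; the result is end - start in milliseconds.
val timestamp_diff = udf((endTime: Timestamp, startTime: Timestamp) => {
  endTime.getTime - startTime.getTime
})

// df: a DataFrame with two timestamp columns (col1 and col2)
val result = df.withColumn("diff", timestamp_diff(col("col2"), col("col1")))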
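As a quick sanity check of the epoch arithmetic, here is a minimal plain-Scala sketch (no Spark required) that applies the same `getTime` subtraction to the two values from the question. The object name `EpochDiffCheck` and the helper `diffMillis` are made up for illustration, and the timestamps are rewritten in the `yyyy-MM-dd HH:mm:ss.SSS` form that `java.sql.Timestamp.valueOf` expects.

```scala
import java.sql.Timestamp

object EpochDiffCheck {
  // Hypothetical helper: end minus start, in milliseconds since the Unix epoch.
  def diffMillis(end: Timestamp, start: Timestamp): Long =
    end.getTime - start.getTime

  def main(args: Array[String]): Unit = {
    val value1 = Timestamp.valueOf("2017-06-13 16:44:20.044")
    val value2 = Timestamp.valueOf("2017-06-13 16:44:21.067")
    println(diffMillis(value2, value1)) // prints 1023
  }
}
```

The later timestamp goes first, matching the argument order timestamp_diff(Value_2, Value_1) asked for in the question.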

Alternatively, you can register the function for use with SQL commands:

import java.sql.Timestamp

// Same logic as a plain function: end - start in milliseconds.
val timestamp_diff = (endTime: Timestamp, startTime: Timestamp) => {
  endTime.getTime - startTime.getTime
}

spark.udf.register("timestamp_diff", timestamp_diff)
df.createOrReplaceTempView("table1")

val df2 = spark.sql("SELECT *, timestamp_diff(col2, col1) AS diff FROM table1")
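One thing to watch with either variant: if a row holds a null timestamp, calling `getTime` on it throws a `NullPointerException`. A minimal sketch of a null-tolerant version follows, written as plain Scala so it can be tried without a cluster. The names `NullSafeDiff` and `diffMillis` are hypothetical, and wrapping the function with `udf(...)` as in the commented line is an assumption about how you would plug it into the code above.

```scala
import java.sql.Timestamp

object NullSafeDiff {
  // Returns None when either side is null, so the caller gets an empty
  // result instead of a NullPointerException.
  def diffMillis(end: Timestamp, start: Timestamp): Option[Long] =
    for {
      e <- Option(end)
      s <- Option(start)
    } yield e.getTime - s.getTime

  // Assumed wiring, not run here:
  //   val timestamp_diff = org.apache.spark.sql.functions.udf(diffMillis _)

  def main(args: Array[String]): Unit = {
    val start = Timestamp.valueOf("2017-06-13 16:44:20.044")
    val end   = Timestamp.valueOf("2017-06-13 16:44:21.067")
    println(diffMillis(end, start)) // prints Some(1023)
    println(diffMillis(end, null))  // prints None
  }
}
```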

The same in PySpark:

import datetime

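def timestamp_diff(time1: datetime.datetime, time2: datetime.datetime) -> int:
    # round() keeps a float product that lands just below a whole number
    # from being truncated to the previous millisecond.
    return int(round((time1 - time2).total_seconds() * 1000))

The multiplication by 1000 and the conversion to int are only there to output whole milliseconds.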

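Usage example:

from pyspark.sql.types import LongType

# Without an explicit return type, the registered UDF returns strings.
spark.udf.register("timestamp_diff", timestamp_diff, LongType())

df.createOrReplaceTempView("table1")

df2 = spark.sql("SELECT *, timestamp_diff(col2, col1) AS diff FROM table1")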

This is not the optimal solution: UDFs are generally slow, so you may run into performance problems.

A bit late to the party, but I hope this is still useful.

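import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

// Timestamp -> fractional epoch seconds -> whole milliseconds
def getUnixTimestamp(col: Column): Column = (col.cast("double") * 1000).cast("long")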

df.withColumn("diff", getUnixTimestamp(col("col2")) - getUnixTimestamp(col("col1")))

Of course, you can define a separate method for the difference:

def timestampDiff(col1: Column, col2: Column): Column = getUnixTimestamp(col2) - getUnixTimestamp(col1)

df.withColumn("diff", timestampDiff(col("col1"), col("col2")))

To make life easier, you can define an overloaded method that takes String column names and gives the result the default name diff:

def timestampDiff(col1: String, col2: String): Column = timestampDiff(col(col1), col(col2)).as("diff")

Now in action:

scala> df.show(false)
+-----------------------+-----------------------+
|min_time               |max_time               |
+-----------------------+-----------------------+
|1970-01-01 01:00:02.345|1970-01-01 01:00:04.786|
|1970-01-01 01:00:23.857|1970-01-01 01:00:23.999|
|1970-01-01 01:00:02.325|1970-01-01 01:01:07.688|
|1970-01-01 01:00:34.235|1970-01-01 01:00:34.444|
|1970-01-01 01:00:34.235|1970-01-01 01:00:34.454|
+-----------------------+-----------------------+


scala> df.withColumn("diff", timestampDiff("min_time", "max_time")).show(false)
+-----------------------+-----------------------+-----+
|min_time               |max_time               |diff |
+-----------------------+-----------------------+-----+
|1970-01-01 01:00:02.345|1970-01-01 01:00:04.786|2441 |
|1970-01-01 01:00:23.857|1970-01-01 01:00:23.999|142  |
|1970-01-01 01:00:02.325|1970-01-01 01:01:07.688|65363|
|1970-01-01 01:00:34.235|1970-01-01 01:00:34.444|209  |
|1970-01-01 01:00:34.235|1970-01-01 01:00:34.454|219  |
+-----------------------+-----------------------+-----+


scala> df.select(timestampDiff("min_time", "max_time")).show(false)
+-----+
|diff |
+-----+
|2441 |
|142  |
|65363|
|209  |
|219  |
+-----+
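One caveat about the `cast("double")` route: the value passes through floating point before `.cast("long")` truncates it, so a product that lands just below a whole millisecond could, in principle, come out one millisecond short. The plain-Scala sketch below imitates that round trip on a microsecond count and rounds instead of truncating. `DoubleRoundTrip`, `toMillisRounded` and the sample value are invented for illustration, and the idea that Spark's cast yields fractional epoch seconds is treated here as an assumption rather than checked against a specific version.

```scala
object DoubleRoundTrip {
  // Imitates timestamp -> double seconds -> milliseconds, rounding at the end
  // instead of truncating.
  def toMillisRounded(epochMicros: Long): Long =
    math.round(epochMicros / 1e6 * 1000)

  def main(args: Array[String]): Unit = {
    val micros = 1497372261067000L // arbitrary sample instant with a .067 s fraction
    println(toMillisRounded(micros)) // prints 1497372261067
  }
}
```

In Spark terms this would correspond to wrapping the product in `round(...)` from `org.apache.spark.sql.functions` before the final cast to long; whether that is worth the extra step depends on how much an occasional off-by-one millisecond matters for your data.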
