
How to reduce a timestamp column value in a PySpark data-frame by 1 ms

I have a PySpark data-frame with a timestamp column, and I want to reduce the timestamp by 1 ms. Is there a built-in function available in Spark for handling such a scenario?

For example, a value in the timestamp column: 2020-07-13 17:29:36
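For reference, outside Spark the same operation on a plain Python `datetime` is a simple `timedelta` subtraction — a minimal sketch of what the question is asking for:

```python
from datetime import datetime, timedelta

# Parse the example value and subtract 1 millisecond.
ts = datetime.strptime('2020-07-13 17:29:36', '%Y-%m-%d %H:%M:%S')
reduced = ts - timedelta(milliseconds=1)

print(reduced)  # 2020-07-13 17:29:35.999000
```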

You can do this by casting the timestamp to double type.

import pyspark.sql.functions as f

df = spark.createDataFrame([(1, '2020-07-13 17:29:36')], ['id', 'time'])

# cast to epoch seconds (double), subtract 0.001 s, cast back to timestamp
df.withColumn('time', f.to_timestamp('time', 'yyyy-MM-dd HH:mm:ss')) \
  .withColumn('timediff', (f.col('time').cast('double') - f.lit(0.001)).cast('timestamp')) \
  .show(10, False)

+---+-------------------+-----------------------+
|id |time               |timediff               |
+---+-------------------+-----------------------+
|1  |2020-07-13 17:29:36|2020-07-13 17:29:35.999|
+---+-------------------+-----------------------+
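The double cast works because Spark represents a timestamp cast to `double` as epoch seconds, so subtracting 0.001 shifts it back by exactly 1 ms. The same round-trip can be sketched in plain Python with the standard `datetime` module (outside Spark) to see what the cast is doing:

```python
from datetime import datetime, timedelta

t = datetime(2020, 7, 13, 17, 29, 36)

# timestamp() gives epoch seconds as a float; subtracting 0.001 s mirrors
# Spark's cast('double') - 0.001 followed by cast('timestamp').
shifted = datetime.fromtimestamp(t.timestamp() - 0.001)

print(shifted)                                   # 2020-07-13 17:29:35.999000
print(shifted == t - timedelta(milliseconds=1))  # True
```

Floating-point epoch seconds have roughly sub-microsecond resolution for current dates, so a 1 ms shift survives the round-trip without visible precision loss.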

You can use pyspark.sql.functions.expr to subtract INTERVAL 1 milliseconds.

from pyspark.sql.functions import expr

df = spark.createDataFrame([('2020-07-13 17:29:36',)], ['time'])
df = df.withColumn('time2', expr("time - INTERVAL 1 milliseconds"))
df.show(truncate=False)
#+-------------------+-----------------------+
#|time               |time2                  |
#+-------------------+-----------------------+
#|2020-07-13 17:29:36|2020-07-13 17:29:35.999|
#+-------------------+-----------------------+

Even if time is a string in this format, Spark will make an implicit conversion for you.

df.printSchema()
#root
# |-- time: string (nullable = true)
# |-- time2: string (nullable = true)

You could also use INTERVAL with expr.

import pyspark.sql.functions as F

df = spark.createDataFrame(
    [
        (1, '2020-07-13 17:29:36')
    ], 
    [
        'id', 'time'
    ]
)

df.withColumn(
    'time', 
    F.col('time').cast('timestamp')
).withColumn(
    'timediff', 
    (
        F.col('time') -  F.expr('INTERVAL 1 milliseconds')
    ).cast('timestamp') 
).show(truncate=False)
