PySpark dataframe convert unusual string format to Timestamp
I am using PySpark through Spark 1.5.0. I have an unusual string format in the rows of a column holding datetime values. It looks like this:
Row[(datetime='2016_08_21 11_31_08')]
Is there a way to convert this unorthodox yyyy_mm_dd hh_mm_ss format into a Timestamp? Something along the lines of
df = df.withColumn("date_time",df.datetime.astype('Timestamp'))
I had thought that Spark SQL functions like regexp_replace could work, but of course I would need to replace _ with - in the date half and _ with : in the time part.
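Outside Spark, the two-part substitution described above can be sketched in plain Python, just to see what it would produce (the variable names here are illustrative, not from the original post):

```python
# Split the raw value into its date and time halves, then apply the two
# different substitutions: "_" -> "-" in the date, "_" -> ":" in the time.
raw = "2016_08_21 11_31_08"
date_part, time_part = raw.split(" ")
normalized = date_part.replace("_", "-") + " " + time_part.replace("_", ":")
# normalized == "2016-08-21 11:31:08", a format Spark can cast directly.
```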
I was thinking I could split the column in two using substring and count backward from the end of the string. Then do the regexp_replace separately on each half, then concatenate. But doesn't that seem like a lot of operations? Is there an easier way?
Spark >= 2.2
from pyspark.sql import Row
from pyspark.sql.functions import to_timestamp

(sc
    .parallelize([Row(dt='2016_08_21 11_31_08')])
    .toDF()
    .withColumn("parsed", to_timestamp("dt", "yyyy_MM_dd HH_mm_ss"))
    .show(1, False))
## +-------------------+-------------------+
## |dt |parsed |
## +-------------------+-------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08|
## +-------------------+-------------------+
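As a quick sanity check of the pattern itself (outside Spark), the Java pattern `yyyy_MM_dd HH_mm_ss` maps to Python's `%Y_%m_%d %H_%M_%S`:

```python
from datetime import datetime

# Python strptime equivalent of the Java pattern "yyyy_MM_dd HH_mm_ss"
parsed = datetime.strptime("2016_08_21 11_31_08", "%Y_%m_%d %H_%M_%S")
# parsed is datetime(2016, 8, 21, 11, 31, 8)
```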
Spark < 2.2
It is nothing that unix_timestamp cannot handle:
from pyspark.sql import Row
from pyspark.sql.functions import unix_timestamp
(sc
.parallelize([Row(dt='2016_08_21 11_31_08')])
.toDF()
.withColumn("parsed", unix_timestamp("dt", "yyyy_MM_dd HH_mm_ss")
# For Spark <= 1.5
# See issues.apache.org/jira/browse/SPARK-11724
.cast("double")
.cast("timestamp"))
.show(1, False))
## +-------------------+---------------------+
## |dt |parsed |
## +-------------------+---------------------+
## |2016_08_21 11_31_08|2016-08-21 11:31:08.0|
## +-------------------+---------------------+
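For intuition: unix_timestamp produces seconds since the Unix epoch, and the cast back to timestamp is its inverse. A rough pure-Python equivalent (pinned to UTC here for determinism, whereas Spark actually uses the session time zone):

```python
from datetime import datetime, timezone

# unix_timestamp: datetime -> epoch seconds (UTC assumed for this sketch)
secs = datetime(2016, 8, 21, 11, 31, 8, tzinfo=timezone.utc).timestamp()
# cast("timestamp"): epoch seconds -> datetime
restored = datetime.fromtimestamp(secs, tz=timezone.utc)
# restored == 2016-08-21 11:31:08 UTC again
```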
In both cases the format string should be compatible with Java SimpleDateFormat.
zero323's answer answers the question, but I wanted to add that if your datetime string has a standard format, you should be able to cast it directly to the timestamp type:
df.withColumn('datetime', col('datetime_str').cast('timestamp'))
This has the advantage of handling milliseconds, while unix_timestamp only has second precision (to_timestamp works with milliseconds too, but requires Spark >= 2.2, as zero323 stated). I tested it on Spark 2.3.0 using the following format: '2016-07-13 14:33:53.979' (with milliseconds, but it also works without them).
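For reference, the millisecond-bearing string above also round-trips outside Spark with Python's strptime (`%f` parses the fractional part; this is just an illustrative check, not part of the original answer):

```python
from datetime import datetime

dt = datetime.strptime("2016-07-13 14:33:53.979", "%Y-%m-%d %H:%M:%S.%f")
# The .979 milliseconds survive as 979000 microseconds.
```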
I fully agree with the selected answer, but I want to point out that the format should be 'yyyy_MM_dd HH_mm_ss' (capital HH) to avoid problems with timestamps such as '2019_01_27 16_00_00', where the hour is greater than 12.
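The hh-versus-HH distinction has a direct analogue in Python's strptime (%I is the 12-hour field, %H the 24-hour one), which makes the failure mode easy to demonstrate; note, though, that a lenient Java SimpleDateFormat may silently mis-parse rather than fail:

```python
from datetime import datetime

# 24-hour field: hour 16 parses fine.
ok = datetime.strptime("2019_01_27 16_00_00", "%Y_%m_%d %H_%M_%S")

# 12-hour field: hour 16 is out of range, so parsing raises ValueError.
failed = False
try:
    datetime.strptime("2019_01_27 16_00_00", "%Y_%m_%d %I_%M_%S")
except ValueError:
    failed = True
```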
I add some more code lines from Florent F's answer, for better understanding and for running the snippet on a local machine:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

sc = pyspark.SparkContext('local[*]')
spark = SparkSession.builder.getOrCreate()

# preparing some example data - df1 with String type and df2 with Timestamp type
df1 = sc.parallelize([{"key": "a", "date": "2016-02-01"},
                      {"key": "b", "date": "2016-02-02"}]).toDF()
df1.show()

df2 = df1.withColumn('datetime', col('date').cast("timestamp"))
df2.show()