How to convert string to date on a column with different date formats
I have a column in my Spark DataFrame, open_date, with string-type values as below in two different formats, yyyyMMdd and yyyyMM:
+---------+
|open_date|
+---------+
| 19500102|
| 195001|
+---------+
and my expected output is
+----------+
| open_date|
+----------+
|1950-01-02|
|1950-01-01|
+----------+
I tried converting this string to a date using pyspark.sql.functions.substr, pyspark.sql.functions.split, and pyspark.sql.functions.regexp_extract. Having limited knowledge of these, none of my attempts succeeded.
How can I convert string to date type on a column with different formats?
You can require that the yyyy and mm are present, but make the dd optional. Break each into its own capture group, filter out the empty group if the dd is missing, then join using '-' delimiters.
>>> import re
>>> s = '19500102 195001'
>>> ['-'.join(filter(None, i)) for i in re.findall(r'(\d{4})(\d{2})(\d{2})?', s)]
['1950-01-02', '1950-01']
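The same regex logic can be wrapped in a small helper for single values (a plain-Python sketch; `normalize_date` is a name introduced here for illustration, not part of the original answer):

```python
import re

def normalize_date(s):
    """Split a yyyyMMdd or yyyyMM string into hyphen-joined parts.

    The dd capture group is optional; filter(None, ...) drops it
    when the input has no day component.
    """
    m = re.fullmatch(r'(\d{4})(\d{2})(\d{2})?', s)
    if m is None:
        return None
    return '-'.join(filter(None, m.groups()))

print(normalize_date('19500102'))  # -> 1950-01-02
print(normalize_date('195001'))    # -> 1950-01
```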
Update 2019-06-24
You can try each of the valid date formats and use pyspark.sql.functions.coalesce to return the first non-null result.
import pyspark.sql.functions as f

def date_from_string(date_str, fmt):
    try:
        # For spark version 2.2 and above, to_date takes in a second argument
        return f.to_date(date_str, fmt).cast("date")
    except TypeError:
        # For spark version 2.1 and below, you'll have to do it this way
        return f.from_unixtime(f.unix_timestamp(date_str, fmt)).cast("date")

possible_date_formats = ["yyyyMMdd", "yyyyMM"]

df = df.withColumn(
    "open_date",
    f.coalesce(*[date_from_string("open_date", fmt) for fmt in possible_date_formats])
)
df.show()
#+----------+
#| open_date|
#+----------+
#|1950-01-02|
#|1950-01-01|
#+----------+
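The coalesce-over-formats idea can be checked without Spark: try each format with `datetime.strptime` and take the first parse that succeeds. A plain-Python sketch of the same logic (note that Spark's `yyyyMMdd` pattern corresponds to Python's `%Y%m%d`, since Spark uses Java SimpleDateFormat syntax rather than strftime):

```python
from datetime import datetime

def parse_first_match(date_str, formats):
    """Return the first successful parse as a date, or None (coalesce-like)."""
    for fmt in formats:
        try:
            return datetime.strptime(date_str, fmt).date()
        except ValueError:
            continue
    return None

formats = ["%Y%m%d", "%Y%m"]  # Python equivalents of yyyyMMdd and yyyyMM
print(parse_first_match("19500102", formats))  # -> 1950-01-02
print(parse_first_match("195001", formats))    # -> 1950-01-01
```

As in the Spark version, format order matters only for ambiguous inputs; here the 8-digit pattern simply fails on 6-digit strings and the next format is tried.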
Original Answer
If you're guaranteed to only have strings that are 6 or 8 characters in length, the simplest thing would be to append "01" to the end of the short strings to specify the first of the month.
Here is an example using pyspark.sql.functions.length() and pyspark.sql.functions.concat():
import pyspark.sql.functions as f

df = df.withColumn(
    'open_date',
    f.when(
        f.length(f.col('open_date')) == 6,
        f.concat(f.col('open_date'), f.lit('01'))
    ).otherwise(f.col('open_date'))
)
df.show()
#+---------+
#|open_date|
#+---------+
#| 19500102|
#| 19500101|
#+---------+
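The padding rule itself is trivial to verify in plain Python (a sketch of the same rule, assuming only 6- or 8-character inputs as the answer states; `pad_to_yyyymmdd` is a name introduced here for illustration):

```python
def pad_to_yyyymmdd(s):
    """Append '01' (first of the month) when only yyyyMM is present."""
    return s + "01" if len(s) == 6 else s

print(pad_to_yyyymmdd("195001"))    # -> 19500101
print(pad_to_yyyymmdd("19500102"))  # -> 19500102
```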
Then use the techniques described in this post (paraphrased below) to convert to a date.
For Spark 2.1 and below:
df = df.withColumn('open_date', f.from_unixtime(f.unix_timestamp('open_date', 'yyyyMMdd')))
For Spark 2.2+:
df = df.withColumn('open_date', f.to_date('open_date', 'yyyyMMdd'))
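Putting the two steps (pad, then parse) together outside Spark gives a quick sanity check of the logic, using Python's `%Y%m%d` in place of Spark's `yyyyMMdd`:

```python
from datetime import datetime

values = ["19500102", "195001"]
padded = [v + "01" if len(v) == 6 else v for v in values]
dates = [datetime.strptime(v, "%Y%m%d").date() for v in padded]
print([d.isoformat() for d in dates])  # -> ['1950-01-02', '1950-01-01']
```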