How to convert string to date on a column with different date formats
I have a column in my Spark DataFrame, open_date, with string-type values as below in two different formats, yyyyMMdd and yyyyMM:
+---------+
|open_date|
+---------+
| 19500102|
| 195001|
+---------+
and my expected output is
+----------+
| open_date|
+----------+
|1950-01-02|
|1950-01-01|
+----------+
I tried converting this string to a date using pyspark.sql.functions.substr, pyspark.sql.functions.split, and pyspark.sql.functions.regexp_extract. Having limited knowledge of these, none of my attempts succeeded.
How can I convert string to date type on a column with different formats?
You can require that the yyyy and mm are present, but make the dd optional. Break each into its own capture group, filter out the empty group if the dd is missing, then join using '-' delimiters.
>>> import re
>>> s = '19500102 195001'
>>> ['-'.join(filter(None, i)) for i in re.findall(r'(\d{4})(\d{2})(\d{2})?', s)]
['1950-01-02', '1950-01']
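The same regex logic can be wrapped in a small helper for single values (a plain-Python sketch; `normalize_date` is a name introduced here for illustration, not part of the original answer):

```python
import re

def normalize_date(s):
    """Split a yyyyMMdd or yyyyMM string into hyphen-joined parts.

    The dd capture group is optional; filter(None, ...) drops it
    when the input has no day component.
    """
    m = re.fullmatch(r'(\d{4})(\d{2})(\d{2})?', s)
    if m is None:
        return None
    return '-'.join(filter(None, m.groups()))

print(normalize_date('19500102'))  # -> 1950-01-02
print(normalize_date('195001'))    # -> 1950-01
```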
Update 2019-06-24
You can try each of the valid date formats and use pyspark.sql.functions.coalesce to return the first non-null result.
import pyspark.sql.functions as f

def date_from_string(date_str, fmt):
    try:
        # For spark version 2.2 and above, to_date takes in a second argument
        return f.to_date(date_str, fmt).cast("date")
    except TypeError:
        # For spark version 2.1 and below, you'll have to do it this way
        return f.from_unixtime(f.unix_timestamp(date_str, fmt)).cast("date")

possible_date_formats = ["yyyyMMdd", "yyyyMM"]

df = df.withColumn(
    "open_date",
    f.coalesce(*[date_from_string("open_date", fmt) for fmt in possible_date_formats])
)
df.show()
#+----------+
#| open_date|
#+----------+
#|1950-01-02|
#|1950-01-01|
#+----------+
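The coalesce-over-formats idea can be checked without Spark: try each format with `datetime.strptime` and take the first parse that succeeds. A plain-Python sketch of the same logic (note that Spark's `yyyyMMdd` pattern corresponds to Python's `%Y%m%d`, since Spark uses Java SimpleDateFormat syntax rather than strftime):

```python
from datetime import datetime

def parse_first_match(date_str, formats):
    """Return the first successful parse as a date, or None (coalesce-like)."""
    for fmt in formats:
        try:
            return datetime.strptime(date_str, fmt).date()
        except ValueError:
            continue
    return None

formats = ["%Y%m%d", "%Y%m"]  # Python equivalents of yyyyMMdd and yyyyMM
print(parse_first_match("19500102", formats))  # -> 1950-01-02
print(parse_first_match("195001", formats))    # -> 1950-01-01
```

As in the Spark version, format order matters only for ambiguous inputs; here the 8-digit pattern simply fails on 6-digit strings and the next format is tried.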
Original Answer
If you're guaranteed to only have strings that are 6 or 8 characters in length, the simplest thing would be to append "01" to the end of the short strings to specify the first of the month.
Here is an example using pyspark.sql.functions.length() and pyspark.sql.functions.concat():
import pyspark.sql.functions as f

df = df.withColumn(
    'open_date',
    f.when(
        f.length(f.col('open_date')) == 6,
        f.concat(f.col('open_date'), f.lit('01'))
    ).otherwise(f.col('open_date'))
)
df.show()
#+---------+
#|open_date|
#+---------+
#| 19500102|
#| 19500101|
#+---------+
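The padding rule itself is trivial to verify in plain Python (a sketch of the same rule, assuming only 6- or 8-character inputs as the answer states; `pad_to_yyyymmdd` is a name introduced here for illustration):

```python
def pad_to_yyyymmdd(s):
    """Append '01' (first of the month) when only yyyyMM is present."""
    return s + "01" if len(s) == 6 else s

print(pad_to_yyyymmdd("195001"))    # -> 19500101
print(pad_to_yyyymmdd("19500102"))  # -> 19500102
```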
Then use the techniques described in this post (paraphrased below) to convert to a date.
For Spark 2.1 and below:
df = df.withColumn('open_date', f.from_unixtime(f.unix_timestamp('open_date', 'yyyyMMdd')))
For Spark 2.2+:
df = df.withColumn('open_date', f.to_date('open_date', 'yyyyMMdd'))
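Putting the two steps (pad, then parse) together outside Spark gives a quick sanity check of the logic, using Python's `%Y%m%d` in place of Spark's `yyyyMMdd`:

```python
from datetime import datetime

values = ["19500102", "195001"]
padded = [v + "01" if len(v) == 6 else v for v in values]
dates = [datetime.strptime(v, "%Y%m%d").date() for v in padded]
print([d.isoformat() for d in dates])  # -> ['1950-01-02', '1950-01-01']
```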