
How to convert array of string to date in pyspark?

This is the dataframe schema structure.

root
 |-- validFrom: array (nullable = true)
 |    |-- element: string (containsNull = true)

This is the pyspark query where I try to get the date datatype for validFrom.

df_02 = spark.sql("""
 select to_date(validFrom, 'yyyy-MM-dd') as validFrom
 from   v_source_df_flatten
""")

But I receive the following error.

due to data type mismatch: argument 1 requires (string or date or timestamp) type, however, 'v_starhist_df_flatten.validFrom' is of array type.

What do I have to change in the to_date function?

You will have to apply to_date on each element of the array using transform.

query = """
SELECT transform(validFrom, x -> to_date(x, 'yyyy-MM-dd')) as validFrom
FROM v_source_df_flatten
"""

df_02 = spark.sql(query)

df_02.printSchema()

"""
root
 |-- validFrom: array (nullable = true)
 |    |-- element: date (containsNull = true)
"""

df_02.show(truncate=False)

"""
+------------------------+
|validFrom               |
+------------------------+
|[2021-10-10, 2022-10-10]|
|[2021-01-01, 2022-01-01]|
+------------------------+
"""

You can't convert an array of strings directly into DateType. The to_date function expects a string date.

If you have only one date per array, then you can simply access the first element of the array and convert it to a date like this:

spark.createDataFrame(
    [(["2022-01-01"],), (["2022-01-02"],)], ["validFrom"]
).createOrReplaceTempView("v_source_df_flatten")

df_02 = spark.sql("select to_date(validFrom[0]) as validFrom from v_source_df_flatten")

df_02.printSchema()
#root
# |-- validFrom: date (nullable = true)

df_02.show()
#+----------+
#| validFrom|
#+----------+
#|2022-01-01|
#|2022-01-02|
#+----------+

Notice that you can also use a simple cast, as your dates have the default pattern yyyy-MM-dd:

cast(validFrom[0] as date) as validFrom

However, if your intent is to convert an array of strings into an array of dates, then you can use a cast in this particular case:

df_02 = spark.sql("select cast(validFrom as array<date>) as validFrom from v_source_df_flatten")

df_02.printSchema()
#root
# |-- validFrom: array (nullable = true)
# |    |-- element: date (containsNull = true)

df_02.show()
#+------------+
#|   validFrom|
#+------------+
#|[2022-01-01]|
#|[2022-01-02]|
#+------------+

For a date pattern different from yyyy-MM-dd, you'll have to use transform on the array and apply to_date to each element, as shown in the other answer.
