This is the dataframe schema structure.
root
|-- validFrom: array (nullable = true)
| |-- element: string (containsNull = true)
This is the PySpark query where I try to get the date datatype for validFrom.
df_02 = spark.sql("""
select to_date(validFrom, 'yyyy-MM-dd') as validFrom
from v_source_df_flatten
""");
But I receive the following error.
due to data type mismatch: argument 1 requires (string or date or timestamp) type, however, 'v_starhist_df_flatten.validFrom' is of array type.
What do I have to change in the to_date function?
You will have to apply to_date on each element of the array using transform.
query = """
SELECT transform(validFrom, x -> to_date(x, 'yyyy-MM-dd')) as validFrom
FROM v_source_df_flatten
"""
df_02 = spark.sql(query)
df_02.printSchema()
"""
root
|-- validFrom: array (nullable = true)
| |-- element: date (containsNull = true)
"""
df_02.show(truncate=False)
"""
+------------------------+
|validFrom |
+------------------------+
|[2021-10-10, 2022-10-10]|
|[2021-01-01, 2022-01-01]|
+------------------------+
"""
You can't pass an array of strings directly to to_date. The function expects a single string (or date or timestamp) argument, not an array.
If you have only one date per array, you can simply access the first element of the array and convert it to a date like this:
spark.createDataFrame(
    [(["2022-01-01"],), (["2022-01-02"],)], ["validFrom"]
).createOrReplaceTempView("v_source_df_flatten")
df_02 = spark.sql("select to_date(validFrom[0]) as validFrom from v_source_df_flatten")
df_02.printSchema()
#root
# |-- validFrom: date (nullable = true)
df_02.show()
#+----------+
#| validFrom|
#+----------+
#|2022-01-01|
#|2022-01-02|
#+----------+
Note that you can also use a simple cast, since your dates have the default pattern yyyy-MM-dd:
cast(validFrom[0] as date) as validFrom
However, if your intent is to convert an array of strings into an array of dates, then you can use a cast in this particular case:
df_02 = spark.sql("select cast(validFrom as array<date>) as validFrom from v_source_df_flatten")
df_02.printSchema()
#root
# |-- validFrom: array (nullable = true)
# | |-- element: date (containsNull = true)
df_02.show()
#+------------+
#| validFrom|
#+------------+
#|[2022-01-01]|
#|[2022-01-02]|
#+------------+
For a date pattern other than yyyy-MM-dd, you'll have to use transform on the array and apply to_date to each element, as shown in the other answer.