PySpark: removing rows with an invalid datetime format in a column

Question

My datetime field has the format 2016-10-15 00:00:00. After using schema inference while saving my data to a Parquet file, I found a few rows that don't comply with this format.

How can I remove all of them at once in PySpark?

They are causing problems in my UDFs.

Answer 1

Assuming you're parsing the date column and rows with invalid dates come out as null, which is usually the case:

from pyspark.sql.functions import col
df.filter(col('date').isNotNull())

Alternatively, if your date is read as a string, you can parse it using unix_timestamp and then drop the rows where parsing produced null:

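from pyspark.sql.functions import col, unix_timestamp

(
    df
    # Parse the string column; this select keeps only the parsed 'date' column
    .select(unix_timestamp('date', 'yyyy-MM-dd HH:mm:ss').cast("timestamp").alias('date'))
    # Drop the rows whose date could not be parsed
    .filter(col('date').isNotNull())
)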
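
Since the question mentions UDFs failing on the malformed rows, here is a minimal pure-Python sketch of the same validity check, which could also serve as a guard inside a UDF. The names `EXPECTED_FORMAT`, `parse_datetime_or_none` and `is_valid_datetime` are hypothetical and not part of the original answer; the format string is assumed to be the Python equivalent of the 'yyyy-MM-dd HH:mm:ss' pattern used above.

```python
from datetime import datetime
from typing import Optional

# Assumed Python equivalent of the Java-style pattern 'yyyy-MM-dd HH:mm:ss'.
EXPECTED_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_datetime_or_none(value: Optional[str],
                           fmt: str = EXPECTED_FORMAT) -> Optional[datetime]:
    """Return the parsed datetime, or None if the value is missing or malformed."""
    if value is None:
        return None
    try:
        return datetime.strptime(value, fmt)
    except (TypeError, ValueError):
        # TypeError: value is not a string; ValueError: it does not match fmt.
        return None


def is_valid_datetime(value: Optional[str], fmt: str = EXPECTED_FORMAT) -> bool:
    """True only when the value parses with the expected format."""
    return parse_datetime_or_none(value, fmt) is not None


# Illustrative values only; not data from the question.
rows = ["2016-10-15 00:00:00", "15/10/2016", "not a date", None]
valid_rows = [r for r in rows if is_valid_datetime(r)]
print(valid_rows)
```

In Spark, a helper like this could be wrapped with pyspark.sql.functions.udf and a BooleanType return type, though the built-in column functions shown in the answer generally perform better than Python UDFs.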

Answer 1
0 2017-01-02 22:31:21
