
ISO date field converts automatically while reading in PySpark from MONGO

I am trying to read a date field from a MongoDB collection into a PySpark df. The date field has an ISO format when viewed in MongoDB, but it gets converted into a different type after it is read into Spark.

In Mongo the date looks like this:

 ISODate("2012-07-14T01:00:00+01:00")

 df =  (sqlContext.read.format("com.xyz.datasource.mongodb").options(host="mongo:XXX",database="foo", collection="bar").load())
 df.show()

After loading, my date column gets converted as below:

{ "$date": 1.62345674343 }

I understand this got converted to an epoch value, and I have a UDF which converts it to a human-readable timestamp, but why does this happen? Is there a fix that avoids the UDF altogether (I would prefer not to apply a UDF on the columns)?

I have multiple createdAt fields which I would like to convert.

[schema screenshot]

Spark's timestamp cast expects epoch seconds, not milliseconds, and data coming out of Mongo sometimes carries epoch milliseconds. In that case you would divide the epoch-milliseconds value by a thousand before casting.

from pyspark.sql.functions import col  # col() is needed for the column expression below

df = sqlContext.read.format(
        "com.xyz.datasource.mongodb"
    ).options(
        host="mongo:XXX",
        database="foo",
        collection="bar"
    ).load().withColumn(
        'date',
        # cast the epoch-seconds value straight to a Spark timestamp; no UDF needed
        col('exposure.knownEmployeesExposed.latestIncidentReport.createdAt').cast('timestamp')
    )
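
If the connector hands back the raw { "$date": ... } struct and the value is in epoch milliseconds, a minimal sketch of the same idea might look like the following (the field names createdAt and updatedAt are placeholders for your own columns, not taken from the original schema):

from pyspark.sql.functions import col

# Hypothetical list of top-level date columns that arrive as {"$date": <epoch millis>}
epoch_millis_fields = ["createdAt", "updatedAt"]

for field in epoch_millis_fields:
    df = df.withColumn(
        field,
        # pull the nested "$date" value out of the struct, convert millis -> seconds,
        # then cast to a proper Spark timestamp (still no UDF involved)
        (col(field).getField("$date") / 1000).cast("timestamp")
    )

Whether the value is seconds or milliseconds is easiest to check by its magnitude: recent dates are roughly 1.6e9 as epoch seconds and 1.6e12 as epoch milliseconds.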
