
ISO date field converts automatically while reading in PySpark from MONGO

I am trying to read a date field from a MongoDB collection into a PySpark df. The date field has an ISO format when viewed in MongoDB, but it gets converted into a different type after it is read into Spark.

In Mongo the date looks like this:

 ISODate("2012-07-14T01:00:00+01:00")

 df =  (sqlContext.read.format("com.xyz.datasource.mongodb").options(host="mongo:XXX",database="foo", collection="bar").load())
 df.show()

After loading, my date column gets converted as below:

{ "$date": 1.62345674343 }

I understand this got converted to an epoch value, and I have a UDF which converts it to a human-readable timestamp, but why does this happen? Is there a fix that avoids the UDF altogether (I would prefer not to apply a UDF on the columns)?

I have multiple createdAt fields which I would like to convert.

[schema screenshot]

Spark's timestamp cast expects epoch seconds, not milliseconds, and data coming out of Mongo sometimes carries epoch milliseconds. In that case you would divide the epoch-milliseconds value by a thousand before casting.

from pyspark.sql.functions import col  # col() is needed for the column expression below

df = sqlContext.read.format(
        "com.xyz.datasource.mongodb"
    ).options(
        host="mongo:XXX",
        database="foo",
        collection="bar"
    ).load().withColumn(
        'date',
        # cast the epoch-seconds value straight to a Spark timestamp; no UDF needed
        col('exposure.knownEmployeesExposed.latestIncidentReport.createdAt').cast('timestamp')
    )
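
If the connector hands back the raw { "$date": ... } struct and the value is in epoch milliseconds, a minimal sketch of the same idea might look like the following (the field names createdAt and updatedAt are placeholders for your own columns, not taken from the original schema):

from pyspark.sql.functions import col

# Hypothetical list of top-level date columns that arrive as {"$date": <epoch millis>}
epoch_millis_fields = ["createdAt", "updatedAt"]

for field in epoch_millis_fields:
    df = df.withColumn(
        field,
        # pull the nested "$date" value out of the struct, convert millis -> seconds,
        # then cast to a proper Spark timestamp (still no UDF involved)
        (col(field).getField("$date") / 1000).cast("timestamp")
    )

Whether the value is seconds or milliseconds is easiest to check by its magnitude: recent dates are roughly 1.6e9 as epoch seconds and 1.6e12 as epoch milliseconds.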
