
AWS Athena misinterpreting timestamp column

I'm processing CSV files in an AWS Lambda function, using Pandas to write Parquet files and saving them to an S3 bucket to query with Athena. The raw input to the Lambda function is CSV with a Unix timestamp in milliseconds (UTC), which looks like this:

Timestamp,DeviceName,DeviceUUID,SignalName,SignalValueRaw,SignalValueScaled,SignalType,Valid
1605074410110,F2016B1E.CAP.0 - 41840982B40192,323da038-bb49-4f3a-a045-925194364e5b,X.ALM.FLG,0,0,INTEGER,true

I parse the Timestamp like this:

df['Timestamp'] = pd.to_datetime(df['Timestamp'], unit='ms')
df.head()

    Timestamp               DeviceName                      DeviceUUID                          SignalName  SignalValueRaw  SignalValueScaled   SignalType  SubstationId    StationBankId   FeederId    year    month   day hour    DeviceNameClean DeviceType
0   2020-11-11 06:00:10.110 F2016B2W.MLR.0 - 41841005000073 3c4839b1-ab99-4164-b415-4653948360ef    CVR_X_ENGAGED_A 0   0   BOOLEAN Kenton  FR2016B2    F2016B2W    2020    11  11  6   MLR.0 - 41841005000073  MLR
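As a cross-check on the parsing step, the same conversion can be reproduced with only the standard library. This is a minimal sketch: the helper name epoch_ms_to_utc is made up for illustration, and the input is the Timestamp value from the sample CSV row above.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_ms_to_utc(ms: int) -> datetime:
    """Convert integer milliseconds since the Unix epoch to an aware UTC datetime.

    Hypothetical helper for illustration; not part of the original script.
    """
    # Integer timedelta arithmetic avoids float rounding of the millisecond part.
    return EPOCH + timedelta(milliseconds=ms)


sample = 1605074410110  # Timestamp value from the sample CSV row
print(epoch_ms_to_utc(sample).isoformat(timespec="milliseconds"))
# 2020-11-11T06:00:10.110+00:00
```

The result agrees with the 2020-11-11 06:00:10.110 value that pd.to_datetime(..., unit='ms') shows in df.head(), which suggests the parsing itself is not where the value goes wrong.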

I process the data further in the Lambda function, then output a Parquet file. I then run a Glue crawler against the Parquet files that this script outputs, and in S3 I can query the data fine:

2020-11-14T05:00:43.609Z,02703ee8-b08a-4c49-9581-706f905aa192,FR22607.REG.0,REG,REG.0,ROSS,FR22607,,0,0,0,0,0,0,0,0,,0.0,,,,0.0,,,,1.0,,

The Glue crawler correctly identifies the column as a timestamp:

CREATE EXTERNAL TABLE `cvr_event_log`(
  `timestamp` timestamp, 
  `deviceuuid` string, 
  `devicename` string, 
  `devicetype` string, 
...

But when I then query the table in Athena, I get this for the date:

"timestamp","deviceuuid","devicename","devicetype",
"+52840-11-19 16:56:55.000","0ca4ed37-930d-4778-b3a8-f49d9b498364","FR22606.REG.0","REG",

What has Athena so confused about the timestamp?

For a TIMESTAMP column to work in Athena you need to use a specific format, which unfortunately is not ISO 8601. It looks like this: "2020-11-14 20:33:42".
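If the timestamps end up as text at some point in the pipeline, one option is to rewrite them into that layout before saving. The following is only a sketch using the standard library: the function name iso8601_to_athena is hypothetical, the input is the ISO 8601 value shown in the question, and keeping three fractional digits assumes that millisecond precision is wanted.

```python
from datetime import datetime, timezone


def iso8601_to_athena(ts: str) -> str:
    """Rewrite an ISO 8601 timestamp as 'YYYY-MM-DD HH:MM:SS.mmm' in UTC.

    Hypothetical helper for illustration only.
    """
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11,
    # so spell the UTC offset out explicitly.
    parsed = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: values without an offset are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    utc = parsed.astimezone(timezone.utc)
    # Space instead of "T", no offset suffix, milliseconds kept.
    return utc.strftime("%Y-%m-%d %H:%M:%S") + f".{utc.microsecond // 1000:03d}"


print(iso8601_to_athena("2020-11-14T05:00:43.609Z"))
# 2020-11-14 05:00:43.609
```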

You can use from_iso8601_timestamp(ts) to parse ISO 8601 timestamps in queries.

Glue crawlers sadly misinterpret things quite often and create tables that don't work properly with Athena.
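The year in the question's Athena result can also be checked with some rough arithmetic. The sketch below assumes a hypothetical mix-up in which an epoch value stored in microseconds is read as milliseconds, and it uses the 2020-11-14T05:00:43.609Z value from the question as input. It does not show where in the pipeline such a mismatch would happen, only that a factor-of-1000 error lands on the same year.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
SECONDS_PER_YEAR = 365.2425 * 24 * 60 * 60  # mean Gregorian year


def approx_year(epoch_seconds: float) -> int:
    """Rough calendar year for a count of seconds since 1970 (ignores leap rules)."""
    return 1970 + int(epoch_seconds // SECONDS_PER_YEAR)


# The timestamp shown in the question, as exact integer milliseconds.
moment = datetime(2020, 11, 14, 5, 0, 43, 609000, tzinfo=timezone.utc)
epoch_ms = (moment - EPOCH) // timedelta(milliseconds=1)

correct_seconds = epoch_ms / 1000
# Assumed mix-up: the stored number is microseconds (epoch_ms * 1000),
# but the reader divides by 1000 as if it were milliseconds.
misread_seconds = (epoch_ms * 1000) / 1000

print(approx_year(correct_seconds))  # 2020
print(approx_year(misread_seconds))  # 52840
```

The second value matches the +52840 year in the query output, so it may be worth checking which time unit the Parquet file actually uses for the column.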
