
ATHENA CREATE TABLE AS problem with parquet format

I'm creating a table in Athena and specifying the format as PARQUET, but the file extension is not being recognized in S3. The type is displayed as "-", which means the file extension is not recognized, even though I can successfully read the files (written from Athena) in a Glue job using:

df = spark.read.parquet()

Here is my statement:

CREATE EXTERNAL TABLE IF NOT EXISTS test (
    numeric_field INT,
    numeric_field2 INT
)
STORED AS PARQUET
LOCATION 's3://xxxxxxxxx/TEST TABLE/'
TBLPROPERTIES ('classification'='PARQUET');
    
INSERT INTO test
VALUES (10,10),(20,20);

I'm specifying the format as PARQUET, but when I check the S3 bucket the file type is displayed as "-". Also, when I check the Glue catalog, the table type is set to 'unknown'.

[Screenshot: S3 storage listing]

I expected the type to be recognized as "parquet" in the S3 bucket.

After contacting AWS Support, it was confirmed that Athena does not add file extensions to the Parquet files it writes with CTAS queries. "Further to confirm this, I do see the Knowledge Center article [1] where CTAS generates the Parquet files without extension (under section 'Convert the data format and set the approximate file size', point 5)."

However, the files written from Athena are readable even without the extension.
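One way to see why the extension does not matter to readers: the Parquet format marks a file with the 4-byte magic string "PAR1" at both the start and the end, so tools can identify it from its contents rather than its name. The sketch below checks only those magic bytes on a locally downloaded copy; the function name looks_like_parquet is made up for illustration, and passing this check does not prove the file is a fully valid Parquet file.

```python
import os

# Magic bytes that open and close every (unencrypted) Parquet file.
PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Return True if the file at `path` starts and ends with the Parquet magic bytes.

    Hypothetical helper for illustration; it ignores the file name entirely.
    """
    # A file too small to hold both markers cannot be Parquet.
    if os.path.getsize(path) < 2 * len(PARQUET_MAGIC):
        return False
    with open(path, "rb") as f:
        head = f.read(len(PARQUET_MAGIC))
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = f.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

Running this against one of the extensionless files downloaded from the table location should return True if Athena wrote it as Parquet.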

Reference: [1] https://aws.amazon.com/premiumsupport/knowledge-center/set-file-number-size-ctas-athena/

Workaround: I created a function to change the file extension. It iterates over the files in the S3 bucket and writes the contents back to the same location with a .parquet file extension.
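The original function was not posted, so the following is only a minimal sketch of what such a workaround might look like. It assumes a boto3 S3 client is passed in as s3_client, and the function name add_parquet_extension and its parameters are made up for illustration. Note one substitution: instead of downloading and re-uploading the contents, it uses a server-side copy_object followed by delete_object, which gives the same result for objects up to the 5 GB single-copy limit.

```python
def add_parquet_extension(s3_client, bucket, prefix, suffix=".parquet"):
    """Append `suffix` to every object key under `prefix` that lacks it.

    Hypothetical helper: `s3_client` is assumed to be a boto3 S3 client,
    e.g. boto3.client("s3"). Returns a list of (old_key, new_key) pairs.
    """
    # Collect all keys first so the listing is not affected by the renames.
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])

    renamed = []
    for key in keys:
        # Skip folder placeholder objects and files that already have the suffix.
        if key.endswith("/") or key.endswith(suffix):
            continue
        new_key = key + suffix
        # S3 has no rename operation: copy to the new key, then delete the old one.
        s3_client.copy_object(
            Bucket=bucket,
            Key=new_key,
            CopySource={"Bucket": bucket, "Key": key},
        )
        s3_client.delete_object(Bucket=bucket, Key=key)
        renamed.append((key, new_key))
    return renamed
```

With boto3 installed, a call could look like add_parquet_extension(boto3.client("s3"), "xxxxxxxxx", "TEST TABLE/"). Files written by later INSERT INTO statements would again lack the extension, so the function would need to be rerun after each write.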
