
Tables missing on filesystem in AWS Athena

I created a table with automatic partitioning in Athena using this code:

CREATE EXTERNAL TABLE IF NOT EXISTS matchdata.stattable (
  `matchResult` string,
  ...
) PARTITIONED BY (
  year int ,
  month int,
  day int
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3://******/data/year=2019/month=8/day=2/'
TBLPROPERTIES ('has_encrypted_data'='false');

I then ran MSCK REPAIR TABLE stattable, but got "Tables missing on filesystem", and queries return zero records. Running it against matchdata.stattable gives the same result.

Queries on another table without partitioning work fine. But as the service keeps running and the dataset grows, I need partitioning.

An example data path is data/2019/8/2/1SxFHaUeHfesLtPs._BjDk.gz. How can I resolve this issue?

I solved this problem by renaming the prefix of the S3 files.

You can't actually rename or move a file in S3 directly. The mv command creates a new key and deletes the existing one.

Running this command in a console moves the files to a layout whose partition locations Hive can understand:

aws s3 mv s3://***/data/2019/8/7/ s3://***/data/year=2019/month=8/day=7/ --recursive
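Before moving many prefixes, it can help to preview how each key would be renamed. The following is a minimal sketch, assuming keys follow the layout of the example path (prefix, then year, month, day, then the file name); the function name to_hive_key is hypothetical and nothing here talks to S3.

```python
import re

# Assumed key layout: <prefix>/<year>/<month>/<day>/<file>,
# as in the example path data/2019/8/2/1SxFHaUeHfesLtPs._BjDk.gz.
_DATE_KEY = re.compile(
    r"^(?P<prefix>.*?)/(?P<year>\d{4})/(?P<month>\d{1,2})/(?P<day>\d{1,2})/(?P<name>[^/]+)$"
)


def to_hive_key(key):
    """Return the Hive-style key for a date-layout key, or None if it does not match."""
    match = _DATE_KEY.match(key)
    if match is None:
        return None
    return "{prefix}/year={year}/month={month}/day={day}/{name}".format(
        **match.groupdict()
    )


if __name__ == "__main__":
    print(to_hive_key("data/2019/8/2/1SxFHaUeHfesLtPs._BjDk.gz"))
```

Keys that are already in Hive style do not match the pattern, so the function returns None for them and they can be skipped.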

As you've discovered (but with some more context for people having the same issue), MSCK REPAIR TABLE … only understands Hive-style partitioning, e.g. /data/year=2019/month=08/day=10/file.json. What the command really does is scan through the prefix on S3 corresponding to the table's LOCATION directive and look for path components of that form.
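To make the difference between the two path styles concrete, here is a rough sketch of the kind of check described above. It is an illustration only, not Athena's actual implementation; the function name parse_hive_partitions and the table prefix "data/" are assumptions.

```python
def parse_hive_partitions(key, table_prefix):
    """Return the name=value path components found in `key` below `table_prefix`.

    Illustrative only: this mimics the idea of looking for Hive-style
    components, not the real behaviour of MSCK REPAIR TABLE.
    """
    if not key.startswith(table_prefix):
        return {}
    relative = key[len(table_prefix):]
    directories = relative.split("/")[:-1]  # drop the file name
    partitions = {}
    for component in directories:
        name, separator, value = component.partition("=")
        if separator and name and value:
            partitions[name] = value
    return partitions


if __name__ == "__main__":
    # Hive-style path: partition columns are recognisable.
    print(parse_hive_partitions("data/year=2019/month=08/day=10/file.json", "data/"))
    # Plain date path from the question: nothing to recognise.
    print(parse_hive_partitions("data/2019/8/2/1SxFHaUeHfesLtPs._BjDk.gz", "data/"))
```

The second call finds no partition components, which matches the symptom in the question: the repair command has nothing it can map to the year, month and day columns.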

This is just a limitation of MSCK REPAIR TABLE …; you can manually add partitions with other path styles like this:

ALTER TABLE the_table ADD PARTITION (year = '2019', month = '08', day = '10') LOCATION 's3://some-bucket/data/2019/08/10/'

Also see https://docs.aws.amazon.com/athena/latest/ug/alter-table-add-partition.html
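If new data arrives every day, the statement can be generated rather than typed by hand. This is a minimal sketch that only builds the SQL string; the function name add_partition_sql, the bucket name and the unpadded data/<year>/<month>/<day>/ layout are assumptions taken from the example path in the question, and IF NOT EXISTS is added so that re-running the statement is harmless.

```python
from datetime import date


def add_partition_sql(table, bucket, day):
    """Build an ALTER TABLE ... ADD PARTITION statement for one day of data.

    Assumes data lives under s3://<bucket>/data/<year>/<month>/<day>/
    with unpadded month and day, as in the question's example path.
    """
    location = "s3://{}/data/{}/{}/{}/".format(bucket, day.year, day.month, day.day)
    return (
        "ALTER TABLE {} ADD IF NOT EXISTS "
        "PARTITION (year = '{}', month = '{}', day = '{}') "
        "LOCATION '{}'"
    ).format(table, day.year, day.month, day.day, location)


if __name__ == "__main__":
    print(add_partition_sql("stattable", "some-bucket", date(2019, 8, 2)))
```

The resulting string can be submitted to Athena the same way as any other query, for example from the console or from whatever job writes the data to S3.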

I would go so far as to say that you should avoid using MSCK REPAIR TABLE … altogether. It's slow, and only gets slower the more partitions you have. It's much more efficient to run ALTER TABLE … ADD PARTITION … when you add new data on S3, because you know what you just added and where it is, so telling Athena to scan through your whole prefix is unnecessary. Even faster is using the Glue API directly, but that's more code, unfortunately.
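For the Glue route, a sketch might look like the following. It assumes the glue argument is a boto3 Glue client (boto3.client("glue")) and uses its get_table and batch_create_partition calls; the helper names build_partition_input and register_partitions are hypothetical, and each partition simply reuses the table's storage descriptor with a different Location.

```python
import copy

# Assumption: batch_create_partition accepts at most 100 partitions per call.
BATCH_SIZE = 100


def build_partition_input(table_storage_descriptor, values, location):
    """Build one PartitionInput, copying the table's storage settings."""
    descriptor = copy.deepcopy(table_storage_descriptor)
    descriptor["Location"] = location
    return {"Values": [str(v) for v in values], "StorageDescriptor": descriptor}


def register_partitions(glue, database, table, partitions):
    """Register (values, location) pairs in the Glue catalog.

    `glue` is assumed to be boto3.client("glue"). Returns the error entries
    reported by Glue, e.g. for partitions that already exist.
    """
    table_sd = glue.get_table(DatabaseName=database, Name=table)["Table"][
        "StorageDescriptor"
    ]
    inputs = [build_partition_input(table_sd, v, loc) for v, loc in partitions]
    errors = []
    for start in range(0, len(inputs), BATCH_SIZE):
        response = glue.batch_create_partition(
            DatabaseName=database,
            TableName=table,
            PartitionInputList=inputs[start:start + BATCH_SIZE],
        )
        errors.extend(response.get("Errors", []))
    return errors
```

A call for the question's table could look like register_partitions(glue, "matchdata", "stattable", [((2019, 8, 2), "s3://some-bucket/data/2019/8/2/")]), where the bucket name is a placeholder.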
