Hive query on S3 partition is too slow

I have partitioned the data by date, and this is how it is stored in S3:

s3://dataset/date=2018-04-01
s3://dataset/date=2018-04-02
s3://dataset/date=2018-04-03
s3://dataset/date=2018-04-04
...

I created a Hive external table on top of this, and I am executing this query:

select count(*) from dataset where `date` ='2018-04-02' 

This partition has two Parquet files, like this:

part1 -xxxx- .snappy.parquet
part2 -xxxx- .snappy.parquet

Each file is 297 MB, so the files are not large and there are not many of them to scan.

The query returns a count of 12201724 records. However, it takes 3.5 minutes to return. Since a single partition takes this long, running even the count query over the whole dataset (7 years of data) takes hours. Is there any way I can speed this up?
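As a rough back-of-the-envelope check on the figures above, the sketch below computes the effective scan rate for the single partition and extrapolates to seven years of data. It assumes one partition per day, partitions of similar size, and partitions processed strictly one after another; none of these is stated in the question, so the result is only an order-of-magnitude estimate.

```python
# Figures taken from the question.
FILES_PER_PARTITION = 2
FILE_SIZE_MB = 297
QUERY_SECONDS = 3.5 * 60
ROWS = 12_201_724

# Assumptions (not from the question): one partition per day for 7 years,
# all partitions similar in size, scanned one after another with no parallelism.
ASSUMED_PARTITIONS = 7 * 365


def effective_throughput_mb_per_s(total_mb: float, seconds: float) -> float:
    """Return the average number of megabytes processed per second."""
    return total_mb / seconds


partition_mb = FILES_PER_PARTITION * FILE_SIZE_MB
rate = effective_throughput_mb_per_s(partition_mb, QUERY_SECONDS)
rows_per_s = ROWS / QUERY_SECONDS
sequential_hours = ASSUMED_PARTITIONS * QUERY_SECONDS / 3600

print(f"partition size: {partition_mb} MB")
print(f"effective rate: {rate:.2f} MB/s, {rows_per_s:,.0f} rows/s")
print(f"sequential full scan: {sequential_hours:.0f} hours")
```

Under these assumptions the partition is 594 MB, the effective rate is about 2.83 MB/s, and a sequential scan of all partitions would take roughly 149 hours. A rate this low suggests the time may be dominated by something other than raw data transfer, although the question does not give enough detail to confirm that.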

Amazon Athena is, effectively, a managed Presto service. It can query data stored in Amazon S3 without having to run any clusters.

It is charged based on the amount of data read from disk, so it is very cost-efficient when the data is partitioned and stored as Parquet files, since both reduce the amount of data that has to be read.
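A minimal sketch of running the same count through Athena with boto3 might look like the following. It assumes the `dataset` table and its partitions are already registered in a catalog that Athena can see; the database name and the S3 results location passed in are hypothetical placeholders, not values from the question. The reserved word `date` is double-quoted, which is how Athena escapes identifiers in SELECT statements.

```python
import time


def build_athena_request(database: str, output_location: str) -> dict:
    """Build keyword arguments for the Athena start_query_execution call."""
    return {
        "QueryString": "SELECT count(*) FROM dataset WHERE \"date\" = '2018-04-02'",
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_count(database: str, output_location: str, poll_seconds: float = 1.0):
    """Run the count query and return (row_count, bytes_scanned)."""
    import boto3  # imported here so the request builder works without boto3

    athena = boto3.client("athena")
    request = build_athena_request(database, output_location)
    query_id = athena.start_query_execution(**request)["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    # Athena bills on the data it scanned, reported in the query statistics.
    scanned = execution["QueryExecution"]["Statistics"]["DataScannedInBytes"]

    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    # Row 0 is the column header; row 1 holds the count value.
    return int(rows[1]["Data"][0]["VarCharValue"]), scanned


# Hypothetical usage (requires AWS credentials and an existing results bucket):
# count, scanned = run_count("default", "s3://example-results-bucket/athena/")
```

Returning the scanned byte count alongside the result makes it easy to see how much data each query is billed for.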

See: Analyzing Data in S3 using Amazon Athena | AWS Big Data Blog
