
Improve Apache Hive performance

I have 5 GB of data in my HDFS sink. When I run any query in Hive it takes 10-15 minutes to complete. The number of rows I get when I run

select count(*) from table_name

is 3,880,900. My VM has 4.5 GB of memory and runs on a 2012 MacBook Pro. I would like to know whether creating an index on the table will improve performance. Also, are there other ways to tell Hive to use only a limited amount of data or rows, so that results come back faster? I am fine with queries running against a smaller subset of the data, at least to get a glimpse of the results.

Yes, indexing should help. However, getting a subset of the data with LIMIT isn't really helpful, since Hive still scans the whole dataset before limiting the output.
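As a sketch (assuming Hive 0.10+; `event_date` is a hypothetical frequently-filtered column, and note that Hive's index feature was removed in Hive 3.0), an index looks like this, and block sampling with TABLESAMPLE reads a genuine fraction of the input splits instead of scanning everything and then limiting:

```sql
-- event_date is a hypothetical column; compact indexes must be rebuilt after creation
CREATE INDEX idx_event_date ON TABLE table_name (event_date)
AS 'COMPACT' WITH DEFERRED REBUILD;
ALTER INDEX idx_event_date ON table_name REBUILD;

-- Unlike LIMIT, block sampling only reads ~1% of the input splits
SELECT count(*) FROM table_name TABLESAMPLE(1 PERCENT) s;
```

Block sampling is split-based, so the returned fraction is approximate, but the scan cost really does drop with the percentage.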

You can try the RCFile/ORCFile formats for faster results. In my experiments, RCFile-based tables executed queries roughly 10 times faster than textfile/sequence-file-based tables.
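For example (a sketch; `table_name_orc` is a hypothetical name, and this assumes Hive 0.11+ where CTAS with `STORED AS ORC` is supported), converting an existing text-format table to ORC is a one-time copy:

```sql
-- Create an ORC copy of the table; subsequent queries go against table_name_orc
CREATE TABLE table_name_orc STORED AS ORC
AS SELECT * FROM table_name;
```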

Depending on the data you are querying, you can see gains from different file formats such as ORC or Parquet. What kind of data are you querying: is it structured or unstructured? What kind of queries are you trying to perform? If it is structured data, you can also see gains from other SQL-on-Hadoop solutions such as InfiniDB, Presto, Impala, etc.

I am an architect for InfiniDB (http://infinidb.co). SQL-on-Hadoop solutions like InfiniDB, Impala and others work by having you load your data through them, at which point they perform calculations, optimizations, etc. to make that data faster to query. This helps tremendously for interactive analytical queries, especially compared to something like Hive.

With that said, you are working with 5 GB of data (but data always grows! Someday it could be TBs), which is pretty small, so you can still work with some of the tools that are not intended for high-performance queries. Your best option with Hive is to look at how your data is laid out and see whether ORC or Parquet could benefit your queries (columnar formats are good for analytic queries).

Hive is always going to be one of the slower options for performing SQL queries on your HDFS data, though. Hortonworks, with their Stinger initiative, is making it better; you might want to check that out:
http://hortonworks.com/labs/stinger/
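One concrete piece of the Stinger work is the Tez execution engine. Assuming Hive 0.13+ with Tez installed on the cluster, it can be enabled per session:

```sql
-- Switch from MapReduce to Tez for this session
SET hive.execution.engine=tez;
SELECT count(*) FROM table_name;
```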

The use case sounds like a fit for ORC or Parquet if you are interested in a subset of the columns. ORC with Hive 0.12 comes with predicate pushdown (PPD), which lets queries discard whole blocks at read time using the metadata ORC stores for each column.
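Predicate pushdown into the ORC reader is controlled by session settings; a sketch assuming Hive 0.12+ (`some_column` is a hypothetical column):

```sql
SET hive.optimize.ppd=true;
-- Push predicates into the ORC reader so stripes/row groups whose
-- column min/max metadata can't match are skipped entirely
SET hive.optimize.index.filter=true;

SELECT count(*) FROM table_name WHERE some_column > 100;
```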

We did an implementation on top of Hive to support Bloom filters in the metadata indexes for ORC files, which gave a performance gain of 5-6x.

What is the average number of mapper/reducer tasks launched for the queries you execute? Tuning some parameters can definitely help.
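For instance (the values here are illustrative, not recommendations; the right numbers depend on your cluster), the data volume per reducer and the split size can be tuned per session to change how many tasks are launched:

```sql
SET hive.exec.reducers.bytes.per.reducer=134217728;  -- 128 MB per reducer instead of the old 1 GB default
SET mapred.max.split.size=67108864;                  -- smaller splits -> more mappers
SET hive.exec.parallel=true;                         -- run independent job stages in parallel
```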
