
Will Spark SQL completely replace Apache Impala or Apache Hive?

I need to deploy a Big Data cluster on our servers, but I only have knowledge of Apache Spark. Now I need to know whether Spark SQL can completely replace Apache Impala or Apache Hive.

I need your help. Thanks.

I would like to explain this with real-world scenarios.

In real-world production projects:

Hive is used mostly for storing data/tables and running ad-hoc queries. If an organisation's data is growing day by day and they query RDBMS-style data, then Hive is a good fit.

Impala is used for business intelligence projects where reporting is done through front-end tools like Tableau, Pentaho, etc.

And Spark is mostly used for analytics, where developers are more inclined towards statistics, as they can also use the R language with Spark to build their initial data frames.

So the answer to your question is "no": Spark will not replace Hive or Impala, because all three have their own use cases and benefits, and the ease of implementing these query engines also depends on your Hadoop cluster setup.

Here are some links which will help you understand more clearly:

http://db-engines.com/en/system/Hive%3BImpala%3BSpark+SQL

http://www.infoworld.com/article/3131058/analytics/big-data-face-off-spark-vs-impala-vs-hive-vs-presto.html

https://www.dezyre.com/article/impala-vs-hive-difference-between-sql-on-hadoop-components/180

No. Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing.

Impala - an open-source, distributed SQL query engine for Apache Hadoop.

Hive - an SQL-like interface for querying data stored in various databases and file systems that integrate with Hadoop.

Refer: Differences between Hive and Impala


Apache Spark has connectors to various data sources and does the processing over the data. Hive provides a query engine which, when integrated with Spark, helps Spark query faster.

Spark SQL can use the Hive metastore to get the metadata of the data stored in HDFS. This metadata enables Spark SQL to better optimize the queries it executes. Here Spark is the query processor.
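As a minimal sketch of this integration (it assumes a running Spark installation with a configured Hive metastore, and the `sales` table name is hypothetical), Hive support is enabled when the SparkSession is built:

```scala
import org.apache.spark.sql.SparkSession

// Enable Hive support so Spark SQL reads table metadata
// from the Hive metastore instead of its default catalog.
val spark = SparkSession.builder()
  .appName("hive-metastore-example")
  .enableHiveSupport()
  .getOrCreate()

// "sales" is a hypothetical Hive table; its schema, storage location,
// and statistics come from the metastore, which the Catalyst optimizer
// can use when planning this query. Spark does the actual processing.
val df = spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
df.show()
```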

Refer: Databricks blog

This is a good question. I think it will not. Even though Spark is faster than the other two, each of them still has its own purpose and way of working. For example, for those who are familiar with a query language, Hive and Impala will be easier to use, while Spark can use the Hive metastore for better optimization. So I think it will not completely replace them.

Apache Impala provides low-latency access to data and is generally used with front-end business intelligence applications.

Apache Hive is more suitable for batch processing where query latency isn't a concern, e.g. data processing for financial applications based on end-of-day attributes (like the value of a stock at close of business).

While Apache Spark has varied applications from streaming to machine learning, it is also being used for batch ETL processing. The enhanced dataset-based Spark SQL API available in Spark 2+ has improved components in the form of the Catalyst query optimizer and whole-stage code generation (WholeStageCodeGen). I have observed execution times on the order of 50-90% faster when some Hive scripts were translated from HiveQL to Scala on Spark.

A few challenges in moving from HiveQL to the dataset-based Spark API:

  • Lack of the sweet SQL-like syntax present in Hive.
  • Incomplete integration of the dataset API with Scala language constructs.
  • Lack of compile-time error reporting in some dataset operations.
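To give a rough feel for the syntax gap described above (a sketch that assumes a Spark 2+ session with Hive support; the `trades` table and its columns are hypothetical), here is the same aggregation written in HiveQL and in the DataFrame/Dataset API:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder()
  .appName("hiveql-vs-dataset-api")
  .enableHiveSupport()
  .getOrCreate()

// HiveQL / Spark SQL version: terse and familiar to SQL users,
// but the whole query is one string checked only at runtime.
val bySql = spark.sql(
  "SELECT symbol, SUM(volume) AS total FROM trades GROUP BY symbol")

// DataFrame/Dataset API version of the same query: more verbose,
// but it composes with ordinary Scala code, and method names are
// checked at compile time (column names given as strings are not,
// which is the compile-time-reporting gap noted above).
val byApi = spark.table("trades")
  .groupBy("symbol")
  .agg(sum("volume").as("total"))
```

Both forms go through the same Catalyst optimizer, so the choice is largely about ergonomics rather than performance.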
