

Comparing Cassandra's CQL vs Spark/Shark queries vs Hive/Hadoop (DSE version)

Asked 2013-06-14 17:18:44. Tags: cassandra, hive, cql, apache-spark, shark-sql

I would like to hear your thoughts and experiences on using CQL and the in-memory query engine Spark/Shark. As far as I know, the CQL processor runs inside the Cassandra JVM on each node, whereas a Shark/Spark query processor attached to a Cassandra cluster runs outside it, in a separate cluster. DataStax also offers the DSE version of Cassandra, which allows deploying Hadoop/Hive. The question is: in which use cases would we pick one of these solutions over the others?

2 answers

I will share a few thoughts based on my experience. If possible, though, please tell us about your use case; it will help us answer your question better.

1- If you are going to have more writes than reads, Cassandra is obviously a good choice. That said, if you come from a SQL background and plan to use Cassandra, you will definitely find CQL very helpful. But CQL is not the answer if you need operations like JOIN and GROUP BY, even though it covers primitive GROUP BY use cases through write-time and compaction-time sorting and can implement one-to-many relationships.
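To make the "write-time sort" and one-to-many point concrete, here is a minimal Python sketch. It is a toy model, not Cassandra or any driver API: the `partitions` dictionary, `write`, and `read_partition` are hypothetical names, and the imagined table has `user_id` as partition key and `event_time` as clustering column.

```python
from bisect import insort
from collections import defaultdict

# Toy stand-in for a table keyed by (user_id, event_time):
# user_id plays the partition key, event_time the clustering column.
partitions = defaultdict(list)

def write(user_id, event_time, payload):
    # Keep each partition ordered by the clustering column as rows arrive.
    insort(partitions[user_id], (event_time, payload))

def read_partition(user_id):
    # One-to-many lookup: every row for a single key, already sorted.
    return list(partitions[user_id])

write("alice", 3, "logout")
write("bob", 1, "login")
write("alice", 1, "login")
write("alice", 2, "click")

alice_events = read_partition("alice")
```

Reading one key returns its rows pre-grouped and pre-sorted, which is the "primitive GROUP BY" case. Aggregating across keys or joining two such tables has no counterpart in this model, which is the gap described above.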

2- Spark SQL (formerly Shark) is very fast for two reasons: in-memory processing and planned data pipelines. In-memory processing makes it roughly 100x faster than Hive. Like Hive, Spark SQL handles larger-than-memory data very well, and it is up to 10x faster thanks to planned pipelines. The advantage shifts further toward Spark SQL when multiple pipeline stages such as filter and groupBy are present. Go for it when you need ad hoc, real-time querying. It is not suitable for long-running jobs over gigantic amounts of data.
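The idea of a planned pipeline can be illustrated with a small Python sketch. This is not Spark's API; `scan`, `keep_large`, and `sum_by` are hypothetical helpers and the rows are made-up sample data. The point is only that a filter stage can feed a groupBy stage row by row, without materializing the intermediate result between stages.

```python
from collections import defaultdict

rows = [
    {"user": "alice", "country": "DE", "bytes": 120},
    {"user": "bob", "country": "US", "bytes": 300},
    {"user": "carol", "country": "DE", "bytes": 80},
    {"user": "dave", "country": "US", "bytes": 50},
]

def scan(source):
    # Lazily yield rows one at a time instead of loading a whole stage.
    for row in source:
        yield row

def keep_large(stream, min_bytes):
    # Filter stage: passes matching rows straight to the next stage.
    return (row for row in stream if row["bytes"] >= min_bytes)

def sum_by(stream, key, value):
    # groupBy stage: aggregates as rows arrive, in a single pass.
    totals = defaultdict(int)
    for row in stream:
        totals[row[key]] += row[value]
    return dict(totals)

totals = sum_by(keep_large(scan(rows), 80), "country", "bytes")
```

A batch engine that writes each stage's output to disk before starting the next pays for that intermediate step; chaining the stages avoids it, which is one way to read the "planned pipelines" advantage.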

3- Hive is basically a data warehouse that runs on top of your existing Hadoop cluster and gives you a SQL-like interface to your data. Hive is not suitable for real-time needs; it is best suited to offline batch processing. It needs no additional infrastructure, since it uses the underlying HDFS for data storage. Go for it when you have to perform operations like JOIN and GROUP BY on large datasets, and for OLAP.
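For reference, this is the shape of a JOIN plus GROUP BY query. The sketch runs it with Python's built-in `sqlite3` purely as a stand-in so it can be executed anywhere; it does not involve Hive or Hadoop, and the `users` and `orders` tables and their contents are invented for illustration. The SELECT uses only basic SQL constructs, but treat it as a shape rather than a tested HiveQL script.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a warehouse table store.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER, country TEXT);
    CREATE TABLE orders (user_id INTEGER, amount INTEGER);
    INSERT INTO users VALUES (1, 'DE'), (2, 'US'), (3, 'DE');
    INSERT INTO orders VALUES (1, 10), (1, 20), (2, 5), (3, 7);
""")

# Join orders to users, then aggregate order amounts per country.
query = """
    SELECT u.country, SUM(o.amount) AS total
    FROM orders o
    JOIN users u ON u.user_id = o.user_id
    GROUP BY u.country
    ORDER BY u.country
"""
result = conn.execute(query).fetchall()
conn.close()
```

This is the kind of query that CQL does not express directly and that a batch engine such as Hive is built to run over large datasets.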

Note: Spark SQL emulates Apache Hive behavior on top of Spark, so it supports virtually all Hive features while being potentially faster. It supports the existing Hive query language, Hive data formats (SerDes), user-defined functions (UDFs), and queries that call external scripts.

But I think you will only be able to evaluate the pros and cons of all these tools properly after getting your hands dirty. I can only make suggestions based on your question.

Hope this answers some of your questions.

PS: The above answer is based solely on my experience. Comments/corrections are welcome.

The benchmarks documented here are very good: https://amplab.cs.berkeley.edu/benchmark/