
Cassandra table analytics approaches?

I have a requirement to do real-time filtering and sorting over a relatively big partition in a C* table: ~2-3 billion rows with over a hundred columns in each. It should be possible to filter and sort over any combination of the columns. We tried Apache Solr (DataStax Enterprise 4.8) for that kind of job but ran into the following issues:

  • Solr indexes work badly with frequent and bulk data updates
  • Sometimes Solr just doesn't rebuild the indexes (we waited for hours)
  • Solr can only read with CL=ONE, so data can be inconsistent

So now we are looking for other approaches. We're trying Apache Spark 1.4 for now, but it looks like the sorting performance is not satisfying: about 1.5 min for 2 bln rows (our target is ~1-2 sec). We may be doing something wrong, as we are at the very beginning of our Spark learning. I also understand that performance may be better with more processor cores and memory.
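For reference, the workload described in the question ("filter and sort over any combination of the columns") can be sketched as a toy in-memory version. The column names and data here are invented for illustration; this is only meant to pin down the semantics being asked for, not to suggest an implementation:

```python
from operator import itemgetter

# Toy rows standing in for the wide C* partition (column names are invented).
rows = [
    {"user": "a", "score": 42, "ts": 3},
    {"user": "b", "score": 17, "ts": 1},
    {"user": "a", "score": 99, "ts": 2},
]

def filter_and_sort(rows, predicate, sort_keys, descending=False):
    """Filter by an arbitrary predicate, then sort by any combination of columns."""
    matched = [r for r in rows if predicate(r)]
    return sorted(matched, key=itemgetter(*sort_keys), reverse=descending)

# Example: rows for user "a", sorted by score descending.
result = filter_and_sort(rows, lambda r: r["user"] == "a", ["score"], descending=True)
print([r["ts"] for r in result])  # [2, 3]
```

The hard part is not the logic but doing this over billions of rows in seconds, which is exactly what the indexing/in-memory tools discussed below are for.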

Today I read about Apache Ignite with in-memory indexing. Perhaps it is a better tool for our case?

So now I'm just looking for suggestions for a tool to perform such a job.

Thanks.

ps: DataStax Enterprise 4.8, Apache Cassandra 2.1.9.791, Apache Solr 4.10.3.1.172, Apache Spark 1.4.1.1.

I think your approaches are the best you can get: either Spark (e.g. Spark SQL) or an in-memory data grid like Ignite. Both will do the same thing: push the whole dataset into memory and shuffle and dice the data. http://velvia.github.io/Subsecond-Joins-in-Spark-Cassandra/ Flink is another option to consider, but it's not really different from Spark here.

On the other hand, 2-3 billion rows should fit in a Postgres DB or something similar. Check whether that would be enough for you.

In the Hadoop world, you have Hive (slow and steady), Impala (faster and memory-heavy), or Spark again. But these won't work well with Cassandra, and I don't believe your data is big enough to justify a Hadoop environment (maintenance cost).

Sorry, but sorting 2 bln rows with over a hundred columns each in 2 seconds? I think this would be a big challenge. I mean, you have 200 bln columns (cells). The recommended maximum is 2 bln per partition key, and I think even 2 bln per partition is too much. If you want better Spark performance, you have to find the bottleneck. Can you write a bit more about your setup? How many Cassandra nodes do you have? How many Spark nodes? Hardware specs?
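A quick back-of-the-envelope check of the numbers in this answer (row and column counts are taken from the question; the 2 bln figure is Cassandra's well-known per-partition cell limit):

```python
# Back-of-the-envelope check: total cells vs. Cassandra's per-partition limit.
rows = 2_000_000_000                  # ~2 bln rows in the partition (from the question)
columns = 100                         # over a hundred columns per row
cells = rows * columns                # "columns" in Cassandra 2.x storage terms

cassandra_cell_limit = 2_000_000_000  # hard limit of ~2 bln cells per partition

print(f"total cells: {cells:,}")                      # 200,000,000,000
print(f"over the limit by: {cells // cassandra_cell_limit}x")  # 100x
```

So a single partition of this shape is roughly 100x over what a Cassandra partition is designed to hold, which is why the data model itself may be the bottleneck.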

Apache Ignite has full SQL support with indexes, which you can use to improve performance in your case. I would definitely try it.

Refer to this page for details: https://apacheignite.readme.io/docs/sql-queries

