
Cassandra table analytics approaches?

I have a requirement to do real-time filtering and sorting over a relatively big partition in a C* table - roughly 2-3 billion rows with over a hundred columns each. It should be possible to filter and sort over any combination of the columns. We tried Apache Solr (DataStax Enterprise 4.8) for that kind of job but ran into the following issues:

  • Solr indexes perform badly under frequent and bulk data updates
  • Sometimes Solr simply doesn't rebuild the indexes (we waited for hours)
  • Solr can read only with CL=ONE, so data can be inconsistent

So now we're looking for other approaches. We're currently trying Apache Spark 1.4, but the sorting performance doesn't look satisfying - about 1.5 minutes for 2 billion rows (our target is ~1-2 seconds). Maybe we're doing something wrong, as we're at the very beginning of learning Spark. I also understand that performance may improve with more processor cores and memory.
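
For context, here is a minimal sketch of what a filter-and-sort over a Cassandra table looks like with the spark-cassandra-connector's DataFrame API on Spark 1.4 (the host, keyspace, table, and column names below are just placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf()
  .setAppName("cassandra-filter-sort")
  .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder host
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Load the Cassandra table as a DataFrame via the spark-cassandra-connector.
val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "my_table")) // placeholders
  .load()

// Filter and sort on arbitrary columns; this scans the table and shuffles
// all matching rows, which is where most of the query time goes.
df.filter(df("status") === "ACTIVE") // placeholder column
  .sort(df("amount").desc)           // placeholder column
  .limit(100)
  .show()
```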

Today I read about Apache Ignite with its in-memory indexing. Is it perhaps a better tool for our case?

So right now I'm just looking for suggestions on a tool that can do such a job.

Thanks.

ps: DataStax Enterprise 4.8, Apache Cassandra 2.1.9.791, Apache Solr 4.10.3.1.172, Apache Spark 1.4.1.1.

I think your approaches are the best you can get: either Spark (e.g. Spark SQL) or an in-memory data grid like Ignite. Both do the same thing - push everything into memory and slice and dice the data there. See http://velvia.github.io/Subsecond-Joins-in-Spark-Cassandra/ for the Spark route. Flink is another option to consider, but it's not really different from Spark here.
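
A rough sketch of the pattern described in that link - load the Cassandra table once, cache it as an in-memory SQL table, and run ad-hoc queries against the cached copy (keyspace, table, and column names are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("cached-cassandra-sql"))
val sqlContext = new SQLContext(sc)

// Read the Cassandra table through the spark-cassandra-connector (placeholder names).
val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "events"))
  .load()

// Register it as a SQL table and pin it in executor memory, so repeated
// ad-hoc filter/sort queries hit the cached data instead of Cassandra.
df.registerTempTable("events")
sqlContext.cacheTable("events")

sqlContext.sql(
  "SELECT user_id, amount FROM events " +
  "WHERE country = 'DE' ORDER BY amount DESC LIMIT 100" // placeholder columns
).show()
```

The first query pays the cost of scanning Cassandra and materializing the cache; only subsequent queries have a chance of landing near your 1-2 second target, and that still depends on total memory and core count.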

On the other hand, 2-3 billion rows should fit into a Postgres database or something similar. Check whether that would be enough for you.

In the Hadoop world, you have Hive (slow and steady), Impala (faster and memory-heavy), or Spark again. But these won't work well with Cassandra, and I don't believe your data is big enough to justify a Hadoop environment (maintenance cost).

Sorry, but sorting 2 billion rows with over a hundred columns each in 2 seconds would be a big challenge. That means you have roughly 200 billion cells; the recommended maximum is 2 billion cells per partition key, and I think even 2 billion per partition is too much. If you want better Spark performance, you have to find the bottleneck. Can you write a bit more about your setup? How many Cassandra nodes do you have? How many Spark nodes? Hardware specs?

Apache Ignite has full SQL support with indexes, which you could use to improve performance in your case. I would definitely try it.

Refer to this page for details: https://apacheignite.readme.io/docs/sql-queries
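
For example, a minimal sketch of an indexed Ignite cache queried through SqlFieldsQuery - the Row class, field names, and query are hypothetical, just to show where the in-memory indexes are declared:

```scala
import scala.annotation.meta.field
import scala.collection.JavaConverters._

import org.apache.ignite.Ignition
import org.apache.ignite.cache.query.SqlFieldsQuery
import org.apache.ignite.cache.query.annotations.QuerySqlField
import org.apache.ignite.configuration.CacheConfiguration

// Hypothetical value class; @QuerySqlField(index = true) tells Ignite
// to build an in-memory SQL index over the annotated fields.
class Row(
  @(QuerySqlField @field)(index = true) val accountId: Long,
  @(QuerySqlField @field)(index = true) val amount: Double,
  @(QuerySqlField @field) val status: String
) extends Serializable

object IgniteSortExample extends App {
  val ignite = Ignition.start()

  val cfg = new CacheConfiguration[Long, Row]("rows")
  cfg.setIndexedTypes(classOf[java.lang.Long], classOf[Row])
  val cache = ignite.getOrCreateCache(cfg)

  cache.put(1L, new Row(1L, 10.5, "ACTIVE"))
  cache.put(2L, new Row(2L, 99.9, "ACTIVE"))

  // Filtering and sorting are served by the in-memory indexes.
  val query = new SqlFieldsQuery(
    "select accountId, amount from Row where status = ? order by amount desc limit 100")
    .setArgs("ACTIVE")

  cache.query(query).getAll.asScala.foreach(row => println(row.asScala.mkString(", ")))

  ignite.close()
}
```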
