
Cassandra table analytics approaches?

I have a requirement to do real-time filtering and sorting over a relatively big partition in a C* table - roughly 2-3 billion rows with over a hundred columns each. It should be possible to filter and sort over any combination of the columns. We tried Apache Solr (DataStax Enterprise 4.8) for that kind of job but ran into the following issues:

  • Solr indexes perform badly under frequent and bulk data updates
  • Sometimes Solr simply doesn't rebuild the indexes (we waited for hours)
  • Solr can read only with CL=ONE, so data can be inconsistent

So now we're looking for other approaches. We're currently trying Apache Spark 1.4, but the sorting performance doesn't look satisfying - about 1.5 minutes for 2 billion rows (our target is ~1-2 seconds). Maybe we're doing something wrong, as we're at the very beginning of learning Spark. I also understand that performance may improve with more processor cores and memory.
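
For context, here is a minimal sketch of what a filter-and-sort over a Cassandra table looks like with the spark-cassandra-connector's DataFrame API on Spark 1.4 (the host, keyspace, table, and column names below are just placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf()
  .setAppName("cassandra-filter-sort")
  .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder host
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Load the Cassandra table as a DataFrame via the spark-cassandra-connector.
val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "my_table")) // placeholders
  .load()

// Filter and sort on arbitrary columns; this scans the table and shuffles
// all matching rows, which is where most of the query time goes.
df.filter(df("status") === "ACTIVE") // placeholder column
  .sort(df("amount").desc)           // placeholder column
  .limit(100)
  .show()
```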

Today I read about Apache Ignite with its in-memory indexing. Is it perhaps a better tool for our case?

So right now I'm just looking for suggestions on a tool that can do such a job.

Thanks.

ps: DataStax Enterprise 4.8, Apache Cassandra 2.1.9.791, Apache Solr 4.10.3.1.172, Apache Spark 1.4.1.1.

I think your approaches are the best you can get: either Spark (e.g. Spark SQL) or an in-memory data grid like Ignite. Both do the same thing - push everything into memory and slice and dice the data there. See http://velvia.github.io/Subsecond-Joins-in-Spark-Cassandra/ for the Spark route. Flink is another option to consider, but it's not really different from Spark here.
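
A rough sketch of the pattern described in that link - load the Cassandra table once, cache it as an in-memory SQL table, and run ad-hoc queries against the cached copy (keyspace, table, and column names are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("cached-cassandra-sql"))
val sqlContext = new SQLContext(sc)

// Read the Cassandra table through the spark-cassandra-connector (placeholder names).
val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "events"))
  .load()

// Register it as a SQL table and pin it in executor memory, so repeated
// ad-hoc filter/sort queries hit the cached data instead of Cassandra.
df.registerTempTable("events")
sqlContext.cacheTable("events")

sqlContext.sql(
  "SELECT user_id, amount FROM events " +
  "WHERE country = 'DE' ORDER BY amount DESC LIMIT 100" // placeholder columns
).show()
```

The first query pays the cost of scanning Cassandra and materializing the cache; only subsequent queries have a chance of landing near your 1-2 second target, and that still depends on total memory and core count.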

On the other hand, 2-3 billion rows should fit into a Postgres database or something similar. Check whether that would be enough for you.

In the Hadoop world, you have Hive (slow and steady), Impala (faster and memory-heavy), or Spark again. But these won't work well with Cassandra, and I don't believe your data is big enough to justify a Hadoop environment (maintenance cost).

Sorry, but sorting 2 billion rows with over a hundred columns each in 2 seconds would be a big challenge. That means you have roughly 200 billion cells; the recommended maximum is 2 billion cells per partition key, and I think even 2 billion per partition is too much. If you want better Spark performance, you have to find the bottleneck. Can you write a bit more about your setup? How many Cassandra nodes do you have? How many Spark nodes? Hardware specs?

Apache Ignite has full SQL support with indexes, which you could use to improve performance in your case. I would definitely try it.

Refer to this page for details: https://apacheignite.readme.io/docs/sql-queries
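
For example, a minimal sketch of an indexed Ignite cache queried through SqlFieldsQuery - the Row class, field names, and query are hypothetical, just to show where the in-memory indexes are declared:

```scala
import scala.annotation.meta.field
import scala.collection.JavaConverters._

import org.apache.ignite.Ignition
import org.apache.ignite.cache.query.SqlFieldsQuery
import org.apache.ignite.cache.query.annotations.QuerySqlField
import org.apache.ignite.configuration.CacheConfiguration

// Hypothetical value class; @QuerySqlField(index = true) tells Ignite
// to build an in-memory SQL index over the annotated fields.
class Row(
  @(QuerySqlField @field)(index = true) val accountId: Long,
  @(QuerySqlField @field)(index = true) val amount: Double,
  @(QuerySqlField @field) val status: String
) extends Serializable

object IgniteSortExample extends App {
  val ignite = Ignition.start()

  val cfg = new CacheConfiguration[Long, Row]("rows")
  cfg.setIndexedTypes(classOf[java.lang.Long], classOf[Row])
  val cache = ignite.getOrCreateCache(cfg)

  cache.put(1L, new Row(1L, 10.5, "ACTIVE"))
  cache.put(2L, new Row(2L, 99.9, "ACTIVE"))

  // Filtering and sorting are served by the in-memory indexes.
  val query = new SqlFieldsQuery(
    "select accountId, amount from Row where status = ? order by amount desc limit 100")
    .setArgs("ACTIVE")

  cache.query(query).getAll.asScala.foreach(row => println(row.asScala.mkString(", ")))

  ignite.close()
}
```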
