
Does Spark from DSE load all data into an RDD before running a SQL query?

Running DSE 4.7

So say I have a 4-node DSE Cassandra/Spark cluster...

I have a Cassandra table with say 4,000,000 records in it.

In Spark, I run the following Spark SQL query: "select * from table where email = ? or mobile = ?"
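For reference, something like this is how I'm submitting it (a rough sketch from the DSE Spark shell, where sc is already defined; my_ks, users and the literal values are placeholders, and I'm assuming the CassandraSQLContext provided by the Spark Cassandra Connector bundled with DSE 4.7):

```scala
import org.apache.spark.sql.cassandra.CassandraSQLContext

val csc = new CassandraSQLContext(sc)
csc.setKeyspace("my_ks")                 // placeholder keyspace name

// The query from the question, with placeholder literals bound in.
val matches = csc.sql(
  "SELECT * FROM users WHERE email = 'a@b.com' OR mobile = '5550100'")

matches.collect().foreach(println)
```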

Will Spark load all the data into an RDD and then filter based on the where clause? Will each Spark node end up with 1,000,000 records loaded into memory?

Will Spark load all the data into an RDD and then filter based on the where clause?

It depends on your database schema. If your query explicitly restricts the scan to a single C* partition (and yours, where email = ? or mobile = ?, definitely does not), Spark will load only part of the data.

In your case it will have to scan all the data.
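To illustrate the difference, here is a rough sketch using the connector's RDD API (assumed to be run from the DSE Spark shell, where sc is predefined; the keyspace, table and column names are placeholders). Whether a predicate can be pushed down to Cassandra depends on your primary key and indexes; an OR across two columns has no CQL equivalent, so it can only be applied on the Spark side after a full scan.

```scala
import com.datastax.spark.connector._   // brings in sc.cassandraTable(...)

val email  = "a@b.com"      // placeholder values
val mobile = "5550100"

// Your query: every Spark task scans its share of the token ring and the
// OR predicate is evaluated on the Spark side.
val byEmailOrMobile = sc.cassandraTable("my_ks", "users")
  .filter(row => row.getString("email") == email ||
                 row.getString("mobile") == mobile)

// If a predicate can be served by Cassandra itself (e.g. on a clustering
// column or a secondary-indexed column), the connector pushes it into the
// CQL it issues, so far less data reaches Spark:
val byEmail = sc.cassandraTable("my_ks", "users")
  .where("email = ?", email)
```

If you need fast single-record lookups, the usual Cassandra approach is to denormalize into lookup tables keyed by email (and by mobile), so each lookup touches a single partition instead of scanning the whole table.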

Will each Spark node end up with 1,000,000 records loaded into memory?

Again, it depends on your dataset size and the amount of RAM on the worker nodes. Spark RDDs are not always fully loaded into RAM; in your case the data can be split into smaller parts (e.g. 100k rows), which are loaded into RAM, filtered according to your query, and then saved, one by one.
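A rough way to see this chunking (again from the DSE Spark shell, with placeholder names; the output path is hypothetical):

```scala
import com.datastax.spark.connector._

// The scan is split into many Spark partitions, each covering a group of
// C* token ranges; rows within a partition are fetched in pages rather
// than materialised all at once.
val users = sc.cassandraTable("my_ks", "users")
println(s"Spark partitions for this scan: ${users.partitions.length}")

// Executors process a few partitions at a time, filter them, and can
// write results out partition by partition, so the whole 4M-row table
// never has to sit in memory simultaneously.
users
  .filter(r => r.getString("email") == "a@b.com" ||
               r.getString("mobile") == "5550100")
  .saveAsTextFile("/tmp/spark-matches")   // hypothetical output path
```

The size of those chunks is governed by the connector's split and fetch-size settings, whose exact property names depend on the connector version.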
