
Does Spark from DSE load all data into an RDD before running a SQL query?

Running DSE 4.7

So say I have a 4-node DSE Cassandra/Spark cluster...

I have a Cassandra table with say 4,000,000 records in it.

In Spark, I run the following Spark SQL query: "select * from table where email = ? or mobile = ?"
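For reference, something like this is how I'm submitting it (a rough sketch from the DSE Spark shell, where sc is already defined; my_ks, users and the literal values are placeholders, and I'm assuming the CassandraSQLContext provided by the Spark Cassandra Connector bundled with DSE 4.7):

```scala
import org.apache.spark.sql.cassandra.CassandraSQLContext

val csc = new CassandraSQLContext(sc)
csc.setKeyspace("my_ks")                 // placeholder keyspace name

// The query from the question, with placeholder literals bound in.
val matches = csc.sql(
  "SELECT * FROM users WHERE email = 'a@b.com' OR mobile = '5550100'")

matches.collect().foreach(println)
```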

Will Spark load all the data into an RDD and then filter based on the where clause? Will each Spark node end up with 1,000,000 records loaded into memory?

Will Spark load all the data into an RDD and then filter based on the where clause?

It depends on your database schema. If your query explicitly restricts the scan to a single C* partition (and yours, where email = ? or mobile = ?, definitely does not), Spark will load only part of the data.

In your case it will have to scan all the data.
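To illustrate the difference, here is a rough sketch using the connector's RDD API (assumed to be run from the DSE Spark shell, where sc is predefined; the keyspace, table and column names are placeholders). Whether a predicate can be pushed down to Cassandra depends on your primary key and indexes; an OR across two columns has no CQL equivalent, so it can only be applied on the Spark side after a full scan.

```scala
import com.datastax.spark.connector._   // brings in sc.cassandraTable(...)

val email  = "a@b.com"      // placeholder values
val mobile = "5550100"

// Your query: every Spark task scans its share of the token ring and the
// OR predicate is evaluated on the Spark side.
val byEmailOrMobile = sc.cassandraTable("my_ks", "users")
  .filter(row => row.getString("email") == email ||
                 row.getString("mobile") == mobile)

// If a predicate can be served by Cassandra itself (e.g. on a clustering
// column or a secondary-indexed column), the connector pushes it into the
// CQL it issues, so far less data reaches Spark:
val byEmail = sc.cassandraTable("my_ks", "users")
  .where("email = ?", email)
```

If you need fast single-record lookups, the usual Cassandra approach is to denormalize into lookup tables keyed by email (and by mobile), so each lookup touches a single partition instead of scanning the whole table.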

Will each Spark node end up with 1,000,000 records loaded into memory?

Again, it depends on your dataset size and the amount of RAM on the worker nodes. Spark RDDs are not always fully loaded into RAM; in your case the data can be split into smaller parts (e.g. 100k rows), which are loaded into RAM, filtered according to your query, and then saved, one by one.
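A rough way to see this chunking (again from the DSE Spark shell, with placeholder names; the output path is hypothetical):

```scala
import com.datastax.spark.connector._

// The scan is split into many Spark partitions, each covering a group of
// C* token ranges; rows within a partition are fetched in pages rather
// than materialised all at once.
val users = sc.cassandraTable("my_ks", "users")
println(s"Spark partitions for this scan: ${users.partitions.length}")

// Executors process a few partitions at a time, filter them, and can
// write results out partition by partition, so the whole 4M-row table
// never has to sit in memory simultaneously.
users
  .filter(r => r.getString("email") == "a@b.com" ||
               r.getString("mobile") == "5550100")
  .saveAsTextFile("/tmp/spark-matches")   // hypothetical output path
```

The size of those chunks is governed by the connector's split and fetch-size settings, whose exact property names depend on the connector version.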
