
Spark Dataframe Filter Optimization

I'm reading a large number of files from an S3 bucket.

After reading those files, I want to perform a filter operation on the DataFrame.

But when the filter operation executes, the data gets downloaded from the S3 bucket again. How can I avoid reloading the DataFrame?

I have tried caching/persisting the DataFrame before the filter operation, but Spark still pulls the data from the S3 bucket again somehow.

var df = spark.read.json("path_to_s3_bucket/*.json")

df.persist(StorageLevel.MEMORY_AND_DISK_SER_2)

df = df.filter("filter condition").sort(col("columnName").asc)

If the DataFrame is cached, it should not be reloaded from S3 again.

When you call

var df = spark.read.json("path_to_s3_bucket/*.json")

what happens under the covers is that Spark does partition discovery, file listing, and schema inference (this may run some jobs in the background to do the file listing in parallel if you have too many files).
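As a side note, that inference pass can be skipped by supplying the schema up front; a minimal sketch follows, where the field names and types are hypothetical and depend on your actual JSON files:

import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical schema: replace the fields with the ones in your JSON files.
val jsonSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("value", StringType)
))

// With an explicit schema, Spark still lists the files but does not
// have to sample their contents to infer the column types.
val dfWithSchema = spark.read.schema(jsonSchema).json("path_to_s3_bucket/*.json")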

Next when you call

df.persist(StorageLevel.MEMORY_AND_DISK_SER_2)

the only thing that happens is that the query plan is annotated with the fact that you want to persist the data; the persisting itself does not happen at this moment (it is a lazy operation).

Next when you call

df = df.filter("filter condition").sort(col("columnName").asc)

again, only the query plan is updated.

Now if you call an action such as show(), count(), and so on, the query plan is processed and a Spark job is executed. Only now is the data loaded onto the cluster: it is written to memory (because of the caching), read back from the cache, filtered, sorted, and further processed according to your query plan.
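A minimal sketch of that flow, reusing the path, filter condition, and column name placeholders from the question: triggering a cheap action such as count() right after persist() materializes the cache, so the later filter and sort read the cached data instead of going back to S3 (as long as the cached partitions remain available on the executors).

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("CacheBeforeFilter").getOrCreate()

val df = spark.read.json("path_to_s3_bucket/*.json")

// Mark the DataFrame for caching. This is lazy: nothing is read yet.
df.persist(StorageLevel.MEMORY_AND_DISK_SER_2)

// Trigger one action to materialize the cache. This is the job that
// actually pulls the files from S3 and writes the rows to memory/disk.
df.count()

// These operations now read from the cache, not from S3.
val result = df.filter("filter condition").sort(col("columnName").asc)
result.show()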
