Spark UI - Spark SQL Query Execution

I am using the Spark SQL API. The SQL tab of the Spark UI, which shows the query execution plan, reports that the Parquet scan runs multiple times even though I read the Parquet file only once. Is there a logical explanation?

I would also like to understand the different operators, such as HashAggregate and SortMergeJoin, and to understand the Spark UI better as a whole.

If you are doing unions or joins, they may force part of your plan to be "duplicated" from the very beginning, because each side of the union or join is planned as its own branch.

Since Spark doesn't keep intermediate state automatically (unless you cache), it will have to read the sources once per branch, that is, multiple times.

Something like:

# assumes an existing SparkSession named spark
df = spark.read.parquet("ParquetFile1")
dfFiltered = df.filter("active = 1")
result = dfFiltered.union(df)

The plan will probably look like: readParquetFile1 --> union <-- filter <-- readParquetFile1
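To make the "two reads" idea concrete, here is a minimal sketch in plain Python. It is a toy model of lazy evaluation, not Spark code: the classes Source, Filter, Union and Cached and the sample rows are all made up for illustration. It only shows why a source that is referenced by two branches gets scanned twice unless something stores the intermediate result.

```python
# Toy model of a lazily evaluated plan. NOT Spark's API; all names are hypothetical.

class Source:
    """Leaf node standing in for a file scan; counts how often it is read."""
    def __init__(self, rows):
        self.rows = rows
        self.scans = 0

    def compute(self):
        self.scans += 1
        return list(self.rows)


class Filter:
    """Keeps only the rows of the child for which the predicate holds."""
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def compute(self):
        return [r for r in self.child.compute() if self.predicate(r)]


class Union:
    """Concatenates both branches; each branch is computed independently."""
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def compute(self):
        return self.left.compute() + self.right.compute()


class Cached:
    """Stores the child's result after the first computation."""
    def __init__(self, child):
        self.child = child
        self._result = None

    def compute(self):
        if self._result is None:
            self._result = self.child.compute()
        return list(self._result)


rows = [{"id": 1, "active": 1}, {"id": 2, "active": 0}]

# Without caching: both branches of the union walk back to the source.
source = Source(rows)
plan = Union(Filter(source, lambda r: r["active"] == 1), source)
result = plan.compute()
uncached_scans = source.scans

# With caching: the source is read once and both branches reuse the stored rows.
cached_source = Source(rows)
shared = Cached(cached_source)
cached_plan = Union(Filter(shared, lambda r: r["active"] == 1), shared)
cached_result = cached_plan.compute()
cached_scans = cached_source.scans

print(uncached_scans, cached_scans)
```

In Spark itself, calling cache() (or persist()) on df before the union plays roughly the role of Cached here, so the plan should read from the cached data instead of scanning the Parquet file for each branch. Whether that is worth the memory depends on how large the data is and how expensive the scan is.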


 