Spark UI - Spark SQL Query Execution

Question

I am using the Spark SQL API. In the SQL section of the Spark UI, which shows the query execution plan, the Parquet scan appears multiple times even though I read the Parquet file only once. Is there a logical explanation?

I would also like to understand the different operators, such as HashAggregate and SortMergeJoin, and to understand the Spark UI better as a whole.

Answer 1

If you are doing unions or joins, they may force parts of your plan to be "duplicated" from the source onward.

Since Spark does not automatically keep intermediate results (unless you cache them), it has to read the source once for each branch of the plan that uses it.

Something like this (PySpark-style; the file path is a placeholder):

df = spark.read.parquet("ParquetFile1")
dfFiltered = df.filter('active=1')
result = dfFiltered.union(df)

The plan will probably look like: readParquetFile1 --> union <-- filter <-- readParquetFile1
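To make the reasoning concrete, here is a toy model in plain Python. It is not the Spark API: the classes `Source`, `Filter`, `Union`, and `Cached` are hypothetical stand-ins for plan nodes, and the sample rows are made up. It only sketches why a lazily evaluated plan reads the source once per branch, and how caching an intermediate result could change that.

```python
# Toy model of lazy evaluation (plain Python, NOT the Spark API).
# Every node re-evaluates its child each time collect() is called.

class Source:
    def __init__(self, rows):
        self.rows = rows
        self.scans = 0  # counts how many times the "file" is read

    def collect(self):
        self.scans += 1
        return list(self.rows)


class Filter:
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def collect(self):
        return [r for r in self.child.collect() if self.predicate(r)]


class Union:
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def collect(self):
        # Each branch is evaluated independently, so a shared source is read twice.
        return self.left.collect() + self.right.collect()


class Cached:
    """Keeps the child's rows after the first evaluation (a stand-in for caching)."""

    def __init__(self, child):
        self.child = child
        self._rows = None

    def collect(self):
        if self._rows is None:
            self._rows = self.child.collect()
        return list(self._rows)


rows = [{"id": 1, "active": 1}, {"id": 2, "active": 0}]

# Without caching: the filter branch and the plain branch each read the source.
source = Source(rows)
plan = Union(Filter(source, lambda r: r["active"] == 1), source)
result = plan.collect()
uncached_scans = source.scans

# With caching: the source is read once and both branches reuse the rows.
cached_source = Source(rows)
cached = Cached(cached_source)
cached_plan = Union(Filter(cached, lambda r: r["active"] == 1), cached)
cached_result = cached_plan.collect()
cached_scans = cached_source.scans

print(uncached_scans, cached_scans)
```

In this model the uncached plan reads the source twice and the cached plan reads it once, while both return the same rows. How the real Spark UI displays a cached plan may differ, so treat this only as an illustration of the idea.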

solution1
0 2019-05-28 18:11:37