Why does the optimizer decide to self-join a table?

Question

I'm analyzing my query that looks like this:

WITH Project_UnfinishedCount AS (
    SELECT P.Title AS title, COUNT(T.ID) AS cnt
    FROM PROJECT P LEFT JOIN TASK T on P.ID = T.ID_PROJECT AND T.ACTUAL_FINISH_DATE IS NULL 
    GROUP BY P.ID, P.TITLE                                                                  
)
SELECT Title
FROM Project_UnfinishedCount
WHERE cnt = (
    SELECT MAX(cnt)
    FROM Project_UnfinishedCount
    );

It returns the title of the project that has the largest number of unfinished tasks.

Here is its execution plan:

I wonder why it has steps 6-8, which look like a self-join of the PROJECT table. It then stores the result of the join as a view, but according to the Rows and Bytes columns the view is the same as the PROJECT table. Why does the optimizer do this?

I'd also like to know what steps 2 and 1 stand for. My guess is that step 2 stores the result of my CTE for use in steps 10-14, and step 1 removes the rows from the view whose cnt value differs from the one returned by the subquery. Is this guess correct?

Answer 1

In addition to the comments above, when you reference a CTE more than once, there is a heuristic that tells the optimizer to materialize the CTE, which is why you see the temp table transformation.

A few other comments and questions regarding this query. I'm assuming that a PROJECT can have zero or more tasks, and that each TASK belongs to exactly one PROJECT. In that case, I wonder why you have an outer join. Moreover, you are filtering on the ACTUAL_FINISH_DATE column inside the join condition. This means that for a project whose tasks are all complete, the outer join still produces one non-matching row. COUNT(T.ID) ignores that row's NULL and reports 0, but COUNT(*) would report 1 unfinished task, so the result is easy to get wrong. Unless you need projects with zero unfinished tasks in the result, I think your CTE should look more like:

SELECT P.Title AS title, COUNT(T.ID) AS cnt
FROM PROJECT P 
JOIN TASK T on P.ID = T.ID_PROJECT
WHERE T.ACTUAL_FINISH_DATE IS NULL 
GROUP BY P.ID, P.TITLE
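
A quick way to check how the outer-join and inner-join variants count a project whose tasks are all finished is to run both against a tiny data set. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for Oracle, and the sample rows (projects Alpha and Beta) are made up.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for Oracle here.
# Alpha has two unfinished tasks, and the only task of Beta is finished.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE project (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE task (
        id INTEGER PRIMARY KEY,
        id_project INTEGER REFERENCES project(id),
        actual_finish_date TEXT
    );
    INSERT INTO project VALUES (1, 'Alpha'), (2, 'Beta');
    INSERT INTO task VALUES (10, 1, NULL), (11, 1, NULL), (12, 2, '2021-06-01');
""")

# Outer join with the filter in the ON clause, as in the question.
left_join_counts = conn.execute("""
    SELECT p.title, COUNT(t.id) AS cnt_id, COUNT(*) AS cnt_star
    FROM project p
    LEFT JOIN task t ON p.id = t.id_project AND t.actual_finish_date IS NULL
    GROUP BY p.id, p.title
    ORDER BY p.id
""").fetchall()

# Inner join with the filter in the WHERE clause, as in the answer.
inner_join_counts = conn.execute("""
    SELECT p.title, COUNT(t.id) AS cnt
    FROM project p
    JOIN task t ON p.id = t.id_project
    WHERE t.actual_finish_date IS NULL
    GROUP BY p.id, p.title
    ORDER BY p.id
""").fetchall()

print(left_join_counts)   # [('Alpha', 2, 2), ('Beta', 0, 1)]
print(inner_join_counts)  # [('Alpha', 2)]
```

With this data the outer join keeps Beta and reports COUNT(T.ID) = 0 but COUNT(*) = 1, while the inner join omits Beta entirely.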

With all that being said, these "find the match (count, max, etc.) within a group" queries are often more efficient when written with a window function. That way you can eliminate the self-join. This can make a big performance difference when you have millions or billions of rows. For example, your query could be rewritten as:

SELECT TITLE, CNT
FROM (
    SELECT P.Title AS title, COUNT(T.ID) AS cnt,
           RANK() OVER (ORDER BY COUNT(T.ID) DESC) AS rnk
    FROM PROJECT P
    JOIN TASK T ON P.ID = T.ID_PROJECT
    WHERE T.ACTUAL_FINISH_DATE IS NULL
    GROUP BY P.ID, P.TITLE
)
WHERE RNK = 1;
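
As a sanity check that the window-function form picks the same projects as the original CTE form, here is a minimal sketch that runs both on made-up data. It assumes Python's built-in sqlite3 module (with SQLite 3.25 or later for window functions) as a stand-in for Oracle; the projects Alpha, Beta, and Gamma are hypothetical.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for Oracle here.
# Unfinished tasks: Alpha 2, Beta 2 (plus one finished), Gamma 1.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE project (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE task (
        id INTEGER PRIMARY KEY,
        id_project INTEGER REFERENCES project(id),
        actual_finish_date TEXT
    );
    INSERT INTO project VALUES (1, 'Alpha'), (2, 'Beta'), (3, 'Gamma');
    INSERT INTO task VALUES
        (10, 1, NULL), (11, 1, NULL),
        (12, 2, NULL), (13, 2, NULL), (14, 2, '2021-06-01'),
        (15, 3, NULL);
""")

# Original form: the CTE is referenced twice (once for MAX, once for the filter).
cte_sql = """
    WITH project_unfinishedcount AS (
        SELECT p.title AS title, COUNT(t.id) AS cnt
        FROM project p
        LEFT JOIN task t ON p.id = t.id_project AND t.actual_finish_date IS NULL
        GROUP BY p.id, p.title
    )
    SELECT title FROM project_unfinishedcount
    WHERE cnt = (SELECT MAX(cnt) FROM project_unfinishedcount)
    ORDER BY title
"""

# Window-function form: one pass, ranking the groups by their count.
window_sql = """
    SELECT title
    FROM (
        SELECT p.title AS title, COUNT(t.id) AS cnt,
               RANK() OVER (ORDER BY COUNT(t.id) DESC) AS rnk
        FROM project p
        JOIN task t ON p.id = t.id_project
        WHERE t.actual_finish_date IS NULL
        GROUP BY p.id, p.title
    ) ranked
    WHERE rnk = 1
    ORDER BY title
"""

cte_titles = [row[0] for row in conn.execute(cte_sql)]
window_titles = [row[0] for row in conn.execute(window_sql)]
print(cte_titles, window_titles)  # ['Alpha', 'Beta'] ['Alpha', 'Beta']
```

Both forms return Alpha and Beta here, because RANK() gives tied groups the same rank. The two are not identical in every case: if no project has an unfinished task, the outer-join CTE form returns every project (all with a count of 0), whereas the inner-join window form returns no rows.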

solution1
2 ACCEPTED 2021-06-07 19:06:44
