UNION ALL performance in SQL Server 2005

I have a query with a long chain of CTEs which ends with

SELECT RegionName, AreaName, CityName, SubCityName, StreetName 
FROM tDictionaryStreets
UNION ALL
SELECT RegionName, AreaName, CityName, SubCityName, StreetName 
FROM tDictionaryRegions

The execution time of this query is 1450 ms. When I execute these two SELECTs separately, they take much less time. For the query

SELECT RegionName, AreaName, CityName, SubCityName, StreetName 
FROM tDictionaryStreets

execution time is 106 ms. And for the query

SELECT RegionName, AreaName, CityName, SubCityName, StreetName 
FROM tDictionaryRegions

it's 20 ms.

Why does UNION ALL increase the execution time more than tenfold? What can I do to reduce it?

Thank you for your help.

UPDATED: The whole query (I shortened it, but the problem is still present) is

WITH tFoundRegions AS
(
    SELECT KladrItemName FROM dbo.tBuiltKladrItemsWithQuants
    WHERE UserID = @UserID AND (indeces & 1) > 0
),
tFoundAreas AS
(
    SELECT KladrItemName FROM dbo.tBuiltKladrItemsWithQuants
    WHERE UserID = @UserID AND (indeces & 2) > 0
),
tFoundCities AS
(
    SELECT KladrItemName FROM dbo.tBuiltKladrItemsWithQuants
    WHERE UserID = @UserID AND (indeces & 4) > 0
),
tFoundSubCities AS
(
    SELECT KladrItemName FROM dbo.tBuiltKladrItemsWithQuants
    WHERE UserID = @UserID AND (indeces & 8) > 0
),
tFoundStreets AS
(
    SELECT KladrItemName FROM dbo.tBuiltKladrItemsWithQuants
    WHERE UserID = @UserID AND (indeces & 16) > 0
),
tDictionaryStreets AS
(
    SELECT DISTINCT
        CASE WHEN RegionName IN (SELECT KladrItemName FROM tFoundRegions) THEN RegionName ELSE NULL END RegionName
      , CASE WHEN AreaName IN (SELECT KladrItemName FROM tFoundAreas) THEN AreaName ELSE NULL END AreaName
      , CASE WHEN CityName IN (SELECT KladrItemName FROM tFoundCities) THEN CityName ELSE NULL END CityName
      , CASE WHEN SubCityName  IN (SELECT KladrItemName FROM tFoundSubCities) THEN SubCityName ELSE NULL END SubCityName
      , StreetName 
    FROM StreetNames
    WHERE StreetName IN (SELECT KladrItemName FROM tFoundStreets)
),
tMissingSubCities AS
(
    SELECT KladrItemName FROM tFoundSubCities
    WHERE KladrItemName NOT IN (SELECT SubCityName FROM tDictionaryStreets)
),
tDictionarySubCities AS
(
    SELECT DISTINCT 
        CASE WHEN RegionName IN (SELECT KladrItemName FROM tFoundRegions) THEN RegionName ELSE NULL END RegionName
      , CASE WHEN AreaName IN (SELECT KladrItemName FROM tFoundAreas) THEN AreaName ELSE NULL END AreaName
      , CASE WHEN CityName IN (SELECT KladrItemName FROM tFoundCities) THEN CityName ELSE NULL END CityName
      , SubCityName
      , NULL StreetName 
    FROM SubCityNames
    WHERE SubCityName IN (SELECT KladrItemName FROM tMissingSubCities)
)
SELECT RegionName, AreaName, CityName, SubCityName, StreetName 
FROM tDictionaryStreets
UNION ALL
SELECT RegionName, AreaName, CityName, SubCityName, StreetName 
FROM tDictionarySubCities

Make sure you clear the plan cache and the data cache between test runs.

e.g.

DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS

If you run with the UNION ALL first, and then run the 2 selects separately afterwards, the data will already be cached in memory making performance much better (therefore giving the false impression that the subsequent approach is quicker when it may not be).
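A minimal sketch of one cold-cache test run, intended for a test server only. The CHECKPOINT step is an addition to the two DBCC commands: DBCC DROPCLEANBUFFERS removes only clean pages, so dirty pages need to be written to disk first. The SELECT is a placeholder, not the query from the question.

```sql
-- Cold-cache test run (test/dev server only: this affects every session).
CHECKPOINT;                 -- write dirty pages to disk so they become clean
DBCC DROPCLEANBUFFERS;      -- empty the data (buffer) cache
DBCC FREEPROCCACHE;         -- empty the plan cache, forcing a recompile
GO

-- Placeholder: put exactly one query under test here, then repeat the
-- batch above before timing the next query.
SELECT COUNT(*) FROM sys.objects;
GO
```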

If you used a UNION then that may well be slower as it has to apply a DISTINCT, but UNION ALL doesn't have to do that so it should be no different.
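As a small illustration of the difference, using literal values rather than the tables from the question:

```sql
-- UNION removes duplicate rows, which costs an extra distinct step.
SELECT 1 AS n UNION     SELECT 1;   -- returns one row

-- UNION ALL simply concatenates the inputs.
SELECT 1 AS n UNION ALL SELECT 1;   -- returns two rows
```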

Update:
Have a look at the execution plans and compare them to see if there is any difference. You can view the execution plan by clicking the "Include Actual Execution Plan" button in SSMS before running the query.
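If you prefer to capture the actual plan from a script instead of the SSMS button, one option is SET STATISTICS XML. This is a sketch with a placeholder query; substitute the UNION ALL query and each of the separate SELECTs in turn.

```sql
SET STATISTICS XML ON;    -- return the actual execution plan as XML after each statement
GO

-- Placeholder for the query under test.
SELECT name FROM sys.objects WHERE type = 'U';
GO

SET STATISTICS XML OFF;
GO
```

Each statement then returns an extra result set containing the plan, which SSMS can open graphically.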

Update 2:
Based on full CTEs given, I think I'd be looking at optimising those - I don't think the UNION ALL is actually the problem.

IMHO, the best thing to try is to work through the CTEs one by one and optimise each one individually, so that they perform better when you combine them all in the main query.

e.g. for tDictionaryStreets, how about trying this:

SELECT DISTINCT
    r.KladrItemName AS RegionName,
    a.KladrItemName AS AreaName,
    c.KladrItemName AS CityName,
    sc.KladrItemName AS SubCityName,
    s.StreetName
FROM StreetNames s
    JOIN tFoundStreets fs ON s.StreetName = fs.KladrItemName
    LEFT JOIN tFoundRegions r ON s.RegionName = r.KladrItemName
    LEFT JOIN tFoundAreas a ON s.AreaName = a.KladrItemName
    LEFT JOIN tFoundCities c ON s.CityName = c.KladrItemName
    LEFT JOIN tFoundSubCities sc ON s.SubCityName = sc.KladrItemName

The KladrItemName column in each table should at least have an index on it. Try reworking tDictionarySubCities in the same way, using joins.
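A sketch of what such indexes might look like. The index names are hypothetical, the table and column names are taken from the question, and it assumes the columns have indexable data types and that no equivalent indexes exist already. Because a predicate like (indeces & 1) > 0 cannot be used for an index seek, the first index seeks on UserID and carries indeces as an included column.

```sql
-- Hypothetical index names; adjust to the real schema.
CREATE NONCLUSTERED INDEX IX_tBuiltKladrItemsWithQuants_UserID_KladrItemName
    ON dbo.tBuiltKladrItemsWithQuants (UserID, KladrItemName)
    INCLUDE (indeces);          -- INCLUDE is available from SQL Server 2005

CREATE NONCLUSTERED INDEX IX_StreetNames_StreetName
    ON StreetNames (StreetName);

CREATE NONCLUSTERED INDEX IX_SubCityNames_SubCityName
    ON SubCityNames (SubCityName);
```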

Could you compare the execution plans? What is different? UNION ALL should work fine, as there is no duplicate removal (which typically requires a sort and is costly for large data sets).

It could be the network (unlikely) or memory, depending on the number of rows each result set returns. One way to check whether the time goes to the network or to the server is to include client statistics in SSMS (Query > Include Client Statistics, Shift+Alt+S). The statistics shown at the bottom let you see where the bulk of the time is spent.
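A server-side complement to client statistics is SET STATISTICS TIME and SET STATISTICS IO, which print CPU time, elapsed time and page reads to the Messages tab. A minimal sketch with a placeholder query:

```sql
SET STATISTICS TIME ON;   -- report parse/compile and execution CPU and elapsed time
SET STATISTICS IO ON;     -- report logical and physical reads per table
GO

-- Placeholder for the query under test.
SELECT COUNT(*) FROM sys.objects;
GO

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
GO
```

An elapsed time far above the CPU time suggests the query is mostly waiting (for example on disk reads or on the client consuming rows) rather than computing.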

Could you compare the execution plans? [...] lmsasu [...] When the query runs fast it uses 'merge join'; when slow, 'nested loops'. [...]

I can't comment yet, but what you see in the execution plan is the difference between "joining" two result sets (merge join) and an RBAR (pronounced "ree-bar": Row By Agonizing Row [Jeff Moden]) operation, commonly called a loop.

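Merge Join: SQL Server takes two inputs that are sorted on the join key and walks through both in a single set-based pass to bring the two sets together.

Nested Loop: for each row of the first (outer) input, SQL Server searches the second (inner) input for matching rows and discards the ones that don't match. Without a useful index on the inner input, this means scanning it once per outer row.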
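To see the two operators side by side, you can force each physical join with a query hint and compare plans and timings. This is a diagnostic sketch on hypothetical temp tables (#Found and #Streets are not from the question), not a recommendation to ship join hints.

```sql
-- Hypothetical demo tables.
CREATE TABLE #Found   (KladrItemName nvarchar(100) NOT NULL);
CREATE TABLE #Streets (StreetName    nvarchar(100) NOT NULL);

-- SQL Server 2005 has no multi-row VALUES, so use SELECT ... UNION ALL.
INSERT INTO #Found (KladrItemName)
SELECT N'Lenina' UNION ALL SELECT N'Mira';

INSERT INTO #Streets (StreetName)
SELECT N'Lenina' UNION ALL SELECT N'Mira' UNION ALL SELECT N'Sadovaya';

-- Same query, forced to use a merge join.
SELECT s.StreetName
FROM #Streets s
    JOIN #Found f ON f.KladrItemName = s.StreetName
OPTION (MERGE JOIN);

-- Same query, forced to use nested loops.
SELECT s.StreetName
FROM #Streets s
    JOIN #Found f ON f.KladrItemName = s.StreetName
OPTION (LOOP JOIN);

DROP TABLE #Found;
DROP TABLE #Streets;
```

Appending OPTION (MERGE JOIN) to the slow UNION ALL query is a quick way to check whether the join choice alone explains the difference.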

My gut feeling is that SQL Server stumbles over the NULL results, which are unknown values. Try assigning a value like "XYZ" (or anything known not to appear in the data) that you can simply filter on in the last query. This might avoid the Nested Loop for certain result sets, because the values are then known rather than unknown. Similar to:

[...]
tDictionarySubCities AS 
( 
    SELECT DISTINCT  
        CASE WHEN RegionName IN (SELECT KladrItemName FROM tFoundRegions) THEN RegionName ELSE 'XYZ' END RegionName
      , CASE WHEN AreaName IN (SELECT KladrItemName FROM tFoundAreas) THEN AreaName ELSE 'XYZ' END AreaName
      , CASE WHEN CityName IN (SELECT KladrItemName FROM tFoundCities) THEN CityName ELSE 'XYZ' END CityName
      , SubCityName 
      , NULL StreetName  
    FROM SubCityNames 
    WHERE SubCityName IN (SELECT KladrItemName FROM tMissingSubCities) 
) 
SELECT RegionName, AreaName, CityName, SubCityName, StreetName  
FROM tDictionaryStreets 
WHERE RegionName <> 'XYZ'
UNION ALL 
SELECT RegionName, AreaName, CityName, SubCityName, StreetName  
FROM tDictionarySubCities 
WHERE RegionName <> 'XYZ'
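Note that a WHERE filter on the sentinel removes whole rows. If the rows should be kept and only the sentinel turned back into NULL, as in the original query's output, NULLIF in the final SELECT is one option. A minimal sketch on a hypothetical temp table, with 'XYZ' standing for whatever sentinel value was chosen:

```sql
-- Hypothetical demo data; #Result stands in for the CTE output.
CREATE TABLE #Result
(
    RegionName  nvarchar(100) NOT NULL,
    SubCityName nvarchar(100) NOT NULL
);

INSERT INTO #Result (RegionName, SubCityName)
SELECT N'XYZ', N'Zarechny'
UNION ALL
SELECT N'Moskovskaya', N'Severny';

-- NULLIF returns NULL when both arguments are equal, so the sentinel
-- becomes NULL again while the row itself is kept.
SELECT NULLIF(RegionName, N'XYZ') AS RegionName, SubCityName
FROM #Result;

DROP TABLE #Result;
```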

I stumbled upon a similar problem, and after carefully analyzing the situation it appears to me that using a CTE in a UNION ALL query turns off parallelization (which is most likely a bug).

In other words, the UNION ALL query takes as long as the sum of the two queries with each one limited to MAXDOP 1.

More testing is needed, and it is actually hard to construct a query that reliably uses parallelization, whether to test against or to submit as a bug on Microsoft Connect. Still, your problem, as well as the problem described in "Why CTE (recurisve) is not parallilized (MAXDOP=8)?", is evidence that such a problem does exist.

EDIT: I have tested more extensively. A UNION ALL query does parallelize in many cases, but there are still situations where a query parallelizes without the UNION ALL and adding the UNION ALL turns parallelization off.

While this might be a bug, it might also result from the fact that the query optimizer does not look for the best plan but for a good enough plan. Two queries combined with a UNION already generate a complex plan, as does a query with a CTE, so the optimizer may simply settle on a good enough plan before even considering parallelization.
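One way to test whether parallelism explains a timing difference is to run the combined query with and without a MAXDOP 1 hint and compare both the timings and the plans (a parallel plan contains Parallelism operators). A sketch using catalog views as placeholder queries:

```sql
-- Placeholder queries; substitute the real UNION ALL query.
-- The OPTION clause goes once, at the end, and applies to the whole statement.
SELECT COUNT(*) AS cnt FROM sys.objects a CROSS JOIN sys.objects b
UNION ALL
SELECT COUNT(*) FROM sys.columns
OPTION (MAXDOP 1);      -- force a serial plan

-- Same statement without the hint, so the optimizer is free to go parallel.
SELECT COUNT(*) AS cnt FROM sys.objects a CROSS JOIN sys.objects b
UNION ALL
SELECT COUNT(*) FROM sys.columns;
```

If both versions take about the same time and neither plan shows Parallelism operators, the slowdown probably has another cause.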
