SQL Server parent-child join and slow query performance

Question

I have 2 tables (cannot change them)

Parent (id, date, amount)
Child (parent_id, key, value)

indexes

Parent.pk (id)
Parent.idx1 (id, date) include (amount)
Child.pk (parent_id, key)
Child.idx1 (parent_id, key, value)

and query

select sum(amount)
from Parent as p
left outer join Child as c1 on c1.parent_id = p.id and c1.key = 'X'
left outer join Child as c2 on c2.parent_id = p.id and c2.key = 'Y'
where p.date between '20120101' and '20120131'
and c1.value = 'x1'
and c2.value = 'y1'

The problem is performance.
Parent has ~1,500,000 records and Child has ~6,000,000 records.

Take 1

This query takes ~3 s, which is too much for my scenario; it must take no more than a few milliseconds.

The execution plan shows that SQL Server does an index scan on Parent.idx1 and then a merge join with a Child.idx1 clustered index seek. That is not optimal, because it scans all 1,500,000 records even though I filter them by date.

Take 2

When I change Parent.idx1 to

Parent.idx1 (date, id) include (amount)

SQL Server chooses a clustered index scan on Parent.pk and then again a merge join with Child.idx1. Execution time is ~6 s.

Take 3

When I force it to use Parent.idx1 (date, id) include (amount), it sorts the result before the merge join and execution time is even worse: ~11 s.

Take 4

I tried to create an indexed view, but cannot use one because of the LEFT OUTER JOIN.

Is there any way to make such a query (a Parent-Child join with filters on both tables) faster, without de-normalization?

Update 2013-07-04:
To those answering "use INNER JOIN": yes, it's much faster, but I cannot use it.
What I showed here is a simplified version of what I really need.
I need to create a SQL view over the MS Dynamics NAV "G/L Entry" (Parent) and "Ledger Entry Dimension" (Child) tables so that I can read it from that application. The complete view currently looks like this:

create view analysis
as
select 
    v.id as view_id
    , p.date
    , p.Amount
    , c1.value as value1
    , c2.value as value2
    , c3.value as value3
    , c4.value as value4
from Parent as p
    cross join analysis_view as v
    left outer join Child as c1 on c1.parent_id = p.id and c1.key = v.key1
    left outer join Child as c2 on c2.parent_id = p.id and c2.key = v.key2
    left outer join Child as c3 on c3.parent_id = p.id and c3.key = v.key3
    left outer join Child as c4 on c4.parent_id = p.id and c4.key = v.key4

where analysis_view currently contains 8 records and looks like this: analysis_view (id, key1, key2, key3, key4)
and then the application may query it like this

select sum(amount)
from analysis
where view_id = 1 and date between '20120101' and '20120131'
and value1 = 'x1'
and value2 = 'x2'

or

select sum(amount)
from analysis
where view_id = 1 and date between '20120101' and '20120131'
and value1 = 'x1'
and value3 = 'z1'

MS Dynamics NAV already has a de-normalized table for this and queries against it are fast, but it's huge in our case (~10 GB) and it locks the whole system for around one hour when someone creates a new analysis view. Also, NAV doesn't know how to produce joins, which is why I must define the view on the SQL Server side.

Answer 1

Change your LEFT JOIN to INNER JOIN. The predicate c1.value = 'x1' is going to discard the unmatched outer rows anyway.
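As a rough sketch of that change, using the simplified tables from the question: a Parent row with no matching Child gets c1.value = NULL, and NULL = 'x1' is never true, so the WHERE clause already rejects it. Moving the value filters into the join conditions should therefore return the same sum. The brackets around [key] are my addition, since KEY is a reserved word in T-SQL.

```sql
-- Sketch only: same filters as the original query, expressed as inner joins.
-- Table and column names are the simplified ones from the question.
SELECT SUM(p.amount)
FROM Parent AS p
INNER JOIN Child AS c1
    ON c1.parent_id = p.id AND c1.[key] = 'X' AND c1.value = 'x1'
INNER JOIN Child AS c2
    ON c2.parent_id = p.id AND c2.[key] = 'Y' AND c2.value = 'y1'
WHERE p.date BETWEEN '20120101' AND '20120131'
```

With inner joins the optimizer is free to start from whichever table it estimates to be most selective, which the outer-join form may prevent.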

Answer 2

I'm running through a few attempts, but I'm not finding anything that runs much faster than just fixing up the indexes.

Take 1: Create a materialized view that handles the parent and first child (can't have two references to the same table in a materialized view), then join it to the child in the query -- not much faster.

Take 2: Create a second materialized view with parent and child2 and join the two materialized views -- again, not much faster.

Take 3: Use INTERSECT instead of JOIN to merge the two materialized views together -- not much faster.

Take 4: Break out the datetime in the materialized views into year and month columns -- not much faster (actually slower).

The big problem seems to be you have constraints on the child table twice, which eliminates the ability to do materialized views in any sort of efficient manner. I can write a materialized view that will make monthly totals looking for parents with a child key of 'X' and value of 'X1' very very fast by pre-aggregating them, but there isn't enough information left to join back to filter out the amounts that don't have the child2 relationship.
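To make the pre-aggregation idea concrete, here is a minimal sketch of such a monthly-totals view for the single filter key 'X' / value 'x1'. It assumes the test schema from this answer (amount is NOT NULL, which an indexed view needs for SUM); the names monthly_x1, total_amount and parent_count are made up for illustration. Indexed views with GROUP BY also have to include COUNT_BIG(*).

```sql
-- Hypothetical pre-aggregated view: one row per (year, month) for parents
-- that have a child with key 'X' and value 'x1'.
CREATE VIEW dbo.monthly_x1 WITH SCHEMABINDING AS
SELECT DATEPART(YEAR, p.date)  AS yy,
       DATEPART(MONTH, p.date) AS mm,
       SUM(p.amount)           AS total_amount,
       COUNT_BIG(*)            AS parent_count  -- required in an indexed view with GROUP BY
FROM dbo.Parent AS p
JOIN dbo.Child AS c1
    ON c1.parent_id = p.id AND c1.[key] = 'X' AND c1.value = 'x1'
GROUP BY DATEPART(YEAR, p.date), DATEPART(MONTH, p.date)
GO
-- Materialize the view; the unique key is the grouping columns.
CREATE UNIQUE CLUSTERED INDEX monthly_x1_pk ON dbo.monthly_x1 (yy, mm)
GO
-- Reading a month is then a single-row lookup.
SELECT total_amount FROM dbo.monthly_x1 WHERE yy = 2013 AND mm = 8
```

Because the parent id is aggregated away, there is nothing left to join on for the second child filter, which is exactly the limitation described above.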

That and I was lazy and trying to do performance testing with 1/10th the amount of data you have, and my results were still very quick (<200ms) no matter what I did. I'm building a full set of test data now, but obviously I don't know what your distribution is. It would help to know how many of the 1,500,000 records have X children, Y children, and both X & Y children. And if this is a fixed query or the keys/values are going to change at run-time.
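For reference, the distribution asked about here could be collected with something like the following sketch. It relies on the primary key Child (parent_id, key), so each left join matches at most one row per parent; the output column names are made up.

```sql
-- Sketch: how many parents have an X child, a Y child, or both.
SELECT
    COUNT(*) AS total_parents,
    SUM(CASE WHEN x.parent_id IS NOT NULL THEN 1 ELSE 0 END) AS with_x,
    SUM(CASE WHEN y.parent_id IS NOT NULL THEN 1 ELSE 0 END) AS with_y,
    SUM(CASE WHEN x.parent_id IS NOT NULL
              AND y.parent_id IS NOT NULL THEN 1 ELSE 0 END) AS with_both
FROM Parent AS p
LEFT JOIN Child AS x ON x.parent_id = p.id AND x.[key] = 'X'
LEFT JOIN Child AS y ON y.parent_id = p.id AND y.[key] = 'Y'
```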

Here are my test scripts. Setup:

CREATE TABLE Parent (id int NOT NULL CONSTRAINT parent_pk PRIMARY KEY, date datetime, amount decimal(10,2) NOT NULL)
CREATE TABLE Child (parent_id int NOT NULL, [key] char(1) NOT NULL, value char(2) NOT NULL, CONSTRAINT child_pk PRIMARY KEY (parent_id,[key]))
CREATE INDEX Parent_IDX ON Parent (id,date,amount)
CREATE INDEX Child_IDX ON Child (parent_id,[key],value)

DECLARE @RowCount INT
DECLARE @Random INT
DECLARE @Upper INT
DECLARE @Lower INT
DECLARE @InsertDate DATETIME
DECLARE @keys INT
DECLARE @key INT

SET @Lower = 0
SET @Upper = 500
SET @RowCount = 0
WHILE @RowCount < 15000
BEGIN

SELECT @Random = ROUND(((@Upper - @Lower -1) * RAND() + @Lower), 0)
SET @InsertDate = DATEADD(dd, @Random, GETDATE())

INSERT INTO Parent(id,date,amount) 
VALUES (@RowCount , @InsertDate ,@Random)

-- 1 to 3 children per parent; ROUND(RAND()*3+1,0) could return 4 and insert a blank key
SET @keys=FLOOR(RAND()*3)+1
SET @key=0
WHILE @key<@keys
BEGIN
INSERT INTO Child(parent_id,[key],value)
VALUES (@RowCount,SUBSTRING('XYZ',@key+1,1),SUBSTRING('XYZ',@key+1,1)+'1')
SET @key=@key+1
END

SET @RowCount = @RowCount + 1
END

And my scratch pad:

SELECT COUNT(*) ParentCount FROM Parent
GO
SELECT COUNT(*) ChildCount FROM Child
GO
CREATE INDEX Parent_IDX2 ON Parent(date,id)
GO
CREATE VIEW blah WITH SCHEMABINDING AS
SELECT p.id,p.amount,DATEPART(YEAR,p.date) AS yy,DATEPART(Month,p.date) AS mm
from dbo.Parent as p
join dbo.Child as c1 on c1.parent_id = p.id and c1.[key] = 'X' and c1.value = 'x1'
--join dbo.Child as c2 on c2.parent_id = p.id and c2.[key] = 'Y' and c2.value = 'y1'
GO
CREATE UNIQUE CLUSTERED INDEX blah_pk ON blah (id)
CREATE INDEX blah_IDX ON blah (yy,mm,amount)
GO
CREATE VIEW blah2 WITH SCHEMABINDING AS
SELECT p.id,p.amount,DATEPART(YEAR,p.date) AS yy,DATEPART(Month,p.date) AS mm
from dbo.Parent as p
join dbo.Child as c1 on c1.parent_id = p.id and c1.[key] = 'Y' and c1.value = 'y1'

GO
CREATE UNIQUE CLUSTERED INDEX blah2_pk ON blah2 (id)
CREATE INDEX blah2_IDX ON blah2 (yy,mm,amount)
GO
select sum(amount)
from Parent as p
join Child as c1 on c1.parent_id = p.id and c1.[key] = 'X' and c1.value = 'x1'
join Child as c2 on c2.parent_id = p.id and c2.[key] = 'Y' and c2.value = 'y1'
where p.date between '20130801' and '20130831'
GO
select sum(amount)
from blah p
join Child as c2 on c2.parent_id = p.id and c2.[key] = 'Y' and c2.value = 'y1'
where p.yy=2013 and p.mm=8
GO
SELECT sum(blah.amount)
FROM blah
JOIN blah2 ON blah.id=blah2.id AND blah.yy=blah2.yy AND blah.mm=blah2.mm and blah.amount=blah2.amount
where blah.yy=2013 and blah.mm=8

SELECT SUM(amount)
FROM (
SELECT *
FROM blah
where blah.yy=2013 and blah.mm=8
INTERSECT
SELECT *
FROM blah2
where blah2.yy=2013 and blah2.mm=8
) t1

Answer 3

There are several things affecting performance (though I'm not an expert). One is the index on Child that has every column of the table as a key column, which doesn't really make sense. Another is that you filter on the values of c1 and c2, which effectively turns your query into an INNER JOIN. You should try modifying it to use EXISTS instead, something like this:

select sum(p.amount)
from Parent as p
where p.date between '20120101' and '20120131'
-- [key] is bracketed because KEY is a reserved word in T-SQL
and exists(select 1 from Child 
           where parent_id = p.id and [key] = 'X'
           and value = 'x1')
and exists(select 1 from Child 
           where parent_id = p.id and [key] = 'Y'
           and value = 'y1')

3 answers

solution1
1 2013-07-03 20:36:45

solution2
0 2013-07-03 20:25:37

solution3
0 2013-07-03 20:36:00
