
How to do an efficient Outer or Left join in XQuery?

I have the following data:

<!-- subjects.xml -->
<Subjects>
    <Subject>
        <Id>1</Id>
        <Name>Maths</Name>
    </Subject>
    <Subject>
        <Id>2</Id>
        <Name>Science</Name>
    </Subject>
    <Subject>
        <Id>2</Id>
        <Name>Advanced Science</Name>
    </Subject>
    <Subject>
        <Id>3</Id>
        <Name>History</Name>
    </Subject>
</Subjects>

which is to be joined to:

<!-- courses.xml-->
<Courses>
    <Course>
        <SubjectId>1</SubjectId>
        <Name>Algebra I</Name>
    </Course>
    <Course>
        <SubjectId>1</SubjectId>
        <Name>Algebra II</Name>
    </Course>
    <Course>
        <SubjectId>1</SubjectId>
        <Name>Percentages</Name>
    </Course>
    <Course>
        <SubjectId>2</SubjectId>
        <Name>Physics</Name>
    </Course>
    <Course>
        <SubjectId>2</SubjectId>
        <Name>Biology</Name>
    </Course>
</Courses>

I wish to do a left join on the first table to the second table so as to get the following output:

<Results>
    <Result>
        <Table1>
            <Subject>
                <Id>1</Id>
                <Name>Maths</Name>
            </Subject>
        </Table1>
        <Table2>
            <Course>
                <SubjectId>1</SubjectId>
                <Name>Algebra I</Name>
            </Course>
            <Course>
                <SubjectId>1</SubjectId>
                <Name>Algebra II</Name>
            </Course>
            <Course>
                <SubjectId>1</SubjectId>
                <Name>Percentages</Name>
            </Course>
        </Table2>
    </Result>
    <Result>
        <Table1>
            <!-- Notice there are 2 subjects here, as they both have the same ID-->
            <Subject>
                <Id>2</Id>
                <Name>Science</Name>
            </Subject>
            <Subject>
                <Id>2</Id>
                <Name>Advanced Science</Name>
            </Subject>
        </Table1>
        <Table2>
            <Course>
                <SubjectId>2</SubjectId>
                <Name>Physics</Name>
            </Course>
            <Course>
                <SubjectId>2</SubjectId>
                <Name>Biology</Name>
            </Course>
        </Table2>
    </Result>
    <Result>
        <Table1>
            <Subject>
                <Id>3</Id>
                <Name>History</Name>
            </Subject>
        </Table1>
        <Table2>
            <!-- Notice this section is empty -->
        </Table2>
    </Result>
</Results>

I have the following code to do this:

<Results>
    {
        (: PART 1: for each Course whose 'SubjectId' exists in "subjects.xml" :)
        for $e2 in doc("courses.xml")/Courses/Course
        let $foriegnId := $e2/SubjectId
        group by $foriegnId
        let $e1 := doc("subjects.xml")/Subjects/Subject[Id = $foriegnId]
        where $e1

        return
            <Result>
                <Table1>
                    {$e1}
                </Table1>
                <Table2>
                    {$e2}
                </Table2>
            </Result>
    }

    {
    (: PART 2 :)
    (: Show the remaining subjects, i.e. those that have no matching course :)
        for $e1 in doc('subjects.xml')/Subjects/Subject
        let $idVal := $e1/Id
        group by $idVal
        where not(doc('courses.xml')/Courses/Course/SubjectId = $idVal)
        return
            <Result>
                <Table1>
                    {$e1}
                </Table1>
                <Table2/>
            </Result>
    }
</Results>

The code works and produces the expected output. However, on large inputs (750 Subjects with 120 Courses each, plus 100 Subjects without any Courses and 100 Courses without any Subjects) the script runs extremely slowly.

What can I do to make my script faster? Is there a better way of doing this? What's the time complexity?

Update 2

It turns out I had badly misidentified the problem. It had very little to do with part 2 of the code and was almost entirely in part 1.

What I did was:

for $e2 in doc("courses.xml")/Courses/Course
let $foriegnId := $e2/SubjectId
let $e1 := doc("subjects.xml")/Subjects/Subject[Id = $foriegnId]
group by $foriegnId

when what I should have done was:

for $e2 in doc("courses.xml")/Courses/Course
let $foriegnId := $e2/SubjectId
group by $foriegnId
let $e1 := doc("subjects.xml")/Subjects/Subject[Id = $foriegnId]

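This reduced the running time from 30,000 ms to around 4,000 ms, presumably because the lookup into subjects.xml now runs once per distinct SubjectId instead of once per Course.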

Further performance improvements are welcome.

Depending on how the query is optimized, the sequence of SubjectId elements in part 2 might be rebuilt again and again, once for each subject. Fetch it once in advance and check each subject ID against that:

    let $subjectIds := doc('courses.xml')/Courses/Course/SubjectId
    for $e1 in doc('subjects.xml')/Subjects/Subject
    let $idVal := $e1/Id
    group by $idVal
    where not($subjectIds = $idVal)
    return
        <Result>
            <Table1>
                {$e1}
            </Table1>
            <Table2/>
        </Result>

A further optimization might be to reduce the subject IDs, many of which are repeated, to their distinct values first:

    let $subjectIds := distinct-values(doc('courses.xml')/Courses/Course/SubjectId)
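
The same idea of fetching data once in advance can be taken a step further so that both parts of the query collapse into a single pass. The following is a minimal sketch, not a tested drop-in replacement. It assumes an XQuery 3.1 processor (maps are a 3.1 feature, whereas the `group by` used in the question only needs 3.0), and the variable name `$coursesById` is made up for illustration. It indexes the courses by SubjectId once, then looks up each group of subjects in that index, so subjects without courses simply end up with an empty Table2.

```xquery
xquery version "3.1";

(: Declared explicitly in case the processor does not predeclare the prefix. :)
declare namespace map = "http://www.w3.org/2005/xpath-functions/map";

(: Hypothetical helper variable: one pass over courses.xml builds a map
   from SubjectId (as a string) to all Course elements with that id.
   "duplicates": "combine" keeps every course that shares a key. :)
let $coursesById := map:merge(
    for $c in doc("courses.xml")/Courses/Course
    return map:entry(string($c/SubjectId), $c),
    map { "duplicates": "combine" }
)
return
<Results>
    {
        (: Left join: iterate over subjects, grouped by Id. :)
        for $e1 in doc("subjects.xml")/Subjects/Subject
        let $idVal := string($e1/Id)
        group by $idVal
        return
            <Result>
                <Table1>
                    {$e1}
                </Table1>
                <Table2>
                    {
                        (: Returns the empty sequence when no course has this id. :)
                        $coursesById($idVal)
                    }
                </Table2>
            </Result>
    }
</Results>
```

Courses whose SubjectId matches no subject are never looked up and therefore do not appear, which matches the left join asked for. As in the original query, the order of the groups produced by `group by` is implementation-dependent unless an `order by` clause is added. Whether this is actually faster depends on the processor, since some optimizers already rewrite the predicate lookup into an index access on their own.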
