Fastest way to gather nodes from a huge number of XML files

Question

I have a huge number (400,000) of large XML files (200 to 4,000 lines each, with 40 parent-child relationships). I would like to parse them all and gather all the nodes that exist in them.

With an XML like:

<tag1>
<tag2>
    <tag3>Content3</tag3>
</tag2>
<tag2>
    <tag4>Content4</tag4>
</tag2>
<tag2>
    <tag4>Content4</tag4>
</tag2>
<tag2>
    <tag5><tag6>Content6</tag6></tag5>
</tag2>
</tag1>

I would like to get

tag1
tag1>tag2
tag1>tag2>tag3
tag1>tag2
tag1>tag2>tag4
tag1>tag2
tag1>tag2>tag4
tag1>tag2
tag1>tag2>tag5
tag1>tag2>tag5>tag6

or at least this (leaves removed):

tag1
tag1>tag2
tag1>tag2
tag1>tag2
tag1>tag2
tag1>tag2>tag5

This is because my real goal is to check the nodes, which are modeled as tables in the target database.
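The full listing above (one path per element, in document order) can be sketched with a streaming parser. This is only an illustration in Python, which the question does not mention; `iter_paths` and `sample` are names made up for this sketch, and its speed on 400,000 files has not been measured.

```python
import io
import xml.etree.ElementTree as ET


def iter_paths(source, sep=">"):
    """Yield the tag path of every element, in document order.

    `source` is a filename or a binary file object (hypothetical helper).
    Namespaced tags would appear in ElementTree's '{uri}name' form.
    """
    stack = []  # tags of the elements that are currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            yield sep.join(stack)
        else:
            stack.pop()
            elem.clear()  # release text and children that were already visited


# The sample document from the question
sample = b"""<tag1>
<tag2>
    <tag3>Content3</tag3>
</tag2>
<tag2>
    <tag4>Content4</tag4>
</tag2>
<tag2>
    <tag4>Content4</tag4>
</tag2>
<tag2>
    <tag5><tag6>Content6</tag6></tag5>
</tag2>
</tag1>"""

for path in iter_paths(io.BytesIO(sample)):
    print(path)
```

Passing a filename instead of `io.BytesIO(sample)` reads straight from disk, which is the file-based case the question cares about.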

The output can be a query result, a table, or a file; I don't mind.

The final objective is to use this data to check whether SSIS, which is used to load the XML content into a database, has missed any nodes. In fact we KNOW it has missed some, so now we must find which ones.

I have checked the SQL Server 2012 XML features, but I have two issues:
- They give me no indication of performance with FILES. I need the fastest way when reading files, not when the XML content is already in a string.
- They are a bit cumbersome.

I have built a solution of my own with QlikView that checks whether each possible node (I have the XSD) is present in the XML and writes the result to a file. It works, but it is too slow (1 to 2 s per XML file).

Thanks, guys!
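For the stated goal (finding table-like nodes that were never loaded), one possible approach is a single streaming pass per file that counts only the paths of elements with child elements, followed by a set difference. The sketch below is in Python, which the question does not mention, and makes several assumptions: `non_leaf_paths`, `scan_directory` and `missing_paths` are hypothetical helpers, the `loaded` set is invented sample data standing in for whatever SSIS actually loaded, and whether this beats 1 to 2 s per file would have to be measured.

```python
import tempfile
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def non_leaf_paths(source, sep=">"):
    """Count the tag path of every element that has at least one child element."""
    counts = Counter()
    stack = []  # [tag, has_child] for each element that is currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if stack:
                stack[-1][1] = True  # the parent now has a child element
            stack.append([elem.tag, False])
        else:
            if stack[-1][1]:
                counts[sep.join(tag for tag, _ in stack)] += 1
            stack.pop()
            elem.clear()  # release content that was already visited
    return counts


def scan_directory(folder, pattern="*.xml"):
    """Sum the non-leaf path counts over every matching file in `folder`."""
    totals = Counter()
    for xml_file in sorted(Path(folder).glob(pattern)):
        totals.update(non_leaf_paths(str(xml_file)))
    return totals


def missing_paths(seen, loaded):
    """Paths found in the files but absent from the loaded set."""
    return sorted(set(seen) - set(loaded))


# Demo on two small temporary files (made-up data based on the question's sample)
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "a.xml").write_text(
        "<tag1><tag2><tag3>Content3</tag3></tag2>"
        "<tag2><tag5><tag6>Content6</tag6></tag5></tag2></tag1>"
    )
    Path(folder, "b.xml").write_text("<tag1><tag2><tag4>Content4</tag4></tag2></tag1>")
    totals = scan_directory(folder)

loaded = {"tag1", "tag1>tag2"}  # hypothetical: paths known to exist as tables
missing = missing_paths(totals, loaded)
print(totals)
print(missing)
```

Because each file is processed independently, the per-file function could also be distributed over several worker processes; that is left out here to keep the sketch short.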

Answer 1

I was looking for unanswered T-SQL/XML questions and found yours. It made me curious. I don't know whether this is still needed today, but here is my suggestion:

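It handles nesting down to any depth. Note, though, that the path is reset to the root after every leaf, so the result is only correct when the element following each leaf is a direct child of the root, as in the sample.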

I must admit that I normally do not use CURSORs, but in this case I did not find another approach. If you don't mind, it would be nice if you tested its speed and posted a short reply, just out of curiosity :-)

DECLARE @x XML=
'<tag1>
  <tag2>
    <tag3>Content3</tag3>
  </tag2>
  <tag2>
    <tag4>Content4</tag4>
  </tag2>
  <tag2>
    <tag4>Content4</tag4>
  </tag2>
  <tag2>
    <tag5>
      <tag6>Content6</tag6>
    </tag5>
  </tag2>
</tag1>';

CREATE TABLE #HelpTable(NodeIndex INT UNIQUE,NextNodeName VARCHAR(100),HasChildren BIT);
CREATE TABLE #FinalTags(ID INT IDENTITY,TagNames VARCHAR(1000));

WITH RootNode AS
(
    SELECT RN.value('local-name(.)','varchar(100)') AS RN_Name
          ,RN.query('.') AS RN_Node
    FROM @x.nodes('*') AS The(RN)
)
,AnalyzeNodes AS
(
    SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) * 10 AS NodeIndex 
          ,RN_Name
          ,TheNext.Nodes.value('local-name(.)','varchar(100)') AS NextNodeName
          ,CASE WHEN TheNext.Nodes.value('count(./*)','int')=0 THEN 0 ELSE 1 END AS HasChildren
    FROM RootNode
    CROSS APPLY RN_Node.nodes('//*') AS TheNext(Nodes)
) 

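-- Store every element, plus one extra row after each leaf that names the root again;
-- the cursor further down uses that extra row to restart the path at the root.
INSERT INTO #HelpTable
SELECT AnalyzeNodes.NodeIndex,AnalyzeNodes.NextNodeName,AnalyzeNodes.HasChildren
FROM AnalyzeNodes
UNION ALL
SELECT an.NodeIndex+1,RN_Name,1 
FROM AnalyzeNodes AS an
WHERE an.HasChildren=0;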

DECLARE @collect VARCHAR(1000)='';
DECLARE @tag VARCHAR(100);
DECLARE @children BIT;

DECLARE cur CURSOR FAST_FORWARD
FOR
    SELECT NextNodeName,HasChildren 
    FROM #HelpTable
    ORDER BY NodeIndex;

OPEN cur;

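FETCH NEXT FROM cur INTO @tag,@children;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Emit the path collected so far, extended by the current tag
    INSERT INTO #FinalTags VALUES(@collect +  '>' + @tag);
    IF @children=0
        SET @collect='';  -- leaf reached: start over, the next row re-adds the root
    ELSE
        SET @collect=@collect + '>' + @tag;

    FETCH NEXT FROM cur INTO @tag,@children;
END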

CLOSE cur;
DEALLOCATE cur;

-- Strip the leading '>' and drop the repeated root rows that only served to reset the path
SELECT SUBSTRING(TagNames,2,1000) AS TagNames
FROM #FinalTags
WHERE ID=1 OR TagNames<>(SELECT ft.TagNames FROM #FinalTags AS ft WHERE ft.ID=1)
ORDER BY ID,TagNames;

DROP TABLE #FinalTags;
DROP TABLE #HelpTable;

The result:

tag1
tag1>tag2
tag1>tag2>tag3
tag1>tag2
tag1>tag2>tag4
tag1>tag2
tag1>tag2>tag4
tag1>tag2
tag1>tag2>tag5
tag1>tag2>tag5>tag6
