How to set the start point for reading an xml file?

Question

I have a large XML document (111 MB) and want to jump to a specific node (by index) very quickly. The document has about 1,000,000 nodes like this:

<Kt>
<PLZ>01067</PLZ>
<Ort>Dresden</Ort>
<OT>NULL</OT>
<Strasse>Potthoffstr.</Strasse>
</Kt>

I want to "jump", for example, to the one-millionth node in the document and start reading from there. All nodes before it should be skipped. I have already tried XmlReader, but it always starts reading from the first node.

        int i = 0; // number of <Kt> nodes read so far
        // 1000000 is the index of the node I want to go to
        while (i < 1000000 && reader.Read())
        {
            if (reader.Name == "PLZ")
            {
                textBox1.Text = reader.ReadString();
            }

            if (reader.Name == "Ort")
            {
                textBox2.Text = reader.ReadString();
            }

            if (reader.Name == "OT")
            {
                textBox3.Text = reader.ReadString();
            }

            if (reader.Name == "Strasse")
            {
                textBox4.Text = reader.ReadString();
                i++; // <Strasse> is the last child, so one <Kt> node is complete
            }
        }

This is the structure of the XML document:

<?xml version="1.0" encoding="UTF-8"?>
<dataroot xmlns:od="urn:schemas-microsoft-com:officedata" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"  xsi:noNamespaceSchemaLocation="Kt.xsd" generated="2014-10-21T18:20:30">
<Kt>
<PLZ>01...</PLZ>
<Ort>Dresden</Ort>
<OT>NULL</OT>
<Strasse>NULL</Strasse>
</Kt>
<Kt>
<PLZ>01067</PLZ>
<Ort>Dresden</Ort>
<OT>Innere Altstadt</OT>
<Strasse>Marienstr.</Strasse>
</Kt>
<Kt>
<PLZ>01067</PLZ>
<Ort>Dresden</Ort>
<OT>NULL</OT>
<Strasse>Potthoffstr.</Strasse>
</Kt>

In other words: what are the options for loading part of a large XML file without reading the complete file?

Answer 1

You will have to read all the data up to that point, because XML (in common with most text-based serialization formats) does not lend itself to skipping data. XmlReader has some helper methods to assist with this, like ReadToNextSibling and ReadToFollowing. Basically, that's the best you'll do unless you pre-index the file (separately) with the byte offsets of various elements (say, every 100th or 1000th element). Doing that means you'd be working in fragment (rather than document) mode, and you'd need to be very careful about namespaces (particularly aliases declared on the document root).
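As a rough illustration of those helper methods, here is a minimal sketch that moves the reader forward to the <Kt> element at a given zero-based index without touching the child elements of the skipped nodes. The names KtSkipDemo, SkipToKt and ReadStrasseAt are made up for this example, and the inline XML is an abbreviated stand-in for the real file. The reader still has to parse everything before the target; the helpers only save you from handling each node yourself.

```csharp
using System;
using System.IO;
using System.Xml;

static class KtSkipDemo
{
    // Moves the reader onto the <Kt> element with the given zero-based index.
    // Returns false if the document has fewer <Kt> elements than that.
    public static bool SkipToKt(XmlReader reader, int index)
    {
        // Advance to the first <Kt> in document order.
        if (!reader.ReadToFollowing("Kt"))
            return false;

        // Each call skips the current <Kt> subtree and stops on the next sibling <Kt>.
        for (int i = 0; i < index; i++)
        {
            if (!reader.ReadToNextSibling("Kt"))
                return false;
        }
        return true;
    }

    // Hypothetical helper: returns the <Strasse> text of the <Kt> at the given index, or null.
    public static string ReadStrasseAt(string xml, int index)
    {
        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            if (!SkipToKt(reader, index) || !reader.ReadToDescendant("Strasse"))
                return null;
            return reader.ReadElementContentAsString();
        }
    }

    static void Main()
    {
        // Abbreviated sample in the shape of the question's document (assumption).
        const string xml =
            "<dataroot>" +
            "<Kt><PLZ>01067</PLZ><Ort>Dresden</Ort><OT>NULL</OT><Strasse>NULL</Strasse></Kt>" +
            "<Kt><PLZ>01067</PLZ><Ort>Dresden</Ort><OT>Innere Altstadt</OT><Strasse>Marienstr.</Strasse></Kt>" +
            "<Kt><PLZ>01067</PLZ><Ort>Dresden</Ort><OT>NULL</OT><Strasse>Potthoffstr.</Strasse></Kt>" +
            "</dataroot>";

        Console.WriteLine(ReadStrasseAt(xml, 2)); // prints "Potthoffstr."
    }
}
```

For the real file you would pass a file path or FileStream to XmlReader.Create instead of a StringReader.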

Basically, what you are doing seems about right, if we start with the premise of having a 111 MB, multi-million-element XML file. Frankly, my advice would be: don't do that in the first place. XML is not a good choice for huge data, unless it is purely as a dead-drop, perhaps to be bulk-loaded again later. It does not allow for efficient random access.
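If you do want something closer to random access, the pre-indexing idea mentioned above might look roughly like the sketch below. It is only a sketch under several assumptions: the file is UTF-8, every record starts with the literal bytes `<Kt>` (no attributes), and that byte sequence never appears inside comments or CDATA. KtOffsetIndexDemo, BuildIndex and ReadStrasseAt are hypothetical names. The index is built in one pass over the raw bytes, and the lookup then seeks to the nearest checkpoint and parses from there in fragment mode.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

static class KtOffsetIndexDemo
{
    static readonly byte[] Tag = Encoding.UTF8.GetBytes("<Kt>");

    // Scans the stream once and records the byte offset of every step-th <Kt> start tag.
    public static List<long> BuildIndex(Stream stream, int step)
    {
        var offsets = new List<long>();
        long position = 0; // bytes consumed so far
        int matched = 0;   // how many bytes of Tag currently match
        int count = 0;     // number of <Kt> tags seen so far
        int b;
        while ((b = stream.ReadByte()) != -1)
        {
            position++;
            if (b == Tag[matched])
                matched++;
            else
                matched = (b == Tag[0]) ? 1 : 0; // '<' only occurs at the start of Tag

            if (matched == Tag.Length)
            {
                if (count % step == 0)
                    offsets.Add(position - Tag.Length);
                count++;
                matched = 0;
            }
        }
        return offsets;
    }

    // Seeks to the checkpoint at or before the wanted <Kt>, then walks forward to it.
    // An index past the last <Kt> is not handled here: the reader would run into the
    // stray </dataroot> end tag and throw an XmlException.
    public static string ReadStrasseAt(Stream stream, List<long> offsets, int step, int index)
    {
        stream.Seek(offsets[index / step], SeekOrigin.Begin);
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        using (var reader = XmlReader.Create(stream, settings))
        {
            if (!reader.ReadToFollowing("Kt"))
                return null;
            for (int i = 0; i < index % step; i++)
            {
                if (!reader.ReadToNextSibling("Kt"))
                    return null;
            }
            return reader.ReadToDescendant("Strasse") ? reader.ReadElementContentAsString() : null;
        }
    }

    static void Main()
    {
        // Abbreviated sample in the shape of the question's document (assumption).
        const string xml =
            "<dataroot>" +
            "<Kt><PLZ>01067</PLZ><Ort>Dresden</Ort><OT>NULL</OT><Strasse>NULL</Strasse></Kt>" +
            "<Kt><PLZ>01067</PLZ><Ort>Dresden</Ort><OT>Innere Altstadt</OT><Strasse>Marienstr.</Strasse></Kt>" +
            "<Kt><PLZ>01067</PLZ><Ort>Dresden</Ort><OT>NULL</OT><Strasse>Potthoffstr.</Strasse></Kt>" +
            "</dataroot>";

        using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(xml)))
        {
            List<long> offsets = BuildIndex(stream, 2); // checkpoints at <Kt> 0 and <Kt> 2
            Console.WriteLine(ReadStrasseAt(stream, offsets, 2, 1)); // prints "Marienstr."
        }
    }
}
```

With a real 111 MB file you would wrap the FileStream in a BufferedStream for the scan, use a step such as 1000, and store the offsets alongside the file so the scan only happens once. Because parsing starts in the middle of the document, namespace prefixes declared on the root element are not visible to the reader; the sample records use none, which is why this works here.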

Answer 2

If you need to do this often, then you're doing the wrong thing. The data should be in a database, or at the very least, stored in smaller chunks.

If you're not doing it often, then is it really a problem? I would expect it to be doable in 5 seconds or so.

2 answers

solution1
6 2014-10-31 12:14:19

solution2
1 2014-10-31 15:38:26