
What is the best way to parse (big) XML in C# Code?

I'm writing a GIS client tool in C# to retrieve "features" in a GML-based XML schema (sample below) from a server. Extracts are limited to 100,000 features.

I guesstimate that the largest extract.xml might get up to around 150 megabytes, so obviously DOM parsers are out. I've been trying to decide between XmlSerializer with XSD.EXE-generated bindings, or XmlReader with a hand-crafted object graph.

Or maybe there's a better way which I haven't considered yet? Like XLINQ, or ...?

Can anybody please guide me? Especially with regard to the memory efficiency of any given approach. If not, I'll have to "prototype" both solutions and profile them side by side.

I'm a bit of a raw prawn in .NET. Any guidance would be greatly appreciated.

Thanking you. Keith.


Sample XML: up to 100,000 of them, with up to 234,600 coords per feature.

<feature featId="27168306" fType="vegetation" fTypeId="1129" fClass="vegetation" gType="Polygon" ID="0" cLockNr="51598" metadataId="51599" mdFileId="NRM/TIS/VEGETATION/9543_22_v3" dataScale="25000">
  <MultiGeometry>
    <geometryMember>
      <Polygon>
        <outerBoundaryIs>
          <LinearRing>
            <coordinates>153.505004,-27.42196 153.505044,-27.422015 153.503992 .... 172 coordinates omitted to save space ... 153.505004,-27.42196</coordinates>
          </LinearRing>
        </outerBoundaryIs>
      </Polygon>
    </geometryMember>
  </MultiGeometry>
</feature>

Use XmlReader to parse large XML documents. XmlReader provides fast, forward-only, non-cached access to XML data. (Forward-only means you can read the XML file from beginning to end but cannot move backwards in the file.) XmlReader uses small amounts of memory, comparable to a simple SAX reader (although XmlReader is a pull parser, where SAX pushes events to callbacks).

    using (XmlReader myReader = XmlReader.Create(@"c:\data\coords.xml"))
    {
        while (myReader.Read())
        {
           // Process each node (myReader.Value) here
           // ...
        }
    }

You can use XmlReader to process files that are up to 2 gigabytes (GB) in size.
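As a slightly fuller illustration of the loop above, here is a minimal sketch that makes one forward pass over an extract and tallies `feature` elements by their `fType` attribute, without building any object graph. The `FeatureCounter` class and the `extract` root element used in the usage note are assumptions for illustration only; the attribute names come from the sample XML in the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class FeatureCounter
{
    // Counts <feature> elements per fType attribute in a single forward pass.
    // Only the current node is held in memory at any time.
    public static Dictionary<string, int> CountByType(TextReader input)
    {
        var counts = new Dictionary<string, int>();
        using (XmlReader reader = XmlReader.Create(input))
        {
            // ReadToFollowing advances to the next element with this name,
            // or returns false once the end of the document is reached.
            while (reader.ReadToFollowing("feature"))
            {
                string fType = reader.GetAttribute("fType") ?? "(none)";
                int n;
                counts.TryGetValue(fType, out n); // n stays 0 if the key is new
                counts[fType] = n + 1;
            }
        }
        return counts;
    }
}
```

For a file on disk you could pass `new StreamReader(path)`; an in-memory string can be wrapped in a `StringReader` for quick experiments.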

Ref: How to read XML from a file by using Visual C#

As at 14 May 2009: I've switched to using a hybrid approach... see code below.

This version has most of the advantages of both:
* the XmlReader/XmlTextReader (memory efficiency --> speed); and
* the XmlSerializer (code-gen --> development expediency and flexibility).

It uses a forward-only reader (an XmlReader obtained from XmlReader.Create in the code below, despite the XmlTextReader in the class name) to iterate through the document, and creates "doclets" which it deserializes using the XmlSerializer and "XML binding" classes generated with XSD.EXE.

I guess this recipe is universally applicable, and it's fast... I'm parsing a 201 MB XML document containing 56,000 GML features in about 7 seconds... the old VB6 implementation of this application took minutes (or even hours) to parse large extracts... so I'm lookin' good to go.

Once again, a BIG Thank You to the forumites for donating your valuable time. I really appreciate it.

Cheers all. Keith.

using System;
using System.Reflection;
using System.Xml;
using System.Xml.Serialization;
using System.IO;
using System.Collections.Generic;

using nrw_rime_extract.utils;
using nrw_rime_extract.xml.generated_bindings;

namespace nrw_rime_extract.xml
{
    internal interface ExtractXmlReader
    {
        rimeType read(string xmlFilename);
    }

    /// <summary>
    /// RimeExtractXml provides bindings to the RIME Extract XML as defined by
    /// $/Release 2.7/Documentation/Technical/SCHEMA and DTDs/nrw-rime-extract.xsd
    /// </summary>
    internal class ExtractXmlReader_XmlSerializerImpl : ExtractXmlReader
    {
        private Log log = Log.getInstance();

        public rimeType read(string xmlFilename)
        {
            log.write(
                string.Format(
                    "DEBUG: ExtractXmlReader_XmlSerializerImpl.read({0})",
                    xmlFilename));
            using (Stream stream = new FileStream(xmlFilename, FileMode.Open))
            {
                return read(stream);
            }
        }

        internal rimeType read(Stream xmlInputStream)
        {
            // create an instance of the XmlSerializer class, 
            // specifying the type of object to be deserialized.
            XmlSerializer serializer = new XmlSerializer(typeof(rimeType));
            serializer.UnknownNode += new XmlNodeEventHandler(handleUnknownNode);
            serializer.UnknownAttribute += 
                new XmlAttributeEventHandler(handleUnknownAttribute);
            // use the Deserialize method to restore the object's state
            // with data from the XML document.
            return (rimeType)serializer.Deserialize(xmlInputStream);
        }

        protected void handleUnknownNode(object sender, XmlNodeEventArgs e)
        {
            log.write(
                string.Format(
                    "XML_ERROR: Unknown Node at line {0} position {1} : {2}\t{3}",
                    e.LineNumber, e.LinePosition, e.Name, e.Text));
        }

        protected void handleUnknownAttribute(object sender, XmlAttributeEventArgs e)
        {
            log.write(
                string.Format(
                    "XML_ERROR: Unknown Attribute at line {0} position {1} : {2}='{3}'",
                    e.LineNumber, e.LinePosition, e.Attr.Name, e.Attr.Value));
        }

    }

    /// <summary>
    /// ExtractXmlReader provides bindings to the extract.xml 
    /// returned by the RIME server; as defined by:
    ///   $/Release X/Documentation/Technical/SCHEMA and 
    /// DTDs/nrw-rime-extract.xsd
    /// </summary>
    internal class ExtractXmlReader_XmlTextReaderXmlSerializerHybridImpl :
        ExtractXmlReader
    {
        private Log log = Log.getInstance();

        public rimeType read(string xmlFilename)
        {
            log.write(
                string.Format(
                    "DEBUG: ExtractXmlReader_XmlTextReaderXmlSerializerHybridImpl." +
                    "read({0})",
                    xmlFilename));

            using (XmlReader reader = XmlReader.Create(xmlFilename))
            {
                return read(reader);
            }

        }

        public rimeType read(XmlReader reader)
        {
            rimeType result = new rimeType();
            // a deserializer for featureClass, feature, etc, "doclets"
            Dictionary<Type, XmlSerializer> serializers = 
                new Dictionary<Type, XmlSerializer>();
            serializers.Add(typeof(featureClassType), 
                newSerializer(typeof(featureClassType)));
            serializers.Add(typeof(featureType), 
                newSerializer(typeof(featureType)));

            List<featureClassType> featureClasses = new List<featureClassType>();
            List<featureType> features = new List<featureType>();
            while (!reader.EOF)
            {
                if (reader.MoveToContent() != XmlNodeType.Element)
                {
                    reader.Read(); // skip non-element nodes (e.g. end tags).
                    continue;
                }

                // deserialize each featureClass element as a standalone "doclet".
                if (reader.Name.Equals("featureClass"))
                {
                    using (
                        StringReader elementReader =
                            new StringReader(reader.ReadOuterXml()))
                    {
                        XmlSerializer deserializer =
                            serializers[typeof (featureClassType)];
                        featureClasses.Add(
                            (featureClassType)
                            deserializer.Deserialize(elementReader));
                    }
                    continue;
                    // ReadOuterXml advances the reader, so don't read again.
                }

                if (reader.Name.Equals("feature"))
                {
                    using (
                        StringReader elementReader =
                            new StringReader(reader.ReadOuterXml()))
                    {
                        XmlSerializer deserializer =
                            serializers[typeof (featureType)];
                        features.Add(
                            (featureType)
                            deserializer.Deserialize(elementReader));
                    }
                    continue;
                    // ReadOuterXml advances the reader, so don't read again.
                }

                log.write(
                    "WARNING: unknown element '" + reader.Name +
                    "' was skipped during parsing.");
                reader.Read(); // skip non-element-nodes and unknown-elements.
            }
            result.featureClasses = featureClasses.ToArray();
            result.features = features.ToArray();
            return result;
        }

        private XmlSerializer newSerializer(Type elementType)
        {
            XmlSerializer serializer = new XmlSerializer(elementType);
            serializer.UnknownNode += new XmlNodeEventHandler(handleUnknownNode);
            serializer.UnknownAttribute += 
                new XmlAttributeEventHandler(handleUnknownAttribute);
            return serializer;
        }

        protected void handleUnknownNode(object sender, XmlNodeEventArgs e)
        {
            log.write(
                string.Format(
                    "XML_ERROR: Unknown Node at line {0} position {1} : {2}\t{3}",
                    e.LineNumber, e.LinePosition, e.Name, e.Text));
        }

        protected void handleUnknownAttribute(object sender, XmlAttributeEventArgs e)
        {
            log.write(
                string.Format(
                    "XML_ERROR: Unknown Attribute at line {0} position {1} : {2}='{3}'",
                    e.LineNumber, e.LinePosition, e.Attr.Name, e.Attr.Value));
        }
    }
}
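One possible variation on the hybrid above, offered as a sketch rather than a drop-in replacement: instead of copying each element into a string with `ReadOuterXml` and re-parsing it, the serializer can be handed a subtree reader from `XmlReader.ReadSubtree()`, which avoids the intermediate string for very large features. `FeatureDoclet` here is a hypothetical, cut-down stand-in for the XSD.EXE-generated `featureType` binding, and `SubtreeHybrid` is likewise an illustrative name.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical stand-in for the generated featureType binding class.
[XmlRoot("feature")]
public class FeatureDoclet
{
    [XmlAttribute("featId")] public string FeatId { get; set; }
    [XmlAttribute("fType")] public string FType { get; set; }
}

public static class SubtreeHybrid
{
    public static List<FeatureDoclet> ReadFeatures(TextReader input)
    {
        // Create the serializer once and reuse it for every doclet.
        var serializer = new XmlSerializer(typeof(FeatureDoclet));
        var features = new List<FeatureDoclet>();
        using (XmlReader reader = XmlReader.Create(input))
        {
            while (reader.ReadToFollowing("feature"))
            {
                // ReadSubtree exposes only the current element and its descendants.
                using (XmlReader subtree = reader.ReadSubtree())
                {
                    features.Add((FeatureDoclet)serializer.Deserialize(subtree));
                }
                // Disposing the subtree reader leaves the outer reader on this
                // element's end tag (or on the element itself if it is empty),
                // so the next ReadToFollowing call moves past it.
            }
        }
        return features;
    }
}
```

Whether this is measurably faster than the `ReadOuterXml` version would need profiling on real extracts; the 7-second figure quoted above applies to the original code, not to this sketch.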

Just to summarise, and make the answer a bit more obvious for anyone who finds this thread in Google.

Prior to .NET 2 the XmlTextReader was the most memory-efficient XML parser available in the standard API (thanx Mitch ;-)

.NET 2 introduced the XmlReader.Create factory method (the abstract XmlReader class itself predates it), which is better again. It's a forward-only element iterator (a bit like a StAX parser). (thanx Cerebrus ;-)

And remember kiddies, if any XML instance has the potential to be bigger than about 500k, DON'T USE DOM!
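On the XLINQ part of the original question: LINQ to XML does not have to load the whole document either. A common pattern, sketched here under the assumption that .NET 3.5 or later is available, is to drive an XmlReader and materialize one small XElement at a time with `XNode.ReadFrom`. The `XLinqStreaming` class and `StreamElements` method are illustrative names, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class XLinqStreaming
{
    // Lazily yields each element named elementName as its own XElement,
    // so only one such element is held in memory at a time.
    public static IEnumerable<XElement> StreamElements(TextReader input, string elementName)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
                {
                    // XNode.ReadFrom consumes this element and leaves the reader
                    // on the node that follows it, so no extra Read() is needed.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }
}
```

Each yielded element can then be queried with ordinary LINQ to XML calls, for example `(string)element.Attribute("featId")`.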

Cheers all. Keith.

A SAX parser might be what you're looking for. SAX does not require you to read the entire document into memory; it parses through it incrementally and allows you to process the elements as you go. I don't know if there is a SAX parser provided in .NET, but there are a few open-source options that you could look at:

Here's a related post:

Just wanted to add this simple extension method as an example of using XmlReader (as Mitch answered):

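using System.Xml;

public static class XmlReaderExtensions
{
    // Reads one node forward (stepping into the current element, if any), then
    // moves from node to node without descending into child subtrees until it
    // finds an element named elementName. Returns false at end of document.
    public static bool SkipToElement (this XmlReader xmlReader, string elementName)
    {
        if (!xmlReader.Read ())
            return false;

        while (!xmlReader.EOF)
        {
            if (xmlReader.NodeType == XmlNodeType.Element && xmlReader.Name == elementName)
                return true;

            xmlReader.Skip ();
        }

        return false;
    }
}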

And usage:

using (var xml_reader = XmlReader.Create (this.source.Url))
{
    if (!xml_reader.SkipToElement ("Root"))
        throw new InvalidOperationException ("XML element \"Root\" was not found.");

    if (!xml_reader.SkipToElement ("Users"))
        throw new InvalidOperationException ("XML element \"Root/Users\" was not found.");

    ...
}
