Reading XML File (File size > 500 MB)

I'm trying to parse a large XML file (about 600 MB) and using

It takes a long time, and finally the entire process is aborted. The process ends with an exception.

Message: "Thread is aborted"

Method:

private string ReadXml(XmlTextReader reader, string fileName)
{
    string finalXML = "";
    string s1 = "";
    try
    {
        while (reader.Read())
        {
            switch (reader.NodeType)
            {
                case XmlNodeType.Element: // The node is an element.
                    s1 += "<" + reader.Name + ">";
                    break;
                case XmlNodeType.Text: //Display the text in each element.
                    s1 += reader.Value;
                    break;
                case XmlNodeType.EndElement: //Display the end of the element.
                    s1 += "</" + reader.Name + ">";
                    break;
            }
            finalXML = s1;
        }
    }
    catch (Exception ex)
    {
       Logger.Logger.LogMessage(ex, "File Processing error: " + fileName);
    }
    reader.Close();
    reader.Dispose();

    return finalXML;
}

And then reading and deserializing:

string finalXML = string.Empty;
XmlTextReader reader = new XmlTextReader(unzipfile);
finalXML = await ReadXml(reader, fileName);

var xmlremovenamespae = Helper.RemoveAllNamespaces(finalXML);
XmlParseObjectNew.BizData myxml = new XmlParseObjectNew.BizData();

using (StringReader sr = new StringReader(xmlremovenamespae))
 {
       XmlSerializer serializer = new XmlSerializer(typeof(XmlParseObjectNew.BizData));
       myxml = (XmlParseObjectNew.BizData)serializer.Deserialize(sr);
 }

Is there a better way to read and parse a large XML file? I need a suggestion.

The problem is, as mentioned by Jon Skeet and DiskJunky, that your dataset is simply too large to load into memory and your code is not optimized for handling this. That is why various classes are throwing an 'out of memory' exception at you.

First of all, string concatenation. Building a large string with simple concatenation (a + b) in a loop is usually a bad idea: strings are immutable, so every concatenation allocates a new string and copies everything accumulated so far. I would recommend looking up online how to handle string concatenation effectively (for example, Jon Skeet's Concatenating Strings Efficiently).
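To make the concatenation point concrete, here is a minimal sketch comparing the two approaches. The `parts` array is a made-up stand-in for the nodes read from the file, and the snippet assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Text;

// Hypothetical input: a few tokens standing in for the nodes read from the file.
string[] parts = { "<a>", "text", "</a>" };

// Repeated += allocates a new string and copies the accumulated text each time.
string slow = "";
foreach (string p in parts)
{
    slow += p;
}

// StringBuilder appends into a growable buffer and creates the string once.
var sb = new StringBuilder();
foreach (string p in parts)
{
    sb.Append(p);
}
string fast = sb.ToString();

// Both approaches produce the same text; only the cost differs.
Console.WriteLine(slow == fast);
```

With three tokens the difference is invisible, but with millions of nodes the `+=` version copies an ever-growing string on each pass.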

However, that is only an optimization of your code; the main issue is the sheer size of the XML file you are trying to load into memory. To handle large datasets it is usually better to 'stream' the data, processing it in chunks instead of loading the entire file.


As you have not shown an example of your XML, I took the liberty of making a simple example to illustrate what I mean.

Suppose you have the following XML:

<root>
   <specialelement>
      <value1>somevalue</value1>
      <value2>somevalue</value2>
   </specialelement>
   <specialelement>
      <value1>someothervalue</value1>
      <value2>someothervalue</value2>
   </specialelement>
   ... 
</root>

From this XML you want to parse each specialelement into an object with the following class definition:

[XmlRoot("specialelement")]
public class ExampleClass
{
    [XmlElement(ElementName = "value1")]
    public string Value1 { get; set; }    
    [XmlElement(ElementName = "value2")]
    public string Value2 { get; set; }
}

I'll assume we can process each specialelement individually, and define a handler for it as follows:

public void HandleElement(ExampleClass item)
{
    // Process stuff
}

Now we can use the XmlTextReader to read each node in the XML individually. When we reach a specialelement, we keep track of the data contained within that element. When we reach the end of the specialelement, we deserialize it into an object and send it to our handler for processing. For example:

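// Create the serializer once and reuse it for every element
var serializer = new XmlSerializer(typeof(ExampleClass));

using (var reader = new XmlTextReader( /* your inputstream */ ))
{
    // Buffer for the element contents
    StringBuilder sb = new StringBuilder(1000);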

    // Read till next node
    while (reader.Read())
    {
        switch (reader.NodeType)
        {
            case XmlNodeType.Element: 
                // Clear the stringbuilder when we start with our element 
                if (string.Equals(reader.Name, "specialelement"))
                {
                    sb.Clear();
                }

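                // Append current element without namespace; self-closing
                // elements produce no EndElement node, so close them here
                sb.Append("<").Append(reader.Name).Append(reader.IsEmptyElement ? "/>" : ">");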
                break;

            case XmlNodeType.Text: // Append the text content, re-escaped so the fragment stays valid XML
                sb.Append(System.Security.SecurityElement.Escape(reader.Value));
                break;

            case XmlNodeType.EndElement: 

                // Append the closure element
                sb.Append("</").Append(reader.Name).Append(">");

                // Check if we have finished reading our element
                if (string.Equals(reader.Name, "specialelement"))
                {
                    // The stringbuilder now contains the entire 'SpecialElement' part
                    using (TextReader textReader = new StringReader(sb.ToString()))
                    {
                        // Deserialize
                        var deserializedElement = (ExampleClass)serializer.Deserialize(textReader);
                        // Send to handler
                        HandleElement(deserializedElement);
                    }
                }

                break;
        }
    }
}

Because we process the data as it comes in from the stream, we do not have to load the entire file into memory. This keeps the memory usage of the program low and prevents out-of-memory exceptions.

Check out this fiddle to see it in action.

Note that this is a quick example; there are still plenty of places where you can improve and optimize this code further.
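One possible variation, sketched here as an alternative rather than as part of the answer above, is to let the framework materialize each element instead of rebuilding its markup in a StringBuilder. `XNode.ReadFrom` reads the element at the reader's current position into an `XElement` and advances the reader past it, so only one specialelement is held in memory at a time. The `StreamElements` helper and the inline sample string are hypothetical names introduced for illustration, and the snippet assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

// Sample input; for a real file, pass a StreamReader over it instead.
const string sample =
    "<root>" +
    "<specialelement><value1>a</value1><value2>b</value2></specialelement>" +
    "<specialelement><value1>c</value1><value2>d</value2></specialelement>" +
    "</root>";

var seen = new List<string>();
foreach (XElement item in StreamElements(new StringReader(sample), "specialelement"))
{
    // Only the current element is materialized at this point.
    seen.Add(item.Element("value1")?.Value + "/" + item.Element("value2")?.Value);
}
Console.WriteLine(string.Join(", ", seen));

// Hypothetical helper: lazily yields every element with the given local name.
static IEnumerable<XElement> StreamElements(TextReader input, string elementName)
{
    using (XmlReader reader = XmlReader.Create(input))
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.LocalName == elementName)
            {
                // ReadFrom consumes the whole element and leaves the reader
                // on the node that follows it, so no extra Read() is needed.
                yield return (XElement)XNode.ReadFrom(reader);
            }
            else
            {
                reader.Read();
            }
        }
    }
}
```

Matching on `LocalName` ignores any namespace prefix, which plays a similar role to stripping namespaces by hand. If typed objects are needed, each `XElement` could be passed on to an `XmlSerializer` via `item.CreateReader()`.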

I tried this and it works fine.

fileName = "your file path";

Try this code; it parses an XML file larger than 500 MB within a few seconds.

using (TextReader textReader = new StreamReader(fileName))
{
    using (XmlTextReader reader = new XmlTextReader(textReader))
    {
        // Turn off namespace support so element names are matched as written
        reader.Namespaces = false;
        // YourXmlClassType is a placeholder for your own serializable class
        XmlSerializer serializer = new XmlSerializer(typeof(YourXmlClassType));
        parseData = (YourXmlClassType)serializer.Deserialize(reader);
    }
}
