Parsing an 80 GB XML File in C#

I have to parse an 80 GB XML file to get some data out of it. I am using XmlReader for this. When I tested the code with a 304 MB file, it parsed the file within 4 seconds, so I thought it would work for 80 GB too. But after a few minutes it throws an out-of-memory exception.

I have the following code:

static void Main(string[] args)
{
    List<Test> lstTest = new List<Test>();
    bool isTitle = false;
    bool isText = false;

    using (XmlReader Reader = XmlReader.Create(FilePath))
    {
        Test tt = new Test();
        while (Reader.Read())
        {
            switch (Reader.NodeType)
            {
                case XmlNodeType.Element:
                    if (Reader.Name == "title")
                    {
                        isTitle = true;
                    }
                    if (Reader.Name == "text")
                    {
                        isText = true;
                    }
                    break;
                case XmlNodeType.Text:
                    if (isTitle)
                    {
                        tt.Title = Reader.Value;
                        isTitle = false;
                    }

                    if (isText)
                    {
                        tt.Text = Reader.Value;
                        isText = false;
                    }
                    break;
            }

            if (tt.Text != null)
            {
                lstTest.Add(tt);
                tt = new Test();
            }
        }
    }
}

Please suggest a solution. Thanks for your help.

You are correct, XmlReader is the right way to go. And it's not the XmlReader that is running out of memory: it's your lstTest, where you shove most of the nodes that you find.

The correct way to use XmlReader is to process each node and then forget about it, moving on. You can write the results to disk, or calculate some running totals, or whatever, but don't keep everything you read in memory. That defeats the very purpose of XmlReader.
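
To make the "process and forget" idea concrete, here is a minimal sketch (not the asker's actual code). It assumes the elements are named title and text as in the question and that both contain only text, since ReadElementContentAsString throws if it meets child elements. The names StreamingExtractor and Extract, and the tab-separated output of title and text length, are made up for illustration.

```csharp
using System;
using System.IO;
using System.Xml;

public static class StreamingExtractor
{
    // Hypothetical helper: writes one "title<TAB>text length" line per
    // <title>/<text> pair and returns how many pairs were written.
    // Only the current pair is held in memory at any time.
    public static long Extract(TextReader xmlInput, TextWriter output)
    {
        long count = 0;
        string title = null;

        using (XmlReader reader = XmlReader.Create(xmlInput))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "title")
                {
                    // Reads the element's text and moves past its end tag,
                    // so no extra Read() call is needed in this branch.
                    title = reader.ReadElementContentAsString();
                }
                else if (reader.NodeType == XmlNodeType.Element && reader.Name == "text")
                {
                    string text = reader.ReadElementContentAsString();
                    output.WriteLine(title + "\t" + text.Length);
                    title = null; // forget the pair once it is written
                    count++;
                }
                else
                {
                    reader.Read();
                }
            }
        }

        return count;
    }
}
```

For a real file you could pass File.OpenText(inputPath) and a new StreamWriter(outputPath), each wrapped in a using block. Both paths are placeholders here.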

You shouldn't store EVERYTHING in memory; keep only the parts that interest you.

This can be done with IEnumerable<> and the yield return statement:

public IEnumerable<Test> ParseXml(string path)
{
    bool isTitle = false;
    bool isText = false;

    // Open the file passed in as the argument.
    using (XmlReader Reader = XmlReader.Create(path))
    {
        Test tt = new Test();
        while (Reader.Read())
        {
            switch (Reader.NodeType)
            {
                case XmlNodeType.Element:
                    if (Reader.Name == "title")
                    {
                        isTitle = true;
                    }
                    if (Reader.Name == "text")
                    {
                        isText = true;
                    }
                    break;

                case XmlNodeType.Text:
                    if (isTitle)
                    {
                        tt.Title = Reader.Value;
                        isTitle = false;
                    }

                    if (isText)
                    {
                        tt.Text = Reader.Value;
                        isText = false;
                    }
                    break;
            }

            if (tt.Text != null)
            {
                // Hand the finished item to the caller instead of storing it in a list.
                yield return tt;
                tt = new Test();
            }
        }
    }
}

Usage:

var data = ParseXml(/* your xml file */);

// select the part that you are interested in
var interestingTests = data
    .Where(x => x.Title == "...");

foreach (var test in interestingTests)
{
    // work with the interesting parts
}
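
One thing to watch with this approach: the sequence is only read as far as the caller asks, so calling ToList() on it would pull everything into memory again. The sketch below illustrates the lazy behaviour with a stand-in iterator rather than the real parser. LazyDemo, Titles, Produced and the sample strings are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazyDemo
{
    // Counts how many items the iterator has yielded so far.
    public static int Produced;

    // Stand-in for ParseXml: yields items one at a time instead of building a list.
    public static IEnumerable<string> Titles()
    {
        foreach (string title in new[] { "alpha", "beta", "gamma", "delta" })
        {
            Produced++;
            yield return title;
        }
    }

    // Returns the first title with the given prefix, or null if none matches.
    // Enumeration stops as soon as a match is found.
    public static string FirstStartingWith(string prefix)
    {
        Produced = 0;
        return Titles()
            .Where(t => t.StartsWith(prefix, StringComparison.Ordinal))
            .FirstOrDefault();
    }
}
```

After LazyDemo.FirstStartingWith("b"), Produced is 2: the iterator never went past "beta", and the same holds for a parser that yields one item per XML record.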
