
Process very large XML file

I need to process an XML file with the following structure:

<FolderSizes>
    <Version></Version>
    <DateTime Un=""></DateTime>
    <Summary>
        <TotalSize Bytes=""></TotalSize>
        <TotalAllocated Bytes=""></TotalAllocated>
        <TotalAvgFileSize Bytes=""></TotalAvgFileSize>
        <TotalFolders Un=""></TotalFolders>
        <TotalFiles Un=""></TotalFiles>
    </Summary>
    <DiskSpaceInfo>
        <Drive Type="" Total="" TotalBytes="" Free="" FreeBytes="" Used=""
               UsedBytes=""><![CDATA[ ]]></Drive>
    </DiskSpaceInfo>
    <Folder ScanState="">
        <FullPath Name=""><![CDATA[ ]]></FullPath>
        <Attribs Int=""></Attribs>
        <Size Bytes=""></Size>
        <Allocated Bytes=""></Allocated>
        <AvgFileSz Bytes=""></AvgFileSz>
        <Folders Un=""></Folders>
        <Files Un=""></Files>
        <Depth Un=""></Depth>
        <Created Un=""></Created>
        <Accessed Un=""></Accessed>
        <LastMod Un=""></LastMod>
        <CreatedCalc Un=""></CreatedCalc>
        <AccessedCalc Un=""></AccessedCalc>
        <LastModCalc Un=""></LastModCalc>
        <Perc><![CDATA[ ]]></Perc>
        <Owner><![CDATA[ ]]></Owner>

        <!-- Special element; see paragraph below -->
        <Folder></Folder>
    </Folder>
</FolderSizes>

The <Folder> element is special in that it repeats within the <FolderSizes> element but can also appear within itself; I reckon up to about 5 levels deep.

The problem is that the file is really big, at a whopping 11GB, so I'm having difficulty processing it - I have experience with XML documents, but nothing on this scale.

What I would like to do is import the information into a SQL database, because then I will be able to process the information in any way necessary without having to concern myself with this immense, impractical file.

Here are the things I have tried:

  • Simply load the file and attempt to process it with a simple C# program using an XmlDocument or XDocument object
    • Before I even started I knew this would not work, as I'm sure everyone would agree, but I tried it anyway, and ran the application on a VM with 30GB memory (since my notebook only has 4GB RAM). The application ended up using 24GB memory and taking very, very long, so I just cancelled it.
  • Attempt to process the file using an XmlReader object
    • This approach worked better in that it didn't use as much memory, but I still had a few problems:
      • It was taking really long because I was reading the file one line at a time.
      • Processing the file one line at a time makes it difficult to really work with the data contained in the XML, because now you have to detect the start of a tag, then the end of that tag (hopefully), then create a document from that information, read the info, and attempt to determine which parent tag it belongs to because we have multiple levels... Sounds prone to problems and errors.
      • Did I mention it takes really long to read the file one line at a time? And that is still without actually processing the line - literally just reading it.
  • Import the information using SQL Server
    • I created a stored procedure using XQuery that runs recursively, processing the <Folder> elements. This went quite well - I think better than the other two approaches - until one of the <Folder> elements ended up being rather big, producing an "An XML operation resulted an XML data type exceeding 2GB in size. Operation aborted." error. I read up about it and I don't think it's an adjustable limit.

Here are more things I think I should try:

  • Re-write my C# application to use unmanaged code
    • I don't have much experience with unmanaged code, so I'm not sure how well it will work or how to make it as unmanaged as possible.
    • I once wrote a little application that works with my webcam, receiving the image, inverting the colours, and painting it to a panel. Using normal managed code didn't work - the result was about 2 frames per second. Re-writing the colour inversion method to use unmanaged code solved the problem. That's why I thought that unmanaged code might be a solution.
  • Rather go for C++ instead of C#
    • Not sure if this is really a solution. Would it necessarily be better than C#? Better than unmanaged C#?
    • The problem here is that I haven't actually worked with C++ before, so I'll need to get to know a few things about C++ before I can really start working with it, and even then probably not very efficiently yet.

I thought I'd ask for some advice before I go any further, possibly wasting my time.

Thanks in advance for your time and assistance.

EDIT

So before I start processing the file I run through it and check the size, in an attempt to provide the user with feedback as to how long the processing might take; I made a screenshot of the calculation:

[Screenshot: 18 minutes elapsed; 1.67 million lines]

That's about 1500 lines per second; if the average line length is about 50 characters (50 bytes per line), that's 75 kilobytes per second, so an 11GB file should take about 40 hours, if my maths is correct. But this is only stepping through each line. It's not actually processing the line or doing anything with it, so when that starts, the processing rate will drop significantly.

This is the method that runs during the size calculation:

    private int _totalLines = 0;
    private bool _cancel = false; // set to true when the cancel button is clicked

    private void CalculateFileSize()
    {
        xmlStream = new StreamReader(_filePath);
        xmlReader = new XmlTextReader(xmlStream);

        while (xmlReader.Read())
        {
            if (_cancel)
                return;

            if (xmlReader.LineNumber > _totalLines)
                _totalLines = xmlReader.LineNumber;

            InterThreadHelper.ChangeText(
                lblLinesRemaining, 
                string.Format("{0} lines", _totalLines));

            string elapsed = string.Format(
                "{0}:{1}:{2}:{3}",
                timer.Elapsed.Days.ToString().PadLeft(2, '0'),
                timer.Elapsed.Hours.ToString().PadLeft(2, '0'),
                timer.Elapsed.Minutes.ToString().PadLeft(2, '0'),
                timer.Elapsed.Seconds.ToString().PadLeft(2, '0'));

            InterThreadHelper.ChangeText(lblElapsed, elapsed);

            if (_cancel)
                return;
        }

        xmlStream.Dispose();
    }

Still running, 27 minutes in :(

You can read an XML file as a logical stream of elements instead of trying to read it line-by-line and piece it back together yourself. See the code sample at the end of this article.
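
For illustration, here is a minimal sketch of that element-level approach, not taken from the linked article. It assumes each <Folder> lists its <FullPath> and <Size> before any nested <Folder>, as in the structure shown in the question. FolderRow and FolderXml are hypothetical names, and only two of the fields are captured to keep it short. The reader walks nodes, keeps a stack of the <Folder> elements that are currently open, and emits one row each time a </Folder> closes, so memory use depends on nesting depth rather than file size.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical row type: one entry per <Folder>, ready to be inserted into a table.
public sealed class FolderRow
{
    public int Depth;          // 1 for a top-level <Folder>
    public string ParentPath;  // null for a top-level <Folder>
    public string FullPath = "";
    public long SizeBytes;
}

public static class FolderXml
{
    // Streams <Folder> elements (nested ones included) without loading the document.
    public static IEnumerable<FolderRow> ReadFolders(TextReader input)
    {
        var settings = new XmlReaderSettings { IgnoreWhitespace = true, IgnoreComments = true };
        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            var open = new Stack<FolderRow>(); // folders whose end tag has not been seen yet
            string current = null;             // name of the element being read

            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        current = reader.Name;
                        if (current == "Folder")
                        {
                            if (reader.IsEmptyElement) break; // <Folder/> has no end tag to pop on
                            open.Push(new FolderRow
                            {
                                Depth = open.Count + 1,
                                ParentPath = open.Count > 0 ? open.Peek().FullPath : null
                            });
                        }
                        else if (current == "Size" && open.Count > 0)
                        {
                            long bytes;
                            if (long.TryParse(reader.GetAttribute("Bytes"), out bytes))
                                open.Peek().SizeBytes = bytes;
                        }
                        break;

                    case XmlNodeType.Text:
                    case XmlNodeType.CDATA:
                        if (current == "FullPath" && open.Count > 0)
                            open.Peek().FullPath = reader.Value.Trim();
                        break;

                    case XmlNodeType.EndElement:
                        current = null;
                        if (reader.Name == "Folder")
                            yield return open.Pop(); // folder complete: hand it to the caller
                        break;
                }
            }
        }
    }
}
```

For the real file you would call FolderXml.ReadFolders(new StreamReader(path)) and insert the rows in batches (SqlBulkCopy is one option). Note that rows arrive children-first, because a folder is only emitted once its closing tag has been read, so ParentPath rather than arrival order is what links a child to its parent.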

Also, your question has already been asked here.
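
One more observation about the CalculateFileSize method in the question: it updates two labels through InterThreadHelper for every node the reader returns, and that cross-thread UI work may well cost far more than the parsing itself. I haven't measured it on your file, so treat this as a guess. The sketch below keeps the same XmlTextReader line counting but only reports progress once per interval. LineCounter and its parameters are hypothetical names, and the report callback stands in for your label updates.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Threading;
using System.Xml;

public static class LineCounter
{
    // Counts lines like the original method, but calls report at most once per
    // reportEvery interval (plus once at the end) instead of once per XML node.
    public static int CountLines(TextReader input, TimeSpan reportEvery,
                                 Action<int> report, CancellationToken cancel)
    {
        int totalLines = 0;
        Stopwatch sinceReport = Stopwatch.StartNew();

        // The using block disposes the reader on every exit path, including cancellation.
        using (var reader = new XmlTextReader(input))
        {
            while (reader.Read())
            {
                if (cancel.IsCancellationRequested)
                    break;

                if (reader.LineNumber > totalLines)
                    totalLines = reader.LineNumber;

                if (sinceReport.Elapsed >= reportEvery)
                {
                    report(totalLines); // e.g. update lblLinesRemaining here
                    sinceReport.Restart();
                }
            }
        }

        report(totalLines); // final figure
        return totalLines;
    }
}
```

Something like LineCounter.CountLines(new StreamReader(_filePath), TimeSpan.FromMilliseconds(250), UpdateLabels, token) would refresh the UI about four times a second, where UpdateLabels is whatever method wraps your ChangeText calls.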
