Process very large XML file
I need to process an XML file with the following structure:
<FolderSizes>
<Version></Version>
<DateTime Un=""></DateTime>
<Summary>
<TotalSize Bytes=""></TotalSize>
<TotalAllocated Bytes=""></TotalAllocated>
<TotalAvgFileSize Bytes=""></TotalAvgFileSize>
<TotalFolders Un=""></TotalFolders>
<TotalFiles Un=""></TotalFiles>
</Summary>
<DiskSpaceInfo>
<Drive Type="" Total="" TotalBytes="" Free="" FreeBytes="" Used=""
UsedBytes=""><![CDATA[ ]]></Drive>
</DiskSpaceInfo>
<Folder ScanState="">
<FullPath Name=""><![CDATA[ ]]></FullPath>
<Attribs Int=""></Attribs>
<Size Bytes=""></Size>
<Allocated Bytes=""></Allocated>
<AvgFileSz Bytes=""></AvgFileSz>
<Folders Un=""></Folders>
<Files Un=""></Files>
<Depth Un=""></Depth>
<Created Un=""></Created>
<Accessed Un=""></Accessed>
<LastMod Un=""></LastMod>
<CreatedCalc Un=""></CreatedCalc>
<AccessedCalc Un=""></AccessedCalc>
<LastModCalc Un=""></LastModCalc>
<Perc><![CDATA[ ]]></Perc>
<Owner><![CDATA[ ]]></Owner>
<!-- Special element; see paragraph below -->
<Folder></Folder>
</Folder>
</FolderSizes>
The <Folder> element is special in that it repeats within the <FolderSizes> element but can also appear within itself; I reckon up to about 5 levels.
The problem is that the file is really big, at a whopping 11GB, so I'm having difficulty processing it - I have experience with XML documents, but nothing on this scale.
What I would like to do is import the information into a SQL database, because then I will be able to process it in any way necessary without having to concern myself with this immense, impractical file.
Here are the things I have tried:
<Folder> elements. This went quite well - I think better than the other two approaches - until one of the <Folder> elements ended up being rather big, producing an "An XML operation resulted an XML data type exceeding 2GB in size. Operation aborted." error.
Here are more things I think I should try:
I thought I'd ask for some advice before I go any further, possibly wasting my time.
Thanks in advance for your time and assistance.
EDIT
So before I start processing the file, I run through it and check the size in an attempt to give the user feedback on how long the processing might take; I made a screenshot of the calculation:
That's about 1500 lines per second. If the average line length is about 50 characters, that's 50 bytes per line, or 75 kilobytes per second, so an 11GB file should take about 40 hours, if my maths is correct.
But this is only stepping through each line. It's not actually processing the line or doing anything with it, so when that starts, the processing rate drops significantly.
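To double-check the arithmetic above, here is a minimal sketch in C#. The class and method names (ScanEstimate, Estimate) are made up for illustration, and the inputs are just the rough figures quoted above (11GB taken as 11,000,000,000 bytes, 1500 lines per second, 50 bytes per line), not measurements.

```csharp
using System;

public static class ScanEstimate
{
    // Estimated time to step through a file at a given line rate.
    // All three inputs are assumptions taken from the figures quoted in the post.
    public static TimeSpan Estimate(long fileBytes, double linesPerSecond, double avgBytesPerLine)
    {
        double bytesPerSecond = linesPerSecond * avgBytesPerLine; // 1500 * 50 = 75,000 bytes/s
        return TimeSpan.FromSeconds(fileBytes / bytesPerSecond);
    }

    public static void Main()
    {
        TimeSpan t = Estimate(11_000_000_000L, 1500, 50);
        Console.WriteLine($"{t.TotalHours:F1} hours");
    }
}
```

With those inputs the estimate comes out at a little under 41 hours, which agrees with the "about 40 hours" figure.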
This is the method that runs during the size calculation:
private int _totalLines = 0;
private volatile bool _cancel = false; // set to true (from the UI thread) when the cancel button is clicked

private void CalculateFileSize()
{
    // The using blocks dispose the reader and the stream even on an early return.
    using (var xmlStream = new StreamReader(_filePath))
    using (var xmlReader = new XmlTextReader(xmlStream))
    {
        while (xmlReader.Read())
        {
            if (_cancel)
                return;

            if (xmlReader.LineNumber > _totalLines)
                _totalLines = xmlReader.LineNumber;

            InterThreadHelper.ChangeText(
                lblLinesRemaining,
                string.Format("{0} lines", _totalLines));

            // timer is assumed to be a Stopwatch field started elsewhere in the class.
            string elapsed = string.Format(
                "{0:00}:{1:00}:{2:00}:{3:00}",
                timer.Elapsed.Days,
                timer.Elapsed.Hours,
                timer.Elapsed.Minutes,
                timer.Elapsed.Seconds);
            InterThreadHelper.ChangeText(lblElapsed, elapsed);
        }
    }
}
Still running, 27 minutes in :(
You can read an XML file as a logical stream of elements instead of trying to read it line by line and piece it back together yourself. See the code sample at the end of this article.
Also, your question has already been asked here.
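As a rough illustration of reading the file as a stream of elements (this is not the code from the linked article, just a minimal sketch), something like the following could work with System.Xml.XmlReader. The names FolderRecord, FolderStreamer and ReadFolders are hypothetical. It assumes, based on the structure shown in the question, that the path is the CDATA content of <FullPath> and that the size is in the Bytes attribute of <Size>. A stack tracks nested <Folder> elements, so only the currently open folders are held in memory.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical flat record for one <Folder>; only two of its fields are captured here.
public sealed class FolderRecord
{
    public string FullPath = "";
    public long SizeBytes;
    public int Depth; // 1 = directly under <FolderSizes>
}

public static class FolderStreamer
{
    // Streams <Folder> elements one at a time; a record is yielded when its end tag is reached.
    public static IEnumerable<FolderRecord> ReadFolders(TextReader input)
    {
        var settings = new XmlReaderSettings { IgnoreWhitespace = true, IgnoreComments = true };
        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            var open = new Stack<FolderRecord>(); // folders whose end tag has not been seen yet
            string current = null;                // name of the element we are inside, if any

            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        if (reader.Name == "Folder")
                        {
                            var rec = new FolderRecord { Depth = open.Count + 1 };
                            if (reader.IsEmptyElement) yield return rec; // <Folder/> has no end tag
                            else open.Push(rec);
                            current = null;
                            break;
                        }
                        if (reader.Name == "Size" && open.Count > 0 &&
                            long.TryParse(reader.GetAttribute("Bytes"), out long bytes))
                        {
                            open.Peek().SizeBytes = bytes;
                        }
                        current = reader.IsEmptyElement ? null : reader.Name;
                        break;

                    case XmlNodeType.Text:
                    case XmlNodeType.CDATA:
                        if (current == "FullPath" && open.Count > 0)
                            open.Peek().FullPath = reader.Value;
                        break;

                    case XmlNodeType.EndElement:
                        if (reader.Name == "Folder" && open.Count > 0)
                            yield return open.Pop();
                        current = null;
                        break;
                }
            }
        }
    }

    public static void Main()
    {
        const string sample =
            "<FolderSizes>" +
            "<Folder><FullPath><![CDATA[C:\\a]]></FullPath><Size Bytes=\"300\"></Size>" +
            "<Folder><FullPath><![CDATA[C:\\a\\b]]></FullPath><Size Bytes=\"100\"></Size></Folder>" +
            "</Folder></FolderSizes>";

        foreach (FolderRecord f in ReadFolders(new StringReader(sample)))
            Console.WriteLine($"{f.Depth} {f.FullPath} {f.SizeBytes}");
    }
}
```

For the real file you would pass a StreamReader over the 11GB file instead of the StringReader. Note that nested folders are yielded before their parents, because a record is only emitted at its end tag; each yielded record could then be batched into the SQL database rather than kept in memory.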