Reading very large .xml.bz2 files
I'd like to parse Wikimedia's .xml.bz2 dumps without extracting the entire file or performing any XML validation:
var filename = "enwiki-20160820-pages-articles.xml.bz2";
var settings = new XmlReaderSettings()
{
    ValidationType = ValidationType.None,
    ConformanceLevel = ConformanceLevel.Auto // Fragment ?
};
using (var stream = File.Open(filename, FileMode.Open))
using (var bz2 = new BZip2InputStream(stream))
using (var xml = XmlTextReader.Create(bz2, settings))
{
    xml.ReadToFollowing("page");
    // ...
}
The BZip2InputStream works: if I use a StreamReader, I can read the XML line by line. But when I use XmlTextReader, it fails when I try to perform the read:
System.Xml.XmlException: 'Unexpected end of file has occurred. The following elements are not closed: mediawiki. Line 58, position 1.'
The bzip stream is not at EOF. Is it possible to open an XmlTextReader on top of a BZip2 stream? Or is there some other means to do this?
This should work. I used a combination of XmlReader and LINQ to XML. You can parse the XElement doc as needed.
using System;
using System.Xml;
using System.Xml.Linq;
namespace ConsoleApplication29
{
    class Program
    {
        const string URL = @"https://dumps.wikimedia.org/enwiki/20160820/enwiki-20160820-abstract26.xml";
        static void Main(string[] args)
        {
            // XmlReader streams the document, so the whole file is never held in memory.
            using (XmlReader reader = XmlReader.Create(URL))
            {
                while (!reader.EOF)
                {
                    // Advance to the next <doc> element unless the reader is already on one.
                    if (reader.Name != "doc")
                    {
                        reader.ReadToFollowing("doc");
                    }
                    if (!reader.EOF)
                    {
                        // ReadFrom loads just this element and leaves the reader on the node after it.
                        XElement doc = (XElement)XElement.ReadFrom(reader);
                    }
                }
            }
        }
    }
}
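To try the same reader-plus-XElement pattern without downloading anything, here is a minimal sketch that runs it against a small in-memory document. The sample XML, the ReadDocs helper and the DocStreamingSketch class are made up for illustration; they are not part of the dump or of the code above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

static class DocStreamingSketch
{
    // Hypothetical helper: yields each <doc> element in turn,
    // so only one element is materialized at a time.
    static IEnumerable<XElement> ReadDocs(TextReader input)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "doc")
                {
                    // ReadFrom consumes the element and leaves the reader
                    // positioned on the node that follows it.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }

    static void Main()
    {
        // Made-up sample data standing in for an abstract dump.
        const string sample =
            "<feed><doc><title>A</title></doc><doc><title>B</title></doc></feed>";

        foreach (XElement doc in ReadDocs(new StringReader(sample)))
        {
            Console.WriteLine((string)doc.Element("title"));
        }
    }
}
```

Because XmlReader.Create also accepts a Stream, the same helper could in principle be given a decompression stream instead of a StringReader. Whether that gets around the exception in the question depends on how the decompressor handles the file, which this sketch does not test.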