Reading a very large .xml.bz2 file

Question

I'd like to parse Wikimedia's .xml.bz2 dumps without extracting the entire file or performing any XML validation:

var filename = "enwiki-20160820-pages-articles.xml.bz2";

var settings = new XmlReaderSettings()
{
    ValidationType = ValidationType.None,
    ConformanceLevel = ConformanceLevel.Auto // Fragment ?
};

using (var stream = File.Open(filename, FileMode.Open))
using (var bz2 = new BZip2InputStream(stream))
using (var xml = XmlTextReader.Create(bz2, settings))
{
    xml.ReadToFollowing("page");
    // ...
}

The BZip2InputStream works: if I wrap it in a StreamReader, I can read the XML line by line. But when I use XmlTextReader, the read fails:

System.Xml.XmlException: 'Unexpected end of file has occurred. The following elements are not closed: mediawiki. Line 58, position 1.'

The bzip stream is not at EOF. Is it possible to open an XmlTextReader on top of a BZip2 stream? Or is there some other means to do this?

Answer 1

This should work. I used a combination of XmlReader and LINQ to XML. You can parse the XElement doc as needed.

using System;
using System.Xml;
using System.Xml.Linq;

namespace ConsoleApplication29
{
    class Program
    {
        const string URL = @"https://dumps.wikimedia.org/enwiki/20160820/enwiki-20160820-abstract26.xml";

        static void Main(string[] args)
        {
            // Dispose the reader (and the underlying stream) when done.
            using (XmlReader reader = XmlReader.Create(URL))
            {
                while (!reader.EOF)
                {
                    // Skip ahead unless the reader already sits on a <doc> element.
                    if (reader.Name != "doc")
                    {
                        reader.ReadToFollowing("doc");
                    }
                    if (!reader.EOF)
                    {
                        // Load one <doc> at a time; the reader ends up just past it.
                        XElement doc = (XElement)XElement.ReadFrom(reader);
                    }
                }
            }
        }
    }
}
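
The loop above can also be wrapped in an iterator, so the caller gets one XElement at a time and never holds the whole document in memory. The following is only a sketch: the StreamElements helper, the XmlStreaming and Demo class names, and the tiny in-memory sample document are made up for illustration and are not part of the answer. It should work with any XmlReader, including one created over a decompression stream, though that combination is not tested here.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

static class XmlStreaming
{
    // Hypothetical helper: lazily yields every element called `name`.
    public static IEnumerable<XElement> StreamElements(XmlReader reader, string name)
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == name)
            {
                // ReadFrom consumes the whole element and leaves the reader
                // on the node that follows it, so no extra Read() is needed.
                yield return (XElement)XNode.ReadFrom(reader);
            }
            else
            {
                reader.Read();
            }
        }
    }
}

class Demo
{
    static void Main()
    {
        // Stand-in for the real dump; assumed sample data.
        const string xml =
            "<feed><doc><title>A</title></doc><doc><title>B</title></doc></feed>";

        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            foreach (XElement doc in XmlStreaming.StreamElements(reader, "doc"))
            {
                Console.WriteLine((string)doc.Element("title"));
            }
        }
    }
}
```

Checking NodeType before ReadFrom avoids the separate ReadToFollowing call and handles consecutive elements with no whitespace between them. Note that the real Wikimedia dumps declare a default XML namespace, so child lookups such as doc.Element("title") would need an XNamespace-qualified name there.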


Answer 1
0 2016-12-03 17:12:49
