
Parsing a large .bz2 file (40 GB) with lxml iterparse in Python: an error that does not appear with the uncompressed file

I am trying to parse OpenStreetMap's planet.osm, compressed in bz2 format. Because it is already 41 GB, I don't want to decompress the file completely.

So I figured out how to parse portions of the planet.osm file using bz2 and lxml, with the following code:

from lxml import etree as et
from bz2 import BZ2File

path = "where/my/fileis.osm.bz2"
with BZ2File(path) as xml_file:
    parser = et.iterparse(xml_file, events=('end',))
    for event, elem in parser:

        if elem.tag == "tag":
            continue
        if elem.tag == "node":
            pass  # (do something)

        ## Do some cleaning
        # Get rid of that element
        elem.clear()

        # Also eliminate now-empty references from the root node to node
        while elem.getprevious() is not None:
            del elem.getparent()[0]

This works perfectly with the Geofabrik extracts. However, when I try to parse planet-latest.osm.bz2 with the same script, I get the error:

xml.etree.XMLSyntaxError: Specification mandate value for attribute num_change, line 3684, column 60

Here are the things I tried:

  • Checked the md5sum of planet-latest.osm.bz2.
  • Checked planet-latest.osm at the point where the script with bz2 stops. There is no apparent error, and the attribute is called "num_changes", not "num_change" as indicated in the error.
  • Also I did something stupid, but the error puzzled me: I opened planet-latest.osm.bz2 in mode 'rb' [c = BZ2File('file.osm.bz2', 'rb')] and then passed c.read() to iterparse(), which returned an error saying (very long string) cannot be opened. Strangely, (very long string) ends right where the "Specification mandate value" error points...

Then I tried decompressing planet.osm.bz2 first, using a simple

bzcat planet.osm.bz2 > planet.osm

Then I ran the parser directly on planet.osm. And... it worked! I am very puzzled by this, and could not find any pointer to why this happens or how to solve it. My guess is that something goes wrong between the decompression and the parsing, but I am not sure. Please help me understand!

It turns out that the problem is with the compressed planet.osm file.

As indicated on the OSM Wiki, the planet file is compressed as a multistream file, and the bz2 Python module cannot read multistream files (this applies to Python 2 and to Python 3 before 3.3, where multistream support was added). However, the bz2 documentation points to an alternative module that can read such files, bz2file. I used it and it works perfectly!
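To see why a reader that only understands a single stream fails partway through the file, here is a minimal sketch using only the standard library on Python 3.3 or later. The sample XML fragments and variable names are made up for illustration. A single BZ2Decompressor object stops at the end of the first stream, which is consistent with the XML parser seeing the document cut off in the middle.

```python
import bz2

# Two independently compressed streams written back to back,
# which is how a multistream .bz2 file is laid out.
first = bz2.compress(b"<osm><node id='1'/>")
second = bz2.compress(b"<node id='2'/></osm>")
multistream = first + second

# One decompressor object handles exactly one stream and then stops.
decompressor = bz2.BZ2Decompressor()
head = decompressor.decompress(multistream)
print(head)                           # content of the first stream only
print(decompressor.eof)               # the decompressor considers itself finished
print(len(decompressor.unused_data))  # bytes of the second stream left unread

# bz2.decompress() on Python 3.3+ continues through every stream.
whole = bz2.decompress(multistream)
print(whole)
```

Anything that feeds only `head` to an XML parser hands it a truncated document, so the parser reports a syntax error at whatever point the first stream happened to end.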

So the code should read:

from lxml import etree as et
from bz2file import BZ2File

path = "where/my/fileis.osm.bz2"
with BZ2File(path) as xml_file:
    parser = et.iterparse(xml_file, events=('end',))
    for event, elem in parser:

        if elem.tag == "tag":
            continue
        if elem.tag == "node":
            pass  # (do something)

        ## Do some cleaning
        # Get rid of that element
        elem.clear()

        # Also eliminate now-empty references from the root node to node
        while elem.getprevious() is not None:
            del elem.getparent()[0]
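On Python 3.3 and later the standard bz2 module reads multistream files itself, so the bz2file backport may not be needed there. The following is a minimal sketch under that assumption. It uses xml.etree.ElementTree from the standard library as a stand-in for lxml so that it runs without third-party packages; the function name count_nodes and the sample data are hypothetical.

```python
import bz2
import os
import tempfile
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch


def count_nodes(path):
    """Stream a (possibly multistream) .osm.bz2 file and count <node> elements."""
    count = 0
    with bz2.open(path, "rb") as xml_file:
        for _event, elem in ET.iterparse(xml_file, events=("end",)):
            if elem.tag == "node":
                count += 1
            # Drop the element's attributes and children once it is handled.
            elem.clear()
    return count


# Build a tiny multistream file: the XML is split across two bz2 streams,
# with the split falling in the middle of an attribute name.
parts = [b"<osm><node id='1'/><node id", b"='2'/><way id='3'/></osm>"]
with tempfile.NamedTemporaryFile(suffix=".osm.bz2", delete=False) as tmp:
    for part in parts:
        tmp.write(bz2.compress(part))
    sample_path = tmp.name

node_count = count_nodes(sample_path)
os.remove(sample_path)
print(node_count)
```

Note that the standard-library ElementTree has no getprevious() or getparent(), so this sketch only clears each element; the emptied elements stay attached to the root. For a file the size of the planet dump, the lxml cleanup shown above is the more memory-friendly choice.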

Also, while doing some research on the PBF format (as advised in the comments), I stumbled upon imposm.parser, a Python module that implements a generic parser for OSM data (in PBF or XML format). You may want to have a look at it!

As an alternative, you can use the output of the bzcat command (which can handle multistream files too):

import subprocess
from lxml import etree as et

p = subprocess.Popen(["bzcat", "data.bz2"], stdout=subprocess.PIPE)
parser = et.iterparse(p.stdout, ...)
# after consuming the parser, check that p.wait() == 0, i.e. bzcat reported no errors
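The bzcat approach can be wrapped into a small function that also checks the exit status. This is only a sketch: it assumes a bzcat executable is on the PATH, the function name count_tag_via_bzcat is hypothetical, and xml.etree.ElementTree again stands in for lxml so the example has no third-party dependencies.

```python
import subprocess
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch


def count_tag_via_bzcat(path, tag):
    """Count elements named `tag` in a .bz2 XML file decompressed by bzcat."""
    proc = subprocess.Popen(["bzcat", path], stdout=subprocess.PIPE)
    count = 0
    # Closing the pipe when done lets bzcat terminate even if parsing fails.
    with proc.stdout:
        for _event, elem in ET.iterparse(proc.stdout, events=("end",)):
            if elem.tag == tag:
                count += 1
            elem.clear()
    # returncode is only set after wait(); non-zero means bzcat hit an error.
    if proc.wait() != 0:
        raise RuntimeError("bzcat exited with status %d" % proc.returncode)
    return count
```

Calling count_tag_via_bzcat("data.bz2", "node") would then stream the file through bzcat and the parser without writing the decompressed XML to disk.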
