Parsing a large .bz2 file (40 GB) with lxml iterparse in Python: error that does not appear with the uncompressed file
I am trying to parse OpenStreetMap's planet.osm, compressed in bz2 format. Because it is already 41 GB, I don't want to decompress the file completely.
So I figured out how to parse portions of the planet.osm file using bz2 and lxml, with the following code:
from lxml import etree as et
from bz2 import BZ2File

path = "where/my/fileis.osm.bz2"
with BZ2File(path) as xml_file:
    parser = et.iterparse(xml_file, events=('end',))
    for events, elem in parser:
        if elem.tag == "tag":
            continue
        if elem.tag == "node":
            pass  # (do something)

        ## Do some cleaning
        # Get rid of that element
        elem.clear()

        # Also eliminate now-empty references from the root node to node
        while elem.getprevious() is not None:
            del elem.getparent()[0]
which works perfectly with the Geofabrik extracts. However, when I try to parse planet-latest.osm.bz2 with the same script, I get the error:
xml.etree.XMLSyntaxError: Specification mandate value for attribute num_change, line 3684, column 60
Here are the things I tried:
Then I tried decompressing planet.osm.bz2 first, using a simple
bzcat planet.osm.bz2 > planet.osm
and ran the parser directly on planet.osm. And... it worked!
I am very puzzled by this, and could not find any pointer as to why this happens or how to solve it. My guess is that something goes wrong between the decompression and the parsing, but I am not sure. Please help me understand!
It turns out that the problem is with the compressed planet.osm file.
As indicated on the OSM Wiki, the planet file is compressed as a multistream file, and the bz2 Python module cannot read multistream files (this applies to Python 2; the standard bz2 module gained multistream support in Python 3.3). However, the bz2 documentation points to an alternative module that can read such files, bz2file. I used it and it works perfectly!
So the code should read:
from lxml import etree as et
from bz2file import BZ2File

path = "where/my/fileis.osm.bz2"
with BZ2File(path) as xml_file:
    parser = et.iterparse(xml_file, events=('end',))
    for events, elem in parser:
        if elem.tag == "tag":
            continue
        if elem.tag == "node":
            pass  # (do something)

        ## Do some cleaning
        # Get rid of that element
        elem.clear()

        # Also eliminate now-empty references from the root node to node
        while elem.getprevious() is not None:
            del elem.getparent()[0]
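To make "multistream" concrete, here is a minimal sketch using only the standard library on Python 3. The XML fragments and the helper name `decompress_all` are made up for illustration. It shows that a single-stream decoder stops at the end of the first stream, which would leave the XML parser with truncated input and plausibly explains a syntax error in the middle of the file.

```python
import bz2

# Two independent bz2 streams concatenated: this is what "multistream" means.
first = bz2.compress(b"<osm><node id='1'/>")
second = bz2.compress(b"<node id='2'/></osm>")
multistream = first + second

# A single-stream decoder stops at the end of the first stream.
single = bz2.BZ2Decompressor()
head = single.decompress(multistream)
leftover = single.unused_data  # the second stream, still compressed


def decompress_all(data):
    """Hypothetical helper: decode every stream in `data`, one after another."""
    parts = []
    while data:
        decoder = bz2.BZ2Decompressor()
        parts.append(decoder.decompress(data))
        data = decoder.unused_data  # whatever follows the stream just decoded
    return b"".join(parts)


full = decompress_all(multistream)
```

Here `head` holds only the first fragment, while `full` holds the whole document. This is only meant to illustrate the failure mode; for a 40 GB file you would stream through a multistream-aware file object such as bz2file's `BZ2File` rather than hold the data in memory.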
Also, while doing some research on the PBF format (as advised in the comments), I stumbled upon imposm.parser, a Python module that implements a generic parser for OSM data (in PBF or XML format). You may want to have a look at it!
As an alternative, you can use the output of the bzcat command (which can handle multistream files too):
import subprocess
from lxml import etree as et

p = subprocess.Popen(["bzcat", "data.bz2"], stdout=subprocess.PIPE)
parser = et.iterparse(p.stdout, ...)
# at the end, call p.wait() and check that p.returncode == 0, so there were no errors
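A slightly fuller sketch of the same idea, with the exit-status check spelled out. Two things here are my own assumptions, not part of the answer above: the function name `count_elements` is hypothetical, and I use the standard library's `xml.etree.ElementTree` as a stand-in for lxml so the example has no third-party dependency. The decompressor command is passed in as a parameter.

```python
import subprocess
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)


def count_elements(command, tag):
    """Run `command`, parse its stdout as XML, and count elements named `tag`."""
    count = 0
    proc = subprocess.Popen(command, stdout=subprocess.PIPE)
    try:
        # iterparse reads from the pipe in chunks, so nothing is fully buffered.
        for _event, elem in ET.iterparse(proc.stdout, events=("end",)):
            if elem.tag == tag:
                count += 1
            elem.clear()  # drop the element's children, text and attributes
    finally:
        proc.stdout.close()
        returncode = proc.wait()  # returncode is only set once wait() returns
    if returncode != 0:
        raise RuntimeError("decompressor exited with status %d" % returncode)
    return count
```

With bzcat installed, this would be called as `count_elements(["bzcat", "data.bz2"], "node")`. Note that with the standard library the cleared elements stay attached to the root as empty shells, so for a planet-sized file the lxml cleanup idiom shown earlier (`getprevious()` / `getparent()`) is still the better fit.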