Parsing a large .bz2 file (40 GB) with lxml iterparse in Python: an error that does not appear with the uncompressed file

Question

I am trying to parse OpenStreetMap's planet.osm, compressed in bz2 format. Because it is already 41 GB, I don't want to decompress the file completely.

So I worked out how to parse parts of the planet.osm file using bz2 and lxml, with the following code:

from lxml import etree as et
from bz2 import BZ2File

path = "where/my/fileis.osm.bz2"
with BZ2File(path) as xml_file:
    parser = et.iterparse(xml_file, events=('end',))
    for events, elem in parser:

        if elem.tag == "tag":
            continue
        if elem.tag == "node":
            pass  # (do something)

        ## Do some cleaning (inside the loop, once per element)
        # Get rid of that element
        elem.clear()

        # Also eliminate now-empty references from the root node to node
        while elem.getprevious() is not None:
            del elem.getparent()[0]

This works perfectly with the Geofabrik extracts. However, when I try to parse planet-latest.osm.bz2 with the same script, I get the error:

xml.etree.XMLSyntaxError: Specification mandate value for attribute num_change, line 3684, column 60

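Here are the things I tried:

Checked the md5sum of planet-latest.osm.bz2.
Checked planet-latest.osm at the point where the bz2 script stops. There is no apparent error, and the attribute is called "num_changes", not "num_change" as indicated in the error.
I also did something silly, but the error puzzled me: I opened planet.osm.bz2 in 'rb' mode [c = BZ2File('file.osm.bz2', 'rb')] and then passed c.read() to iterparse(), which returned an error saying that (a very long string) could not be opened. Strangely, the (very long string) ends exactly at the place the "Specification mandate value" error refers to...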

Then I tried decompressing planet.osm.bz2 first:

bzcat planet.osm.bz2 > planet.osm

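and ran the parser directly on planet.osm. And... it worked! I am very puzzled by this and could not find any pointer to why it happens or how to solve it. My guess is that something goes wrong between the decompression and the parsing, but I am not sure. Please help me understand!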

Answer 1

It turns out that the problem lies with the compressed planet.osm file.

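As indicated on the OSM Wiki, the planet file is compressed as a multi-stream file, and the bz2 Python module cannot read multi-stream files (this applies to the Python 2 standard library; since Python 3.3 the built-in bz2 module does read them). However, the bz2 documentation points to an alternative module that can read such files, bz2file. I used it and it works perfectly!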

So the code should be:

from lxml import etree as et
from bz2file import BZ2File

path = "where/my/fileis.osm.bz2"
with BZ2File(path) as xml_file:
    parser = et.iterparse(xml_file, events=('end',))
    for events, elem in parser:

        if elem.tag == "tag":
            continue
        if elem.tag == "node":
            pass  # (do something)

        ## Do some cleaning (inside the loop, once per element)
        # Get rid of that element
        elem.clear()

        # Also eliminate now-empty references from the root node to node
        while elem.getprevious() is not None:
            del elem.getparent()[0]
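
To illustrate what "multi-stream" means here, the following is a minimal sketch using only the standard library. The sample XML fragments and the two helper names (decompress_first_stream, decompress_all_streams) are made up for this illustration and are not part of the original answer. A single BZ2Decompressor stops at the end of the first stream, which is presumably why the parser in the question saw a document that broke off in the middle of a line.

```python
import bz2


def decompress_first_stream(data: bytes) -> bytes:
    """Decompress only the first bz2 stream, like a reader without multi-stream support."""
    decompressor = bz2.BZ2Decompressor()
    return decompressor.decompress(data)


def decompress_all_streams(data: bytes) -> bytes:
    """Decompress every concatenated bz2 stream in turn."""
    chunks = []
    while data:
        decompressor = bz2.BZ2Decompressor()
        chunks.append(decompressor.decompress(data))
        # Bytes left over after the end of one stream belong to the next stream.
        data = decompressor.unused_data
    return b"".join(chunks)


# Hypothetical sample data: one XML document split across two bz2 streams.
first_half = b"<osm>\n<node id='1'/>\n"
second_half = b"<node id='2'/>\n</osm>\n"
multi_stream = bz2.compress(first_half) + bz2.compress(second_half)

truncated = decompress_first_stream(multi_stream)
complete = decompress_all_streams(multi_stream)
```

Here truncated holds only the first half of the document, while complete holds all of it. On Python 3.3 and later, bz2.decompress(multi_stream) also returns the full document.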

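Also, while doing some research on using the PBF format (as suggested in the comments), I stumbled upon imposm.parser, a Python module that implements a generic parser for OSM data (in PBF or XML format). You may want to have a look at it!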

Answer 2

As an alternative, you can use the output of the bzcat command (which also handles multi-stream files):

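import subprocess
from lxml import etree as et

p = subprocess.Popen(["bzcat", "data.bz2"], stdout=subprocess.PIPE)
parser = et.iterparse(p.stdout, events=('end',))  # same events as in the question
# after iterating over parser, check that p.wait() == 0 so there were no errors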
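
The same idea can be sketched as a complete function, including the exit-code check. This is only an illustration under stated assumptions: it swaps in the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages, the function name count_nodes_from_command is hypothetical, and the demo command uses the Python interpreter as a stand-in for bzcat.

```python
import subprocess
import sys
import xml.etree.ElementTree as ET


def count_nodes_from_command(cmd):
    """Run cmd, stream-parse its stdout as XML, and count <node> elements."""
    count = 0
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
        for _event, elem in ET.iterparse(proc.stdout, events=("end",)):
            if elem.tag == "node":
                count += 1
            # Free the element's contents once it has been handled.
            elem.clear()
        # returncode is only set once the process has been waited for.
        returncode = proc.wait()
    if returncode != 0:
        raise RuntimeError(f"{cmd[0]} exited with status {returncode}")
    return count


# Stand-in for ["bzcat", "data.bz2"]: a child process that prints a tiny XML document.
demo_cmd = [sys.executable, "-c", "print('<osm><node/><node/><way/></osm>')"]
node_count = count_nodes_from_command(demo_cmd)
```

With a real file, the call would be count_nodes_from_command(["bzcat", "data.bz2"]), assuming bzcat is on the PATH.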

2 answers

Answer 1 (5, accepted): 2015-04-03 14:32:44
Answer 2 (2): 2015-04-03 14:41:07
