
Parsing blank XML tags with LXML and Python

When parsing XML documents in the following format:

<Car>
    <Color>Blue</Color>
    <Make>Chevy</Make>
    <Model>Camaro</Model>
</Car>

I use the following code:

carData = element.xpath('//Root/Foo/Bar/Car/node()[text()]')
parsedCarData = [{field.tag: field.text for field in carData} for action in carData]
print parsedCarData[0]['Color'] #Blue

This code will not work if a tag is empty, such as:

<Car>
    <Color>Blue</Color>
    <Make>Chevy</Make>
    <Model/>
</Car>

Using the same code as above:

carData = element.xpath('//Root/Foo/Bar/Car/node()[text()]')
parsedCarData = [{field.tag: field.text for field in carData} for action in carData]
print parsedCarData[0]['Model'] #Key Error

How would I parse this blank tag?

You're putting in a [text()] filter, which explicitly asks only for elements that have text nodes in them... and then you're unhappy when it doesn't give you elements without text nodes?

Leave that filter out, and you'll get your Model element:

>>> import lxml.etree
>>> s='''
... <root>
...   <Car>
...     <Color>Blue</Color>
...     <Make>Chevy</Make>
...     <Model/>
...   </Car>
... </root>'''
>>> e = lxml.etree.fromstring(s)
>>> # 'Car/*' selects child elements only; 'Car/node()' would also return the whitespace text nodes
>>> carData = e.xpath('Car/*')
>>> carData
[<Element Color at 0x23a5460>, <Element Make at 0x23a54b0>, <Element Model at 0x23a5500>]
>>> dict((el.tag, el.text) for el in carData)
{'Color': 'Blue', 'Make': 'Chevy', 'Model': None}
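If the document holds more than one Car, you probably want one dict per Car rather than one dict for everything the XPath matched. Here is a minimal sketch of that idea, not a definitive implementation. The helper name cars_to_dicts, its default parameter and the second sample Car are made up for illustration. It sticks to the ElementTree-style calls (fromstring, iter, iterating over an element's children), so the stdlib fallback import behaves the same way for this input if lxml is not installed.

```python
try:
    from lxml import etree  # the library used in this thread
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback; same API for the calls below

# Sample input; the second <Car> is invented for the example.
XML = b"""<root>
  <Car><Color>Blue</Color><Make>Chevy</Make><Model/></Car>
  <Car><Color>Red</Color><Make>Ford</Make><Model>Mustang</Model></Car>
</root>"""


def cars_to_dicts(root, default=None):
    """Return one {tag: text} dict per <Car>; empty tags map to `default`."""
    cars = []
    for car in root.iter('Car'):
        fields = {}
        for child in car:  # iterates child elements, never whitespace text
            fields[child.tag] = child.text if child.text is not None else default
        cars.append(fields)
    return cars


root = etree.fromstring(XML)
parsed = cars_to_dicts(root, default='')
print(parsed[0]['Model'] == '')  # the empty tag is present, with the default value
print(parsed[1]['Model'])
```

Passing default='' is only one choice; leaving it as None keeps an empty tag distinguishable from a tag that contains text.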

That said, if your immediate goal is to iterate over the nodes in the tree, you might consider using lxml.etree.iterparse() instead. It hands you elements as they are parsed, so, provided you clear elements once you are done with them, you can avoid holding a full tree in memory. That can be much more efficient than building the whole tree and then iterating over it with XPath. (Think SAX, but without the insane and painful API.)

An implementation with iterparse could look like this:

import io
import lxml.etree

def get_cars(infile):
    in_car = False
    current_car = {}
    for (event, element) in lxml.etree.iterparse(infile, events=('start', 'end')):
        if event == 'start':
            if element.tag == 'Car':
                # A new <Car> begins: start collecting its fields
                in_car = True
                current_car = {}
            continue
        if not in_car:
            continue
        if element.tag == 'Car':
            # </Car> reached: stop collecting and emit the finished record
            in_car = False
            yield current_car
            continue
        # 'end' event for a child of <Car>; text is None for an empty tag
        current_car[element.tag] = element.text

for car in get_cars(infile=io.BytesIO(b'''<root><Car><Color>Blue</Color><Make>Chevy</Make><Model/></Car></root>''')):
    print(car)

...it's more code, but (if we weren't using an in-memory buffer for the example) it could process a file much larger than could fit in memory.
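To make the memory point concrete, here is a rough sketch that streams from a file on disk and calls clear() on each Car once its fields have been copied out. The function name stream_cars and the temporary file are assumptions for the demo. The import falls back to the standard library, whose iterparse accepts the same arguments used here.

```python
import os
import tempfile

try:
    from lxml import etree  # the library used in this thread
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback with a compatible iterparse


def stream_cars(path):
    """Yield one dict per <Car>, discarding each element's content after use."""
    for event, element in etree.iterparse(path, events=('end',)):
        if element.tag != 'Car':
            continue
        # At the 'end' event every child of this <Car> has been parsed.
        car = {child.tag: child.text for child in element}
        # Drop the children and text of this <Car> so they can be freed.
        element.clear()
        yield car


# Demo: write a small file, then stream it back.
fd, path = tempfile.mkstemp(suffix='.xml')
try:
    with os.fdopen(fd, 'wb') as handle:
        handle.write(b'<root><Car><Color>Blue</Color><Make>Chevy</Make><Model/></Car></root>')
    cars = list(stream_cars(path))
finally:
    os.remove(path)

print(cars)
```

Listening only for 'end' events keeps the loop short, at the cost of not knowing when a Car starts. The emptied Car elements still hang off the root, so for very large inputs you may want to prune those too.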

I don't know whether there is a better solution within lxml itself, but you could use .get():

print parsedCarData[0].get('Model', '')
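One thing to watch with .get(): the default is only used when the key is missing. If the [text()] filter is removed, 'Model' is present with the value None, and .get('Model', '') returns None rather than ''. A small sketch of the difference, using hand-written dicts in place of real parser output and a made-up helper named field:

```python
# Stand-ins for parser output (written by hand for this example).
car_filtered = {'Color': 'Blue', 'Make': 'Chevy'}                   # <Model/> was filtered out
car_unfiltered = {'Color': 'Blue', 'Make': 'Chevy', 'Model': None}  # <Model/> kept, text is None


def field(car, name, default=''):
    """Return car[name], or `default` if the key is missing or its value is None."""
    value = car.get(name)
    return default if value is None else value


print(repr(car_filtered.get('Model', '')))    # key missing, so the default is used
print(repr(car_unfiltered.get('Model', '')))  # key present, so the stored None comes back
print(repr(field(car_unfiltered, 'Model')))   # helper covers both cases
```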

I would catch the exception:

try:
    print parsedCarData[0]['Model']
except KeyError:
    print 'No model specified'

Exceptions in Python aren't exceptional in the same sense as in other languages, where they are more strictly linked to error conditions. Instead, they are frequently part of the normal usage of modules, by design. An iterator raises StopIteration to signal that it has reached the end of the iteration, for example.
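As a small illustration of exceptions as ordinary control flow, here is a sketch. The helper name first_model and the sample data are invented for the example. It wraps the KeyError handling in a function and then drives an iterator by hand until StopIteration is raised.

```python
def first_model(cars):
    """Return the first car's Model, or a fallback message if it is unavailable."""
    try:
        return cars[0]['Model']
    except (IndexError, KeyError):
        # No cars at all, or the first car has no 'Model' key.
        return 'No model specified'


print(first_model([{'Color': 'Blue', 'Make': 'Chevy'}]))
print(first_model([{'Model': 'Camaro'}]))

# Driving an iterator manually: StopIteration marks the normal end, not an error.
tags = iter(['Color', 'Make'])
seen = []
while True:
    try:
        seen.append(next(tags))
    except StopIteration:
        break
print(seen)
```

A for loop performs the same StopIteration handling internally, which is why you rarely see it spelled out.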

Edit: If you're sure only this item can be empty, @CharlesDuffy has it right that using get() is probably better. But in general I'd consider using exceptions, since they make it easy to handle a variety of exceptional output.

Solution: use a try/except block to catch the KeyError.
