
Parsing weirdly structured XML with lxml

I have a number of XML files that I need to parse. I've written some code that works, but it is ugly, and I'd like to get some advice from people more experienced with XML than I am.

First of all, I might be using some terms in the wrong context, because my experience with XML is limited. By element, unless specified otherwise, I mean something like this:

 <root>
  <element>
   ...
  </element>
  <element>
   ...
  </element>
 </root>  

Anyway, each file consists of a number of elements, each with a number of child elements (obviously). What stumps me is that the relevant values need to be accessed in four different ways:

1) Node text:

<tag>value</tag>

2) Attribute:

<tag attribute="value"></tag>

3) A value "hidden" inside a tag ("true" in this case):

<tag><boolean.true/></tag>

4) Values inside tags of the same name ("tagA"), but with parent tags of different names ("tag1" and "tag2"), all within the same element. "tagA" is of no use to me; instead I will be looking for "tag1" and "tag2".

<element>
   <tag1><tagA>value</tagA></tag1>
   <tag2><tagA>value</tagA></tag2>
</element>

At the moment I have a dictionary with each file as a key. The values are dictionaries with the keys "attribute", "node text", "tag" and "parent element".

Example:

{'file1.xml': {'attributes': {'Person': 'Id', 'Car': 'Color'},
               'node text': ['Name', 'Address']},
}

Where "Person" and "Car" are tags, and "Id" and "Color" are attribute names.

This makes it easy to iterate over all elements and inspect each tag, and if there is a match in the dictionary (if elem.tag in dict['file1.xml']['attributes']), extract the value.

So as I said, the code works, but I don't like my solution. Also, not all the elements have all the child elements (for example, a Person might not own a car, in which case that tag will be missing altogether), and I need to assign those values "None". Right now I get all the tags that should exist for every element in each file, turn them into a set, then check the difference between those and the set of tags that I've actually extracted values from for that element. Again, the code is pretty ugly.

Hopefully this mess makes some sense.

edit:

I used JF Sebastian's suggestion of storing the xpath to each value in a dictionary, with the field name as the key and the xpath as the value.

You could streamline your input code by using xpath expressions relative to your element instead of a complex data structure, e.g., for cases #1-4:

  1. tag/text()
  2. tag/@attribute
  3. name(DTBoolean/*[1])
  4. (tag1|tag2)/*/text()
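
As a rough illustration of how these four kinds of expressions behave in lxml, here is a minimal sketch. The sample document and its tag names (`name`, `car`, `flag`) are made up for the example; only `tag1`, `tag2`, `tagA` and `boolean.true` come from the question.

```python
from lxml import etree

# Hypothetical sample combining the four cases from the question.
SAMPLE = """
<root>
  <element>
    <name>Alice</name>
    <car color="red"/>
    <flag><boolean.true/></flag>
    <tag1><tagA>first</tagA></tag1>
    <tag2><tagA>second</tagA></tag2>
  </element>
</root>
"""

root = etree.fromstring(SAMPLE)
element = root.find("element")

# Case 1: node text, returned as a list of strings
node_text = element.xpath("name/text()")
# Case 2: attribute value, also returned as a list of strings
attribute = element.xpath("car/@color")
# Case 3: name of the first child of <flag>, returned as a single string
hidden = element.xpath("name(flag/*[1])")
# Case 4: text of the children of tag1 and tag2, in document order
nested = element.xpath("(tag1|tag2)/*/text()")

print(node_text, attribute, hidden, nested)
```

Node-set expressions (cases 1, 2 and 4) return lists, which are empty when nothing matches, while `name(...)` returns a string, which is empty when there is no such child.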

Which output data structure to use depends on how you would like to use it in your code later. You could start with a structure that is most convenient for your current code, and evolve it into a more general solution later, when you better understand the requirements.

I output it to csv, where each element is one row in the csv file. ... I use a defaultdict to store the elements and then store those in a list before I output them to csv.

You could use an ordinary dict and csv.DictWriter(fieldnames=xpathdict.keys()):

# for each element
row_dict = dict.fromkeys(xpathdict.keys())
...
# for each key 
row_dict[key] = element.xpath(xpathdict[key]) or None
...
dictwriter.writerow(row_dict)

Where xpathdict is a mapping between field names and corresponding xpath expressions. For generality you could store function objects f(element) -> csv field instead of, or in addition to, xpath expressions.
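
To make the outline above concrete, here is a minimal sketch of one way it could be filled in. The sample XML, the field names, and the helpers `first_or_none` and `extract` are assumptions made for illustration; the output goes to an in-memory buffer rather than a real file.

```python
import csv
import io

from lxml import etree

# Hypothetical input: the second element has no <Car> child.
SAMPLE = """
<root>
  <element>
    <Name>Alice</Name>
    <Car Color="red"/>
  </element>
  <element>
    <Name>Bob</Name>
  </element>
</root>
"""

def first_or_none(result):
    """Return the first item of an XPath node-set result, or None if it is empty."""
    return result[0] if result else None

# Field name -> XPath expression (str) or a function f(element) -> csv field.
xpathdict = {
    "name": "Name/text()",
    "car_color": "Car/@Color",
    "has_car": lambda el: el.find("Car") is not None,
}

def extract(element, spec):
    """Apply one xpathdict entry to an element."""
    if callable(spec):
        return spec(element)
    return first_or_none(element.xpath(spec))

root = etree.fromstring(SAMPLE)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(xpathdict))
writer.writeheader()
for element in root.findall("element"):
    # Missing values come back as None, which csv writes as an empty field.
    row = {field: extract(element, spec) for field, spec in xpathdict.items()}
    writer.writerow(row)

csv_text = buffer.getvalue()
print(csv_text)
```

Because every field is looked up for every element, no separate set-difference step is needed to find the tags that are missing.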

I don't think #3 is legal XML, because there's no opening tag associated with it, and even if it's somewhere else, it wouldn't be properly nested in that example. The expression will be interpreted as a closing tag because of the < character.

I'm assuming that you'd want to take something like this:

<root>
  <element>
    <text_attribute>Some Text</text_attribute>
    <attribute var="blah"/>
    <bool_attribute><boolean.true/></bool_attribute>
  </element>
  <element>
    <text_attribute>Some more Text</text_attribute>
    <attribute var="blah again"/>
    <bool_attribute><boolean.false/></bool_attribute>
  </element>
</root>

And get something like this:

[
   { "text_attribute":"Some Text", "attribute":"blah", "bool_attribute":True },
   { "text_attribute":"Some more Text", "attribute":"blah again", "bool_attribute":False }
]

To do this I'd do something like this (untested):

from lxml import etree

# Helper function so we can extract a default from an xpath result if empty
def get_first(x, default_value):
  if len(x) > 0:
    return x[0]
  return default_value

# Parse one element
def process_element(e):
  retval = {}
  retval['text_attribute'] = get_first(e.xpath("text_attribute/text()"), "default text")
  retval['attribute'] = get_first(e.xpath("attribute/@var"), "default attribute")
  # True if <boolean.true/> is present inside bool_attribute, False otherwise
  retval['bool_attribute'] = bool(e.xpath("bool_attribute/boolean.true"))
  return retval

# Parse all the elements ('file1.xml' is a placeholder file name)
xml = etree.parse('file1.xml')
elements = []
for e in xml.xpath('/root/element'):
  elements.append(process_element(e))
