lxml eTree iterparse depth
I am trying to parse XML in the following format:

    <label>
        <name></name>
        <sometag></sometag>
        <sublabels>
            <label></label>
            <label></label>
        </sublabels>
    </label>

Parsing it with:

    import gzip
    from lxml import etree

    # f is the path to the gzipped XML file
    for event, element in etree.iterparse(gzip.GzipFile(f), events=('end',), tag='label'):
        if event == 'end':
            name = element.xpath('name/text()')

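yields an empty name variable, because the nested labels also match tag='label' and have no <name> child:

    <sublabels>
        <label></label>
        <label></label>
    </sublabels>

Question:
Other than checking whether the name is empty for each nested label, is it possible to set a depth for iterparse or to ignore the nested labels?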

This worked for me, and was inspired by a previous answer:

    name = None
    level = 0
    for event, element in etree.iterparse(gzip.GzipFile(f), events=('start', 'end'), tag='label'):
        # Update the current nesting level of <label> elements
        if event == 'start':
            level += 1
        elif event == 'end':
            level -= 1
        # Level 0 is only reached on the 'end' event of a top-level label
        if level == 0:
            name = element.xpath('name/text()')

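
To make the level counter easier to try out, here is a minimal self-contained sketch of the same idea. The sample document, the name values, and the helper name top_level_names are made up for illustration; the data is read from memory instead of a gzipped file.

```python
import io
from lxml import etree

# Hypothetical sample data mirroring the structure in the question.
SAMPLE = b"""<label>
  <name>Top</name>
  <sometag></sometag>
  <sublabels>
    <label><name>Child A</name></label>
    <label></label>
  </sublabels>
</label>"""


def top_level_names(source):
    """Collect <name> texts of <label> elements not nested inside another <label>."""
    names = []
    depth = 0
    # tag='label' restricts both 'start' and 'end' events to <label> elements.
    for event, element in etree.iterparse(source, events=("start", "end"), tag="label"):
        if event == "start":
            depth += 1
        else:
            depth -= 1
            # depth returns to 0 only when an outermost <label> closes
            if depth == 0:
                names.extend(element.xpath("name/text()"))
    return names


print(top_level_names(io.BytesIO(SAMPLE)))  # ['Top']
```

Checking the level only inside the 'end' branch makes it explicit that the element is fully parsed when its children are read.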

As an alternative solution, parse the whole file and then use XPath to get the label names:
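
    import gzip
    from lxml import html

    # Use a separate name for the file handle so the path f is not shadowed
    with gzip.open(f, 'rb') as fh:
        file_content = fh.read()
    tree = html.fromstring(file_content)
    # Note: '//label' matches <label> elements at any depth
    name = tree.xpath('//label/name/text()')
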
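
One caveat with the whole-file approach is that '//label' also matches nested labels. The sketch below uses etree.parse instead of html.fromstring (a substitution, not the answer's code) and assumes, as in the question's snippet, that the root element itself is <label>; the sample data is invented. It compares a search at any depth with a path anchored at the root.

```python
import io
from lxml import etree

# Hypothetical sample: the nested label also carries a <name>.
SAMPLE = (
    b"<label><name>Top</name>"
    b"<sublabels><label><name>Nested</name></label></sublabels>"
    b"</label>"
)

tree = etree.parse(io.BytesIO(SAMPLE))

# '//label' searches the whole document, so nested labels are included.
anywhere = tree.xpath("//label/name/text()")

# '/label' is anchored at the document root, so only the outermost label matches.
root_only = tree.xpath("/label/name/text()")

print(anywhere)   # ['Top', 'Nested']
print(root_only)  # ['Top']
```

This loads the entire document into memory, so it may not be suitable for very large files, which is typically the reason for using iterparse in the first place.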

The first thing that came to my mind:

    path = []
    for event, element in etree.iterparse(gzip.GzipFile(f), events=('start', 'end')):
        if event == 'start':
            # Track the tags of all currently open elements
            path.append(element.tag)
        elif event == 'end':
            if element.tag == 'label':
                # Skip labels that sit anywhere below a <sublabels> element
                if 'sublabels' not in path:
                    name = element.xpath('name/text()')
            path.pop()

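
The path-stack idea can be exercised end to end, including the gzip step, with the sketch below. The <labels> root element, the name values, and the function name names_outside_sublabels are assumptions for illustration; the data is compressed in memory rather than read from disk.

```python
import gzip
import io
from lxml import etree

# Hypothetical input: two top-level labels under an assumed <labels> root.
XML = b"""<labels>
  <label><name>A</name><sublabels><label><name>A1</name></label></sublabels></label>
  <label><name>B</name></label>
</labels>"""


def names_outside_sublabels(fileobj):
    """Collect <name> texts of <label> elements that are not below a <sublabels>."""
    path = []   # tags of the currently open elements, outermost first
    names = []
    for event, element in etree.iterparse(fileobj, events=("start", "end")):
        if event == "start":
            path.append(element.tag)
        else:
            if element.tag == "label" and "sublabels" not in path:
                names.extend(element.xpath("name/text()"))
            path.pop()
    return names


# Compress in memory to mimic reading a .gz file.
compressed = io.BytesIO(gzip.compress(XML))
with gzip.GzipFile(fileobj=compressed) as fh:
    result = names_outside_sublabels(fh)

print(result)  # ['A', 'B']
```

Unlike the level counter, this version does not pass tag='label' to iterparse, because it needs the 'start' events of <sublabels> to know where it is in the tree.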