Iterating over XML using findall, lxml
I have the following XML:
<head>
<body>
<para>
<Run>
<RunProp>
<highlight val="red"/>
<break/>
<text>
Hello there
</text>
</RunProp>
</Run>
<Run>
<break/>
</Run>
<Run>
<text>
See you there
</text>
</Run>
</para> ..
</body>
</head>
I want to extract all text that has the highlight "red" value. Note that the highlight tag is one level down from the text tag. The conditions are:
- If a break tag is encountered while going up to the parent of the highlight tag, add a space.
- Extract the text corresponding to the highlight tag.
What I have done is:
text = ""  # initialize an empty string
for p in lxml_tree.findall('para'):  # iterate over each paragraph (all paragraphs share the tag name para)
    for r in p.findall("Run"):  # iterate over each run
        for a in r.iter(tag="highlight"):  # search for the highlight tag
            for b in a.iterancestors():  # go back up through the parents
                if b.tag == "break":  # if a break is found
                    text += " "  # add a space
                elif b.tag == "text":  # if text is found
                    text += ''.join(b.text)  # add the text
The above doesn't seem to work, because iterancestors travels all the way up to the root node. How can I iterate over just the parents, i.e. RunProp, break and text? I have implemented something similar for all the text, and that worked.
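To illustrate why the ancestor walk never reaches break or text: in the sample XML those elements are siblings of highlight (children of the same RunProp), while iterancestors only yields RunProp, Run, para, body and head. A minimal sketch of one way around this is to walk the children of the element that directly contains the red highlight. The helper name red_highlight_text, the compacted SAMPLE string and the whitespace stripping are assumptions made for this illustration and are not part of the original post. The import falls back to the standard library's xml.etree.ElementTree, which supports the calls used here.

```python
try:
    from lxml import etree  # the library used in the question
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, same calls below

# Compacted copy of the sample document (an assumption for this sketch).
SAMPLE = """<head><body><para>
<Run><RunProp><highlight val="red"/><break/><text>Hello there</text></RunProp></Run>
<Run><break/></Run>
<Run><text>See you there</text></Run>
</para></body></head>"""


def red_highlight_text(root):
    """Collect text that shares a parent with a red <highlight>.

    A <break> under the same parent contributes a single space.
    """
    parts = []
    for parent in root.iter():
        # Skip elements that do not directly contain a red highlight.
        if parent.find('highlight[@val="red"]') is None:
            continue
        # The parent's children are the highlight and its siblings,
        # visited in document order.
        for child in parent:
            if child.tag == "break":
                parts.append(" ")
            elif child.tag == "text":
                # Stripping surrounding whitespace is a choice made here.
                parts.append((child.text or "").strip())
    return "".join(parts)


root = etree.fromstring(SAMPLE)
highlight = root.find(".//highlight")
if hasattr(highlight, "iterancestors"):  # only available with lxml
    # Ancestors only: RunProp, Run, para, body, head (no break, no text).
    print([a.tag for a in highlight.iterancestors()])
print(repr(red_highlight_text(root)))  # ' Hello there'
```

With lxml, a.getparent() or a.itersiblings() would reach the same sibling elements starting from the highlight itself.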
Edit 1:
The logic above was simply flawed. I would rather iterate over each Run in a paragraph, search for break first, then check whether there is a highlight within the RunProp, and then extract the text in the parent's sibling.
I managed to fix it after some thought and an idea from anzel's answer.
text = ""
for p in lxml_tree.findall('para'):  # iterate over paragraphs
    text += " "  # add a space
    for r in p.findall("Run"):  # iterate over each run in the paragraph
        for a in r.findall("break"):  # search for break tags in it and add a space for each one found
            text += " "
        for b in r.findall('.//highlight[@val="red"]/../..//text'):  # search for a red highlight in that run and return its text
            text += ''.join(b.text)  # append the text to the main string
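One thing worth checking in the loop above is where lxml_tree points. findall('para') only matches direct children of the element it is called on, so if lxml_tree is the <head> root of the sample, the outer loop never runs, whereas './/para' searches all descendants. The following is a rough sketch of the same approach wrapped in a function under that assumption. The name extract_red_text, the compacted SAMPLE string and the whitespace stripping are illustrative choices and not part of the original post, and the path expression is simplified to "the run contains a red highlight somewhere".

```python
try:
    from lxml import etree  # the library used in the question
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, same calls below

# Compacted copy of the sample document (an assumption for this sketch).
SAMPLE = """<head><body><para>
<Run><RunProp><highlight val="red"/><break/><text>Hello there</text></RunProp></Run>
<Run><break/></Run>
<Run><text>See you there</text></Run>
</para></body></head>"""


def extract_red_text(root):
    """Rebuild the text string run by run, as in the loop above."""
    text = ""
    for para in root.findall(".//para"):  # descendant search from any root
        text += " "  # one space per paragraph
        for run in para.findall("Run"):
            # Direct <break> children only, as in the loop above.
            text += " " * len(run.findall("break"))
            # Keep the run's text only if it holds a red highlight.
            if run.find('.//highlight[@val="red"]') is not None:
                for node in run.iter("text"):
                    text += (node.text or "").strip()
    return text


root = etree.fromstring(SAMPLE)
print(len(root.findall("para")))     # 0: <para> is not a direct child of <head>
print(len(root.findall(".//para")))  # 1
print(repr(extract_red_text(root)))  # ' Hello there '
```

Note that the break nested inside RunProp adds no space here, because findall("break") on a Run only looks at its direct children. Use './/break' if nested breaks should count as well.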
Since your XML has a positional pattern in which <highlight>, <break/> and <text> follow one another as siblings, you don't actually need to go back to the parent.
I'm going to use iter and getnext to achieve what you need:
from lxml import etree

html = '''
<head>
<body>
<para>
<Run>
<RunProp>
<highlight val="red" />
<break/>
<text>
Hello there
</text>
</RunProp>
</Run>
<Run>
<break/>
</Run>
<Run>
<text>
See you there
</text>
</Run>
</para> ..
</body>
</head>'''

tree = etree.fromstring(html)

for node in tree.iter():
    if node.tag == 'para':
        # Prepend the placeholder to the paragraph's leading text.
        node.text = '..your space here..' + (node.text or '')
        print(node.text)
    if node.tag == 'highlight':
        print(node.values())  # attribute values of <highlight>
        sibling = node.getnext()  # element right after <highlight>, or None
        if sibling is not None and sibling.tag == 'break':
            print(sibling.tag)
            after_break = sibling.getnext()  # element right after <break>, or None
            if after_break is not None and after_break.tag == 'text':
                # Prepend the placeholder to the text that follows the break.
                after_break.text = '..your space here..' + (after_break.text or '')
                print(after_break.text)
        elif sibling is not None and sibling.tag == 'text':
            print(sibling.text)
This prints (blank lines omitted):
..your space here..
['red']
break
..your space here..
Hello there
To write the changes to a file:
etree.ElementTree(tree).write('output.xml', pretty_print=True)
cat output.xml
<head>
<body>
<para>..your space here..
<Run>
<RunProp>
<highlight val="red"/>
<break/>
<text>..your space here..
Hello there
</text>
</RunProp>
</Run>
<Run>
<break/>
</Run>
<Run>
<text>
See you there
</text>
</Run>
</para> ..
</body>
</head>