
Iterating over XML using findall, lxml

I have the following XML:

<head>
  <body>
    <para>
      <Run>
        <RunProp>
           <highlight val="red"/>
        <break/>
        <text>
         Hello there
        </text>
        </RunProp>
      </Run>
      <Run>
        <break/>
      </Run>
      <Run>
         <text>
          See you there
         </text>
      </Run>
    </para> ..
  </body>
</head>  

I want to extract all text that carries a highlight with the value "red". Note that the highlight tag is one level down from the text tag. The conditions are:

  1. For every paragraph, add an extra space.
  2. If a break tag is encountered while iterating over the parents of the highlight tag, add a space.
  3. Extract only the text corresponding to the highlight tag.

What I have done is:

text = ""                                  # initialize an empty string
for p in lxml_tree.findall('para'):        # iterate over each paragraph (all paragraphs share the tag name para)
    for r in p.findall("Run"):             # iterate over each run
        for a in r.iter(tag="highlight"):  # search for the highlight tag
            for b in a.iterancestors():    # walk up through the ancestors
                if b.tag == "break":       # if a break is found
                    text += " "            # add a space
                elif b.tag == "text":      # if a text element is found
                    text += ''.join(b.text)  # add its text

The above doesn't seem to work, because iterancestors travels all the way up to the root node. How can I iterate over just the nearby elements, i.e. RunProp, break, and text? I have implemented something similar for all the text (without the highlight condition), and that worked.
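To make the problem concrete, here is a minimal sketch of what iterancestors actually yields for a highlight element, compared with itersiblings. The trimmed-down sample below is an assumption modelled on the XML in the question, not the real document.

```python
from lxml import etree

# Trimmed-down sample modelled on the question's XML (an assumption for illustration).
xml = """
<para>
  <Run>
    <RunProp>
      <highlight val="red"/>
      <break/>
      <text>Hello there</text>
    </RunProp>
  </Run>
</para>
"""

root = etree.fromstring(xml)
highlight = root.find(".//highlight")

# iterancestors() only walks upward: parent, grandparent, ... up to the root.
ancestors = [el.tag for el in highlight.iterancestors()]
print(ancestors)  # ['RunProp', 'Run', 'para']

# itersiblings() stays on the same level and yields the following siblings.
siblings = [el.tag for el in highlight.itersiblings()]
print(siblings)   # ['break', 'text']
```

Because break and text are siblings of highlight in this sample, they can never show up among its ancestors, which is why the tag checks inside the iterancestors loop never match.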

Edit 1:
The logic above is simply flawed. I would rather iterate over each Run in a paragraph, search for break first, then check whether there is a highlight within the RunProp, and then extract the text from the parent's sibling.

I managed to fix it after some thought and after getting an idea from anzel's answer.

text = ""
for p in lxml_tree.findall('para'):    # iterate over paragraphs
    text += " "                        # add a space for each paragraph
    for r in p.findall("Run"):         # iterate over each run in the paragraph
        for a in r.findall("break"):   # add a space for each break tag directly under the run
            text += " "
        # find a red highlight in this run, climb two levels, and collect the text elements
        for b in r.findall('.//highlight[@val="red"]/../..//text'):
            text += b.text             # append the text to the main string
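For reference, the same approach can be written as a self-contained sketch that runs against a copy of the sample XML. The function name extract_highlighted and its colour parameter are hypothetical, and the text elements are written without surrounding whitespace to keep the result easy to read. Note one deliberate difference: findall('para') only matches direct children of the element it is called on, so this sketch uses iter("para") to find paragraphs at any depth when starting from the document root.

```python
from lxml import etree

# Copy of the sample XML, with whitespace inside <text> trimmed (an assumption).
xml = """
<head>
  <body>
    <para>
      <Run>
        <RunProp>
          <highlight val="red"/>
          <break/>
          <text>Hello there</text>
        </RunProp>
      </Run>
      <Run>
        <break/>
      </Run>
      <Run>
        <text>See you there</text>
      </Run>
    </para>
  </body>
</head>
"""


def extract_highlighted(root, colour="red"):
    """Hypothetical helper: collect text from runs containing a matching highlight."""
    parts = []
    for para in root.iter("para"):          # paragraphs at any depth
        parts.append(" ")                   # one space per paragraph
        for run in para.findall("Run"):
            # one space per <break/> that is a direct child of the run
            parts.extend(" " for _ in run.findall("break"))
            # from a matching highlight, climb back to the run and take its text elements
            path = f'.//highlight[@val="{colour}"]/../..//text'
            for text_el in run.findall(path):
                parts.append(text_el.text or "")
    return "".join(parts)


result = extract_highlighted(etree.fromstring(xml))
print(repr(result))  # ' Hello there '
```

The third run has no highlight, so "See you there" is skipped, and the break nested inside RunProp adds no space because only direct children of Run are checked.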

Since your XML has a positional pattern in which <highlight>, <break/> and <text> follow one another as siblings, you don't actually need to go back to the parent.

I'm going to use iter and getnext to achieve what you need:

from lxml import etree

html = '''
<head>
  <body>
    <para>
      <Run>
        <RunProp>
           <highlight val="red" />
        <break/>
        <text>
         Hello there
        </text>
        </RunProp>
      </Run>
      <Run>
        <break/>
      </Run>
      <Run>
         <text>
          See you there
         </text>
      </Run>
    </para> ..
  </body>
</head>'''

tree = etree.fromstring(html)

for node in tree.iter():
    if node.tag == 'para':
        # prepend the per-paragraph space to the paragraph's leading text
        node.text = '..your space here..' + node.text
        print(node.text)
    if node.tag == 'highlight':
        print(node.values())  # attribute values of <highlight>
        # getnext() returns the next sibling on the same level
        if node.getnext().tag == 'break':
            print(node.getnext().tag)
            if node.getnext().getnext().tag == 'text':
                # a break sits between highlight and text: prepend the space
                node.getnext().getnext().text = \
                    '..your space here..' + node.getnext().getnext().text
                print(node.getnext().getnext().text)
        elif node.getnext().tag == 'text':
            print(node.getnext().text)

..your space here..

['red']
break
..your space here..
         Hello there

To write the changes to a file:

etree.ElementTree(tree).write('output.xml', pretty_print=True)

cat output.xml
<head>
  <body>
    <para>..your space here..
      <Run>
        <RunProp>
           <highlight val="red"/>
        <break/>
        <text>..your space here..
         Hello there
        </text>
        </RunProp>
      </Run>
      <Run>
        <break/>
      </Run>
      <Run>
         <text>
          See you there
         </text>
      </Run>
    </para> ..
  </body>
</head>
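If you only want to check the modified tree without writing a file, etree.tostring can serialize it in memory. This is a small sketch on a made-up two-element document, not the full sample above.

```python
from lxml import etree

# Tiny stand-in document (an assumption for illustration).
root = etree.fromstring("<para><text>Hello there</text></para>")
text_el = root.find("text")
text_el.text = " " + text_el.text  # prepend the extra space in place

# tostring() returns bytes by default; encoding="unicode" returns a str instead.
serialized = etree.tostring(root, pretty_print=True, encoding="unicode")
print(serialized)
```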
