繁体   English   中英

如何使用python中的xml.etree.ElementTree解析当前节点中的所有子元素和孙元素元素

[英]How to parse all children and grandchildren elements from a current node using xml.etree.ElementTree in python

我正在提取xml文档中的所有文本。 我想寻找标签说明 ,然后搜索所有的孩子和孙子,可能会有更多的元素,然后提取文本。

这是我的代码,但它无法在孙子标签内获取文本:

for element in root.find('description'):
    print 'parent: ', element.tag, '|', element.attrib
    try:
        data.write(element.text)
        for all_tags in element.findall('./'):
            print 'child: ', all_tags.tag, '|', all_tags.attrib
            if all_tags.text:
                data.write('\n')
                data.write(all_tags.text)
                if all_tags.tail:
                    data.write('\n')
                    data.write(all_tags.tail)
                    data.write('\n')
        data.write('\n')
    except TypeError:
        pass
    except UnicodeEncodeError:
        unicodestr = element.text.encode("utf-8")
        data.write(unicodestr)

    data.write('\n')

问题出在for all_tags循环中。

样本输入:

<description>
<p num="p-0003">
Protein kinases are involved in the signal transduction pathways linking growth factors, hormones and other cell regulation molecules to cell growth, survival and metabolism under both normal and pathological conditions. One such protein kinase, protein kinase B (also known as Akt), is a serine/threonine kinase that plays a central role in promoting the proliferation and survival of a wide range of cell types, thereby protecting cells from apoptosis (programmed cell death) (Khwaja,
<i>Nature</i>
33-34 (1990)). Three members of the Akt/PKB subfamily of second-messenger regulated serine/threonine protein kinases have been identified and are termed Akt1/PKBα, Akt2/PKBβ, and Akt3/PKBγ. A number of proteins involved in cell proliferation and survival have been described as substrates of Akt in cells. Two examples of such substrates include glycogen synthase kinase-3 (GSK3) and Forkhead transcription factors (FKs). See Brazil and Hemmings,
<i>Trends in Biochemical Sciences</i>
26, 675-664.
</p>
<p num="p-0004">
A number of protein kinases and phosphatases regulate the activity of Akt; For instance, activation of Akt is mediated by phosphatidylinositol 3-kinase (PI3-K), which initiates the binding of second messenger phospholipids to the pleckstrin homology (PH) binding domain of Akt. The binding anchors Akt to plasma membrane and results in phosphorylation and activation of the enzyme. Amplifications of the catalytic subunit of PI3-K, p110α, or mutations in the PI3-K regulatory subunit, p85α, lead to activation of Akt in several types of human cancer. (Vivanco and Sawyers,
<i>Nature Reviews in Cancer</i>
(2002) 2: 489-501.
</p>
<p num="p-0005">
The tumor suppressor, PTEN, is a critical negative regulator of Akt activation by PI3-K. Myers et al.
</p>
</description>

在此输入中, <i> Nature </i>之后的文本被遗漏并被第一行中的文本替换。 我相信这是因为all_tags.tail从父标签获取文本而不是从子孙标签中获取文本。

element.findall('./')显式只查找标记的直接子节点。 找到所有后代的表达式是.// (双斜线)。

针对给定样本的循环的简化版本,然后导致:

>>> for element in root:
...     print 'parent: ', element.tag, '|', element.attrib
...     print element.text
...     for all_tags in element.findall('.//'):
...         print 'child: ', all_tags.tag, '|', all_tags.attrib
...         if all_tags.text:
...             print all_tags.text, '|', all_tags.tail
... 
parent:  p | {'num': 'p-0003'}

Protein kinases are involved in the signal transduction pathways linking growth factors, hormones and other cell regulation molecules to cell growth, survival and metabolism under both normal and pathological conditions. One such protein kinase, protein kinase B (also known as Akt), is a serine/threonine kinase that plays a central role in promoting the proliferation and survival of a wide range of cell types, thereby protecting cells from apoptosis (programmed cell death) (Khwaja,

child:  i | {}
Nature | 
33-34 (1990)). Three members of the Akt/PKB subfamily of second-messenger regulated serine/threonine protein kinases have been identified and are termed Akt1/PKBα, Akt2/PKBβ, and Akt3/PKBγ. A number of proteins involved in cell proliferation and survival have been described as substrates of Akt in cells. Two examples of such substrates include glycogen synthase kinase-3 (GSK3) and Forkhead transcription factors (FKs). See Brazil and Hemmings,

child:  i | {}
Trends in Biochemical Sciences | 
26, 675-664.

parent:  p | {'num': 'p-0004'}

A number of protein kinases and phosphatases regulate the activity of Akt; For instance, activation of Akt is mediated by phosphatidylinositol 3-kinase (PI3-K), which initiates the binding of second messenger phospholipids to the pleckstrin homology (PH) binding domain of Akt. The binding anchors Akt to plasma membrane and results in phosphorylation and activation of the enzyme. Amplifications of the catalytic subunit of PI3-K, p110α, or mutations in the PI3-K regulatory subunit, p85α, lead to activation of Akt in several types of human cancer. (Vivanco and Sawyers,

child:  i | {}
Nature Reviews in Cancer | 
(2002) 2: 489-501.

parent:  p | {'num': 'p-0005'}

The tumor suppressor, PTEN, is a critical negative regulator of Akt activation by PI3-K. Myers et al.

或使用repr()来显示字符串文字:

parent:  p | {'num': 'p-0003'}
'\nProtein kinases are involved in the signal transduction pathways linking growth factors, hormones and other cell regulation molecules to cell growth, survival and metabolism under both normal and pathological conditions. One such protein kinase, protein kinase B (also known as Akt), is a serine/threonine kinase that plays a central role in promoting the proliferation and survival of a wide range of cell types, thereby protecting cells from apoptosis (programmed cell death) (Khwaja,\n'
child:  i | {}
'Nature' | u'\n33-34 (1990)). Three members of the Akt/PKB subfamily of second-messenger regulated serine/threonine protein kinases have been identified and are termed Akt1/PKB\u03b1, Akt2/PKB\u03b2, and Akt3/PKB\u03b3. A number of proteins involved in cell proliferation and survival have been described as substrates of Akt in cells. Two examples of such substrates include glycogen synthase kinase-3 (GSK3) and Forkhead transcription factors (FKs). See Brazil and Hemmings,\n'
child:  i | {}
'Trends in Biochemical Sciences' | '\n26, 675-664.\n'
parent:  p | {'num': 'p-0004'}
u'\nA number of protein kinases and phosphatases regulate the activity of Akt; For instance, activation of Akt is mediated by phosphatidylinositol 3-kinase (PI3-K), which initiates the binding of second messenger phospholipids to the pleckstrin homology (PH) binding domain of Akt. The binding anchors Akt to plasma membrane and results in phosphorylation and activation of the enzyme. Amplifications of the catalytic subunit of PI3-K, p110\u03b1, or mutations in the PI3-K regulatory subunit, p85\u03b1, lead to activation of Akt in several types of human cancer. (Vivanco and Sawyers,\n'
child:  i | {}
'Nature Reviews in Cancer' | '\n(2002) 2: 489-501.\n'
parent:  p | {'num': 'p-0005'}
'\nThe tumor suppressor, PTEN, is a critical negative regulator of Akt activation by PI3-K. Myers et al.\n'

你可能想要使用itertext(),但如果你想稍微加强你的游戏,你应该试试xpath 在这些类型的情况下它真的很棒。

这是一个例子 - 我给出的xpath基本上说:

在XML文档中的任何位置查找所有标记,然后返回该文本以及所有子文本。

#!/usr/bin/python

from lxml import etree

tree = etree.fromstring(open('t.xml').read())

for el in tree.xpath('//description/descendant-or-self::*/text()'):
    print el

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM