使用 python 提取 XML 数据

Question

我在 XML （命名为bibliography.xml ）的<list>中有大量不同作者及其所选作品的列表。 这是一个例子：

<list type="index">
                <item><persName>Poe, Edgar Allan</persName>, <note>1809—1849</note>, <bibl>The Black
                        Cat 1843 (<abbr>Cat.</abbr>).</bibl> — <bibl>The Gold-Bug 1843
                            (<abbr>Bug.</abbr>).</bibl> — <bibl>The Raven 1845
                        (<abbr>Rav.</abbr>).</bibl></item>

                <item><persName>Melville, Herman</persName>, <bibl>Benito Cereno 1855
                            (<abbr>Ben.</abbr>)</bibl> — <bibl>Moby-Dick 1851
                        (<abbr>MobD.</abbr>)</bibl> — <bibl>Typee: A Peep at Polynesian Life 1846
                            (<abbr>PolyL.</abbr>)</bibl></item>
                
                <item><persName>Barth, John</persName>, <note>(*1930)</note>, <bibl>The Sot-Weed
                        Factor 1960 (<abbr>Fac.</abbr>)</bibl> — <bibl>Giles Goat-Boy 1960
                            (<abbr>Gil.</abbr>)</bibl></item>
            </list>

import xml.etree.ElementTree as ET

tree = ET.parse('bibliography.xml')
root = tree.getroot()

for work in root:
    if(work.tag=='item'):
        print work.get('persName')
            if (attr.tag=='abbr')
                print (attr.text)

显然它不起作用，但由于我对 python 完全陌生，所以我无法思考我做错了什么。 如果有人可以在这里帮助我，将不胜感激。

Answer 1

即使我尝试了与您相同的方式并遇到了同样的问题。 我别无选择，只能将整个 xml 转换为 pretty-xml，并将其视为单个字符串。 然后将每一行迭代到一个特定的标签。

import xml.dom.minidom

dom = xml.dom.minidom.parse("bibliography.xml")
pretty_xml = dom.toprettyxml()
pretty_xml = pretty_xml.split("\n")
start, end = [], [] # store the beginning and the end of "item" tag

for idx in range(len(pretty_xml)):
        if "item" in pretty_xml[idx]:
            if "/" not in pretty_xml[idx]:
                start.append(idx)
            else:
                end.append(idx)

现在你知道在 start[0] 和 end[0] 之间你有你的第一个数据点可用。 同样明智地使用“if”条件依次迭代两个列表的所有元素，结构有点像这样（我不是在编写整个代码）：

for idx in range(len(start)):
    for line in pretty_xml[start[idx] + 1 : end[idx]]:
        line.split("persName")[1].replace("<","").replace(">","").replace("/","")
         ...
         ...

（如果您找到更好的结构化方法，请告诉我。）

Answer 2

考虑使用XPath来获取数据。 只需调用tree.xpath("//item")即可返回所有项目。

下面是一个基于 XML 片段的工作示例。 tree.getroot()将仅根据完整的 xml 起作用。

基本工作示例：

import lxml.etree as etree

xml = '''<list type="index">
            <item><persName>Poe, Edgar Allan</persName>, <note>1809—1849</note>, <bibl>The Black
                    Cat 1843 <abbr>(Cat.).</abbr></bibl> — <bibl>The Gold-Bug 1843
                        <abbr>(Bug.)</abbr>.</bibl> — <bibl>The Raven 1845
                    <abbr>(Rav.)</abbr>.</bibl></item>

            <item><persName>Melville, Herman</persName>, <bibl>Benito Cereno 1855
                        (<abbr>Ben.</abbr>)</bibl> — <bibl>Moby-Dick 1851
                    (<abbr>MobD.</abbr>)</bibl> — <bibl>Typee: A Peep at Polynesian Life 1846
                        (<abbr>PolyL.</abbr>)</bibl></item>
            
            <item><persName>Barth, John</persName>, <note>(*1930)</note>, <bibl>The Sot-Weed
                    Factor 1960 (<abbr>Fac.</abbr>)</bibl> — <bibl>Giles Goat-Boy 1960
                        (<abbr>Gil.</abbr>)</bibl></item>
        </list>
'''
tree = etree.fromstring(xml)
#root = tree.getroot()

for work in tree.xpath("//item"):
    persName = work.find('persName').text.strip()
    abbr =' '.join([x.text for x in work.xpath('bibl/abbr')])
    print (f'{persName} {abbr}')

Output：

Poe, Edgar Allan (Cat.). (Bug.) (Rav.)
Melville, Herman Ben. MobD. PolyL.
Barth, John Fac. Gil.

使用 python 提取 XML 数据

问题描述

2 个解决方案

解决方案1
0 2021-02-27 12:43:41

解决方案2
0 2021-02-27 13:11:03

使用 python 提取 XML 数据

问题描述

2 个解决方案

解决方案1 0 2021-02-27 12:43:41

解决方案2 0 2021-02-27 13:11:03

解决方案1
0 2021-02-27 12:43:41

解决方案2
0 2021-02-27 13:11:03