Python XML distinct list of child tags values

I'm writing my first Python XML query and not succeeding. I have the following code:

import xml.etree.ElementTree as ET
root = ET.parse(r'wiktionary.xml').getroot()
s = set()
for ns_tag in root.findall('page/ns'):
    value = ns_tag.text
    s.add(value)
for page_tag in root.findall('page'):
    print(page_tag.find('ns').text)

The XML is of the form:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="af">
  <siteinfo>
    <sitename>Wiktionary</sitename>
    <dbname>afwiktionary</dbname>
    <base>https://af.wiktionary.org/wiki/Tuisblad</base>
    <generator>MediaWiki 1.33.0-wmf.19</generator>
    <case>case-sensitive</case>
    <namespaces>
      <namespace key="-2" case="case-sensitive">Media</namespace>
      <namespace key="-1" case="first-letter">Spesiaal</namespace>
      <namespace key="0" case="case-sensitive" />
      <namespace key="1" case="case-sensitive">Bespreking</namespace>
      <namespace key="2" case="first-letter">Gebruiker</namespace>
      <namespace key="3" case="first-letter">Gebruikerbespreking</namespace>
      <namespace key="4" case="case-sensitive">Wiktionary</namespace>
      <namespace key="5" case="case-sensitive">Wiktionarybespreking</namespace>
      <namespace key="6" case="case-sensitive">Lêer</namespace>
      <namespace key="7" case="case-sensitive">Lêerbespreking</namespace>
      <namespace key="8" case="first-letter">MediaWiki</namespace>
      <namespace key="9" case="first-letter">MediaWikibespreking</namespace>
      <namespace key="10" case="case-sensitive">Sjabloon</namespace>
      <namespace key="11" case="case-sensitive">Sjabloonbespreking</namespace>
      <namespace key="12" case="case-sensitive">Hulp</namespace>
      <namespace key="13" case="case-sensitive">Hulpbespreking</namespace>
      <namespace key="14" case="case-sensitive">Kategorie</namespace>
      <namespace key="15" case="case-sensitive">Kategoriebespreking</namespace>
      <namespace key="828" case="case-sensitive">Module</namespace>
      <namespace key="829" case="case-sensitive">Module talk</namespace>
      <namespace key="2300" case="case-sensitive">Gadget</namespace>
      <namespace key="2301" case="case-sensitive">Gadget talk</namespace>
      <namespace key="2302" case="case-sensitive">Gadget definition</namespace>
      <namespace key="2303" case="case-sensitive">Gadget definition talk</namespace>
    </namespaces>
  </siteinfo>
  <page>
    <title>MediaWiki:Edithelppage</title>
    <ns>8</ns>
    <id>21</id>
    <revision>
      <id>17266</id>
      <parentid>3081</parentid>
      <timestamp>2006-06-23T20:14:26Z</timestamp>
      <contributor>
        <username>Manie</username>
        <id>18</id>
      </contributor>
      <minor />
      <comment>typo</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">{{ns:4}}:Redigeer</text>
      <sha1>nrxi3shwkpass1der614rzu1wcrjdok</sha1>
    </revision>
  </page>
  <page>
    <title>MediaWiki:Sitesubtitle</title>
    <ns>8</ns>
    <id>70</id>
    <revision>
      <id>17618</id>
      <parentid>7587</parentid>
      <timestamp>2006-06-26T16:10:58Z</timestamp>
      <contributor>
        <username>Manie</username>
        <id>18</id>
      </contributor>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">Die vrye woordeboek</text>
      <sha1>3gmth4w27p5u4mdo8yo8qbb2cj47l1b</sha1>
    </revision>
  </page>
</mediawiki>

This code also prints nothing:

from lxml import etree

tree = etree.parse(r'E:\Downloads\WikipediaAF\test2.xml')
root = tree.getroot()
for ns_tag in root.findall('page'):
    for tag in ns_tag.getchildren():
        if tag.tag == 'ns':
            print(tag.text)

I'm trying to extract a distinct list of the values between the <ns> tags, but the set s comes back empty and nothing is printed either.

Does anybody know where I'm going wrong?

If I understand your question correctly, try this (a somewhat different approach from yours, but one I'm more used to):

wik = """
<mediawiki>
 <siteinfo>
 ...
 </siteinfo>
 <page>
  <title>MediaWiki:Edithelppage</title>
  <ns>8</ns>
<id>21</id>
</page>
</mediawiki>
 """
import lxml.html as LH
root = LH.fromstring(wik)
for ns_tag in root.findall('page'):
    for tag in ns_tag.getchildren():
        print(tag.text)

Output:

MediaWiki:Edithelppage
8
21

You can modify it to add the output to sets, lists or whatnot.
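As a rough illustration of collecting the values into a set, here is a minimal sketch. It swaps in the standard library's xml.etree.ElementTree instead of lxml.html, and it assumes the same simplified, namespace-free sample as above; the third page (Tuisblad, ns 0) is made up purely so that more than one distinct value shows up.

```python
import xml.etree.ElementTree as ET

# Simplified sample WITHOUT the xmlns declaration (an assumption of this sketch;
# the real dump declares a default namespace). The last page is hypothetical.
sample = """
<mediawiki>
  <page><title>MediaWiki:Edithelppage</title><ns>8</ns><id>21</id></page>
  <page><title>MediaWiki:Sitesubtitle</title><ns>8</ns><id>70</id></page>
  <page><title>Tuisblad</title><ns>0</ns><id>1</id></page>
</mediawiki>
"""

root = ET.fromstring(sample)

# A set comprehension keeps each <ns> value only once.
distinct_ns = {page.findtext('ns') for page in root.findall('page')}

# Sort numerically for a stable, readable listing.
print(sorted(distinct_ns, key=int))
```

The values stay strings, as they are in the XML; convert with int() if you need numbers.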

Since you're only looking for the ns element, maybe this will work:

for ns_tag in root.findall('page'):
    for tag in ns_tag.getchildren():
        if tag.tag == 'ns':
            print(tag.text)

Output:

8

Is this the idea?

It turns out it needed namespaces. This code worked:

from lxml import etree

def parseBookXML(xmlFile):

    with open(xmlFile, encoding='UTF8') as fobj:
        xml = fobj.read()

    root = etree.fromstring(xml)

    s = set()
    for ns_tag in root.findall('{http://www.mediawiki.org/xml/export-0.10/}page'):
        for tag in ns_tag.getchildren():
            if tag.tag == '{http://www.mediawiki.org/xml/export-0.10/}ns':
                s.add(tag.text)
    print(s)

if __name__ == "__main__":
    parseBookXML(r'E:\Downloads\WikipediaAF\afwiktionary-20190301-pages-articles.xml')
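For anyone hitting the same thing with the standard library, here is a small sketch of why the unqualified path matched nothing and how a prefix map avoids spelling out the URI in every tag. It uses xml.etree.ElementTree rather than lxml, a trimmed two-page sample in place of the real dump, and an arbitrary prefix mw that I picked for the example; none of those come from the code above.

```python
import xml.etree.ElementTree as ET

# Trimmed sample that keeps the default namespace declaration from the dump.
sample = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>MediaWiki:Edithelppage</title><ns>8</ns><id>21</id></page>
  <page><title>MediaWiki:Sitesubtitle</title><ns>8</ns><id>70</id></page>
</mediawiki>"""

root = ET.fromstring(sample)

# Every element inherits the default namespace, so tags are stored as "{uri}name".
print(root.tag)

# 'mw' is an arbitrary prefix chosen for this sketch; only the URI has to match.
NS = {'mw': 'http://www.mediawiki.org/xml/export-0.10/'}

unqualified = root.findall('page/ns')          # no namespace given: no matches
qualified = root.findall('mw:page/mw:ns', NS)  # prefixes resolved through NS

distinct = {el.text for el in qualified}
print(len(unqualified), distinct)
```

The second argument to findall is the prefix-to-URI mapping; the prefix itself never has to appear in the XML. On Python 3.8 or later, ElementTree should also accept a wildcard path such as '{*}page/{*}ns' if you would rather not name the namespace at all.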
