How to traverse through XML document tags using lxml similarly to ElementTree

I'm currently editing an XML document in which I have to change a few tags and their attributes. Until now I was using the ElementTree library, but I ran into problems with namespace preservation, so I'm trying to rewrite my script to use lxml. ElementTree did, however, feel very logical to me for traversing the document's tags. As an example, the code below removes the Ext tag from the XML and changes the text of the Resolution tag to a different value.

ElementTree:

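import xml.etree.ElementTree as ET

# adiPath is the path to the ADI XML file, defined elsewhere in the script.
# Register every prefix declared in the file so it is reused on output.
namespaces = dict(elem for _, elem in ET.iterparse(adiPath, events=['start-ns']))
for prefix, uri in namespaces.items():
    ET.register_namespace(prefix, uri)

# Assumption: root is the root element of the same file.
root = ET.parse(adiPath).getroot()
for asset in root.findall('.//{*}Asset'):
    if 'title:TitleType' in asset.attrib.values():
        ext = asset.find('.//{*}Ext')
        if ext is not None:
            asset.remove(ext)
    if 'content:PreviewType' in asset.attrib.values():
        resolution = asset.find('.//{*}Resolution')
        if resolution is not None:
            resolution.text = 'different value'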

Is it possible to iterate through the XML file in a similar way to the code above, but using lxml instead of ET?

XML file:

<?xml version="1.0" encoding="utf-8"?>
<ADI3 xmlns="urn:cablelabs:md:xsd:core:3.0"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns:content="urn:cablelabs:md:xsd:content:3.0"
      xmlns:core="urn:cablelabs:md:xsd:core:3.0"
      xmlns:offer="urn:cablelabs:md:xsd:offer:3.0"
      xmlns:terms="urn:cablelabs:md:xsd:terms:3.0"
      xmlns:title="urn:cablelabs:md:xsd:title:3.0"
      xmlns:adb="urn:adb:md:xsd:adb:01"
      xmlns:schemaLocation="urn:adb:md:xsd:adb:01 ADB-EXT-C01.xsd urn:cablelabs:md:xsd:core:3.0 MD-SP-CORE-C01.xsd urn:cablelabs:md:xsd:content:3.0 MD-SP-CONTENT-C01.xsd urn:cablelabs:md:xsd:offer:3.0 MD-SP-OFFER-C01.xsd urn:cablelabs:md:xsd:terms:3.0 MD-SP-TERMS-C01.xsd urn:cablelabs:md:xsd:title:3.0 MD-SP-TITLE-C01.xsd"
      xmlns:xml="http://www.w3.org/XML/1998/namespace">
  <Asset xsi:type="title:TitleType" uriId="ID" providerVersionNum="5"
     internalVersionNum="0" creationDateTime="2020-04-22T00:00:00Z"
     startDateTime="2020-03-24T09:00:00Z" endDateTime="2022-10-06T23:59:00Z">
    <AlternateId identifierSystem="VOD1.1">ID</AlternateId>
    <ProviderQAContact>Contact</ProviderQAContact>
    <Ext>
      <adb:ExtensionType>
        <adb:TitleExt>
          <adb:SeriesInfo episodeNumber="16">
            <adb:series seriesId="106585" seasonCount="2"/>
            <adb:season seasonId="106586" number="1" episodeCount="22"/>
          </adb:SeriesInfo>
        </adb:TitleExt>
      </adb:ExtensionType>
    </Ext>
    <title:LocalizableTitle xml:lang="pol">
      <title:TitleLong>BATWOMAN EP. 16 - THROUGH THE LOOKING GLASS</title:TitleLong>
      <title:SummaryLong> Very long summary...</title:SummaryLong>
      <title:Actor fullName="Ruby Rose" firstName="Ruby" lastName="Rose"/>
      <title:Actor fullName="Rachel Skarsten" firstName="Rachel" lastName="Skarsten"/>
      <title:Actor fullName="Meagan Tandy" firstName="Meagan" lastName="Tandy"/>
      <title:Actor fullName="Camrus Johnson" firstName="Camrus" lastName="Johnson"/>
      <title:Director fullName="Sudz Sutherland" firstName="Sudz" lastName="Sutherland"/>
    </title:LocalizableTitle>
    <title:Rating ratingSystem="PL">12</title:Rating>
    <title:DisplayRunTime>00:40</title:DisplayRunTime>
    <title:Year>2019</title:Year>
    <title:CountryOfOrigin>US</title:CountryOfOrigin>
    <title:Genre>Genre</title:Genre>
    <title:ShowType>Movie</title:ShowType>
  </Asset>
  <Asset xsi:type="offer:CategoryType" uriId="ID">
    <AlternateId identifierSystem="VOD1.1">ID</AlternateId>
    <offer:CategoryPath>Path</offer:CategoryPath>
  </Asset>
  <Asset xsi:type="content:MovieType" uriId="namemp4">
    <AlternateId identifierSystem="VOD1.1">namemp4</AlternateId>
    <content:SourceUrl>name.mp4</content:SourceUrl>
    <content:Resolution>resolution</content:Resolution>
    <content:Duration>PT0H40M40S</content:Duration>
    <content:Language>pol</content:Language>
    <content:SubtitleLanguage>pol</content:SubtitleLanguage>
    <content:SubtitleLanguage>eng</content:SubtitleLanguage>
  </Asset>
  <Asset uriId="ID" xsi:type="content:MovieType">
    <AlternateId identifierSystem="VOD1.1">ID</AlternateId>
    <Provider>Prov</Provider>
    <content:SourceUrl>sub.srt</content:SourceUrl>
  </Asset>
  <Asset uriId="ID" xsi:type="content:MovieType">
    <AlternateId identifierSystem="VOD1.1">ID</AlternateId>
    <Provider>Prov</Provider>
    <content:SourceUrl>sub.srt</content:SourceUrl>
  </Asset>
  <Asset xsi:type="content:PosterType" uriId="ID">
    <AlternateId identifierSystem="VOD1.1">ID</AlternateId>
    <content:SourceUrl>poster.jpg</content:SourceUrl>
    <content:X_Resolution>700</content:X_Resolution>
    <content:Y_Resolution>1000</content:Y_Resolution>
    <content:Language>pol</content:Language>
  </Asset>
  <Asset xsi:type="offer:ContentGroupType" uriId="ID">
    <AlternateId identifierSystem="VOD1.1">ID</AlternateId>
    <offer:TitleRef uriId="ID"/>
    <offer:MovieRef uriId="namets"/>
    <offer:MovieRef uriId="subs"/>
    <offer:MovieRef uriId="subs"/>
  </Asset>
  <Asset xsi:type="offer:ContentGroupType" uriId="ID">
    <AlternateId identifierSystem="VOD1.1">ID</AlternateId>
    <offer:TitleRef uriId="ID"/>
    <offer:MovieRef uriId="poster"/> 
  </Asset>
</ADI3>

Observations about your input document:

  • The document defines the default namespace (xmlns="...") as urn:cablelabs:md:xsd:core:3.0.
  • It defines the same namespace a second time under the prefix "core" (xmlns:core="urn:cablelabs:md:xsd:core:3.0").
  • xmlns:schemaLocation is wrong and should be xsi:schemaLocation.
  • The namespace called "terms" (urn:cablelabs:md:xsd:terms:3.0) is not used at all.
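
These observations can be checked with the same start-ns events the question's code already uses. The following is a minimal sketch, assuming a shortened copy of the document header; HEADER and the helper name prefixes_by_uri are made up for illustration and are not part of the original script.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Shortened, hypothetical copy of the header from the question.
HEADER = b"""<ADI3 xmlns="urn:cablelabs:md:xsd:core:3.0"
      xmlns:core="urn:cablelabs:md:xsd:core:3.0"
      xmlns:terms="urn:cablelabs:md:xsd:terms:3.0"
      xmlns:schemaLocation="urn:adb:md:xsd:adb:01 ADB-EXT-C01.xsd">
  <Asset uriId="ID"/>
</ADI3>"""


def prefixes_by_uri(xml_bytes):
    """Group every declared prefix by the namespace URI it is bound to."""
    by_uri = defaultdict(list)
    # Each start-ns event yields a (prefix, uri) tuple; the default
    # namespace is reported with an empty prefix.
    for _, (prefix, uri) in ET.iterparse(io.BytesIO(xml_bytes), events=["start-ns"]):
        by_uri[uri].append(prefix)
    return dict(by_uri)


declared = prefixes_by_uri(HEADER)
for uri, prefixes in declared.items():
    print(sorted(prefixes), "->", uri)
```

A URI that is listed with more than one prefix was declared twice, and "schemaLocation" showing up here as a prefix confirms that the parser treats it as a namespace declaration rather than as an attribute.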

When you read this document and write it back out, as your code sample does, all of the information is retained.

But there is no guarantee that the output document is a character-by-character copy of the input document. That's not how XML works, and it's an unreasonable expectation. The guarantee that matters is that the output document is semantically equivalent to the input document.

When your code runs, it produces this output (abridged):

<core:ADI3
  xmlns:adb="urn:adb:md:xsd:adb:01"
  xmlns:content="urn:cablelabs:md:xsd:content:3.0"
  xmlns:core="urn:cablelabs:md:xsd:core:3.0" 
  xmlns:offer="urn:cablelabs:md:xsd:offer:3.0"
  xmlns:title="urn:cablelabs:md:xsd:title:3.0" 
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>
  <core:Asset xsi:type="title:TitleType" uriId="ID" providerVersionNum="5" internalVersionNum="0" creationDateTime="2020-04-22T00:00:00Z" startDateTime="2020-03-24T09:00:00Z" endDateTime="2022-10-06T23:59:00Z">
    <core:AlternateId identifierSystem="VOD1.1">ID</core:AlternateId>

    <!-- ... -->

  </core:Asset>
</core:ADI3>

The ADI3 element is still in the urn:cablelabs:md:xsd:core:3.0 namespace, as before. Whether this is expressed through the default namespace or through an explicit prefix is irrelevant. ElementTree knew a prefix for this namespace ("core") and decided to use it. There is nothing wrong with that; it is still the same element.
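
To make the equivalence concrete, here is a small sketch that parses two spellings of the same document, one using the default namespace and one using the "core" prefix, and compares the fully qualified names the parser reports. The two sample strings and the helper name qualified_names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Same content, once with a default namespace and once with a prefix.
WITH_DEFAULT = '<ADI3 xmlns="urn:cablelabs:md:xsd:core:3.0"><Asset uriId="ID"/></ADI3>'
WITH_PREFIX = (
    '<core:ADI3 xmlns:core="urn:cablelabs:md:xsd:core:3.0">'
    '<core:Asset uriId="ID"/></core:ADI3>'
)


def qualified_names(xml_text):
    """Return (tag, attributes) for every element, in document order."""
    return [(el.tag, dict(el.attrib)) for el in ET.fromstring(xml_text).iter()]


# Tags are stored as "{namespace-uri}localname", so the prefix never appears.
same = qualified_names(WITH_DEFAULT) == qualified_names(WITH_PREFIX)
print(same)
print(qualified_names(WITH_DEFAULT)[0][0])
```

Any namespace-aware consumer sees identical element names for both inputs, which is why the choice of prefix in the output is not a loss of information.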

The namespace urn:cablelabs:md:xsd:terms:3.0 ("terms") is missing from the output because it was unused in the input, and keeping unused declarations is pointless.

The same applies to "schemaLocation": because you wrote it as a namespace declaration (xmlns:schemaLocation), ElementTree saw that this "namespace" was unused and stripped it. The correct form is a namespaced attribute (xsi:schemaLocation). Once you correct that error, the item will stay in the output.
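
The behaviour described for "terms" and "schemaLocation" can be reproduced with a short round trip. This is a sketch under assumptions: SOURCE is a hypothetical one-asset document in which schemaLocation has already been corrected to an xsi attribute, and the "core" prefix is registered by hand to mirror the output shown above.

```python
import xml.etree.ElementTree as ET

CORE = "urn:cablelabs:md:xsd:core:3.0"
XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Hypothetical input: "terms" is declared but unused, and schemaLocation
# is written correctly as an attribute in the xsi namespace.
SOURCE = (
    f'<ADI3 xmlns="{CORE}" xmlns:xsi="{XSI}" '
    'xmlns:terms="urn:cablelabs:md:xsd:terms:3.0" '
    f'xsi:schemaLocation="{CORE} MD-SP-CORE-C01.xsd">'
    '<Asset uriId="ID"/></ADI3>'
)

# register_namespace changes a module-wide registry of preferred prefixes.
ET.register_namespace("core", CORE)
ET.register_namespace("xsi", XSI)

root = ET.fromstring(SOURCE)
output = ET.tostring(root, encoding="unicode")
print(output)
```

In the printed result the unused "terms" declaration is gone, while xsi:schemaLocation survives because it is a real attribute of the root element.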

To sum it all up: you don't have a problem. The output document is semantically the same as the input.
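
If you would still like to use lxml, the traversal carries over almost unchanged, since lxml.etree follows the ElementTree API and also accepts the {*} wildcard. The sketch below assumes lxml is installed and works on a trimmed, hypothetical sample rather than your full file. The function name edit_adi is made up, the type is read from the xsi:type attribute directly instead of scanning all attribute values, and content:MovieType stands in for content:PreviewType, which does not occur in the posted document.

```python
from lxml import etree

# Trimmed, hypothetical sample based on the document in the question.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<ADI3 xmlns="urn:cablelabs:md:xsd:core:3.0"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns:content="urn:cablelabs:md:xsd:content:3.0"
      xmlns:title="urn:cablelabs:md:xsd:title:3.0">
  <Asset xsi:type="title:TitleType" uriId="ID">
    <Ext><Note>x</Note></Ext>
    <title:Year>2019</title:Year>
  </Asset>
  <Asset xsi:type="content:MovieType" uriId="namemp4">
    <content:Resolution>resolution</content:Resolution>
  </Asset>
</ADI3>"""

XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"


def edit_adi(xml_bytes):
    """Remove Ext from title assets and overwrite Resolution in movie assets."""
    root = etree.fromstring(xml_bytes)
    for asset in root.findall(".//{*}Asset"):
        asset_type = asset.get(XSI_TYPE)
        if asset_type == "title:TitleType":
            ext = asset.find("{*}Ext")  # direct child only, so remove() is safe
            if ext is not None:
                asset.remove(ext)
        if asset_type == "content:MovieType":
            resolution = asset.find(".//{*}Resolution")
            if resolution is not None:
                resolution.text = "different value"
    return etree.tostring(root, xml_declaration=True, encoding="utf-8")


result = edit_adi(SAMPLE).decode("utf-8")
print(result)
```

lxml keeps the namespace declarations as they were written in the source, so the root element stays in the default namespace instead of being rewritten with a prefix.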
