Two different XML namespaces with the same URL

I am trying to do some data cleaning using the xml.etree.ElementTree library in Python.

My xml input files look like this:

<?xml version="1.0" encoding="UTF-8"?>
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.loc.gov/mods/v3" version="3.5" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-5.xsd">
  <mods:titleInfo>
    <mods:title>1971, Human Events</mods:title>
  </mods:titleInfo>
  <mods:name type="personal" authority="naf" valueURI="https://lccn.loc.gov/n88172648">
    <mods:namePart>Kellems, Vivien, 1896-1975</mods:namePart>
    <mods:role>
      <mods:roleTerm authority="marcrelator" authorityURI="http://id.loc.gov/vocabulary/relators" valueURI="http://id.loc.gov/vocabulary/relators/col" type="text">Collector</mods:roleTerm>
    </mods:role>
  </mods:name>
  <mods:typeOfResource>text</mods:typeOfResource>
  <mods:genre authority="aat" valueURI="300111999">publications (documents)</mods:genre>
  <mods:originInfo>
    <mods:dateIssued encoding="w3cdtf" keyDate="yes">1971</mods:dateIssued>
  </mods:originInfo>
  <mods:physicalDescription>
    <mods:digitalOrigin>reformatted digital</mods:digitalOrigin>
    <mods:internetMediaType>image/jp2</mods:internetMediaType>
  </mods:physicalDescription>
  <mods:note type="ownership">Archives &amp; Special Collections at the Thomas J. Dodd Research Center, University of Connecticut Library</mods:note>
  <mods:identifier type="local">1992-0033/SeriesIII:Activism/SubseriesA:PoliticalCampaigns/Box138:6</mods:identifier>
  <mods:identifier type="local">MSS 1992.0033</mods:identifier>
  <mods:identifier type="local">39153030468468</mods:identifier>
  <mods:accessCondition type="use and reproduction">In Copyright</mods:accessCondition>
  <mods:recordInfo>
    <mods:recordContentSource>University of Connecticut Library</mods:recordContentSource>
    <mods:recordCreationDate encoding="w3cdtf">2018-07-09-04:00</mods:recordCreationDate>
    <mods:languageOfCataloging>
      <mods:languageTerm authority="iso639-2b" type="code">eng</mods:languageTerm>
    </mods:languageOfCataloging>
  </mods:recordInfo>
  <mods:note type="source note">Vivien Kellems Papers</mods:note>
  <mods:note type="source identifier">MSS 1992.0033</mods:note>
  <identifier type="hdl">http://hdl.handle.net/11134/20002:860633493</identifier>
</mods:mods>

All I have to do is change the identifier tag at the end so that it has the same "mods" prefix as the rest of the tags, and add a specific xlink attribute to the accessCondition tag. I have successfully done both of those things. But after I write these modifications back to the file and try to parse it again with ElementTree, I get the following error:

xml.etree.ElementTree.ParseError: unbound prefix: line 25, column 2

I think this is a namespace issue, because the "xmlns:mods" namespace and the "xmlns" namespace have the same URL. I register the namespaces like so:

import xml.etree.ElementTree as ET

ET.register_namespace('', "http://www.loc.gov/mods/v3")
ET.register_namespace('mods', "http://www.loc.gov/mods/v3")
ET.register_namespace('xlink', "http://www.w3.org/1999/xlink")
ET.register_namespace('xsi', "http://www.w3.org/2001/XMLSchema-instance")

Doing this also removes one of the namespace declarations when I write back to the XML file. The root element then looks like this:

<mods:mods xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="3.5" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-5.xsd">

Namely, the "xmlns" declaration is gone and only the "xmlns:mods" declaration remains. Again, I think this is because they have the same URL. Is there any way to fix this? Any help would be appreciated.

http://www.loc.gov/mods/v3 is the namespace. mods is nothing but an abbreviation (aka the "prefix"). You can have as many different abbreviations for the same namespace in your XML document as you want.

For example:

<something xmlns="http://www.loc.gov/mods/v3">
  <mods:something_else xmlns:mods="http://www.loc.gov/mods/v3" />
  <blah:another_thing xmlns:blah="http://www.loc.gov/mods/v3" />
  <last_thing />
</something>

and

<mods:something xmlns:mods="http://www.loc.gov/mods/v3" xmlns:blah="http://www.loc.gov/mods/v3">
  <something_else xmlns="http://www.loc.gov/mods/v3" />
  <mods:another_thing />
  <blah:last_thing />
</mods:something>

and any number of other combinations represent exactly the same document.
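As a quick check of that claim, here is a minimal sketch that parses the two examples above with xml.etree.ElementTree and compares the namespace-qualified tag names. The helper name qualified_tags is made up for this illustration.

```python
import xml.etree.ElementTree as ET

MODS = "http://www.loc.gov/mods/v3"

doc_a = """<something xmlns="http://www.loc.gov/mods/v3">
  <mods:something_else xmlns:mods="http://www.loc.gov/mods/v3" />
  <blah:another_thing xmlns:blah="http://www.loc.gov/mods/v3" />
  <last_thing />
</something>"""

doc_b = """<mods:something xmlns:mods="http://www.loc.gov/mods/v3" xmlns:blah="http://www.loc.gov/mods/v3">
  <something_else xmlns="http://www.loc.gov/mods/v3" />
  <mods:another_thing />
  <blah:last_thing />
</mods:something>"""


def qualified_tags(xml_text):
    """Return every element tag in document order, in {namespace}local form."""
    return [el.tag for el in ET.fromstring(xml_text).iter()]


tags_a = qualified_tags(doc_a)
tags_b = qualified_tags(doc_b)

# The parser resolves prefixes to namespace URIs, so the prefixes themselves vanish.
print(tags_a[0])         # {http://www.loc.gov/mods/v3}something
print(tags_a == tags_b)  # True
```

Once parsed, ElementTree only stores the namespace URI and the local name of each tag, which is why the choice of prefix does not survive a round trip.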

When those are parsed and then serialized again, all those namespace declarations could be retained exactly as they are, or they could be folded into a single one, or the prefixes could be renamed to ns0, or everything could be turned into a default namespace. It does not matter: the result depends entirely on how the XML library is implemented.
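For ElementTree specifically, the following sketch shows what happens with the registrations from the question. It uses a trimmed-down record, not the full file. ET.register_namespace keeps only one prefix per namespace URI, so the second call for the same URI replaces the first, and the serializer then emits a single declaration.

```python
import xml.etree.ElementTree as ET

MODS = "http://www.loc.gov/mods/v3"

# Trimmed-down version of the record from the question.
source = (
    '<mods:mods xmlns:mods="http://www.loc.gov/mods/v3" '
    'xmlns="http://www.loc.gov/mods/v3">'
    '<mods:title>1971, Human Events</mods:title>'
    '<identifier type="hdl">http://hdl.handle.net/11134/20002:860633493</identifier>'
    '</mods:mods>'
)
root = ET.fromstring(source)

# Registering a URI removes any earlier mapping for that URI,
# so only the "mods" prefix is left after these two calls.
ET.register_namespace("", MODS)
ET.register_namespace("mods", MODS)

output = ET.tostring(root, encoding="unicode")
print(output)
```

The output declares xmlns:mods once on the root element, and the previously unprefixed identifier element is written as mods:identifier, because both forms name the same element in the same namespace.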

As long as every element in the resulting document is in the http://www.loc.gov/mods/v3 namespace, it's the same document by any relevant metric:

<something xmlns="http://www.loc.gov/mods/v3">
  <something_else />
  <another_thing  />
  <last_thing />
</something>

In other words, there is nothing broken, so nothing needs to be fixed.
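If the goal is just the two edits described in the question, a minimal sketch along these lines may be enough. The question does not show its modification code, so this is only one possible approach. The attribute name xlink:href and the URL http://example.org/rights are placeholders chosen for illustration, and the record is trimmed down.

```python
import xml.etree.ElementTree as ET

MODS = "http://www.loc.gov/mods/v3"
XLINK = "http://www.w3.org/1999/xlink"

# Preferred prefixes for serialization (one prefix per URI).
ET.register_namespace("mods", MODS)
ET.register_namespace("xlink", XLINK)

source = (
    '<mods:mods xmlns:mods="http://www.loc.gov/mods/v3" '
    'xmlns:xlink="http://www.w3.org/1999/xlink" '
    'xmlns="http://www.loc.gov/mods/v3">'
    '<mods:accessCondition type="use and reproduction">In Copyright</mods:accessCondition>'
    '<identifier type="hdl">http://hdl.handle.net/11134/20002:860633493</identifier>'
    '</mods:mods>'
)
root = ET.fromstring(source)

# Look the element up by its namespace-qualified name, not by its prefix.
access = root.find(f"{{{MODS}}}accessCondition")

# Assumption: the attribute to add is xlink:href, and the value is a placeholder.
# Using the {namespace}local form lets the serializer declare xmlns:xlink itself.
access.set(f"{{{XLINK}}}href", "http://example.org/rights")

output = ET.tostring(root, encoding="unicode")

# The serialized text parses again without errors.
reparsed = ET.fromstring(output)
print(output)
```

No explicit rename of the identifier element is needed here: it is already in the MODS namespace, so it is written with the mods prefix on output.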
