Ignore namespace with XPath in PHP

I want to pull some tags from an XML file. The XML file might look like this:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="de">
[... some more tags ...]
  <page>
    <title>Title 1</title>
    [... some more tags ...]
  </page>
  <page>
    <title>Title 2</title>
    [... some more tags ...]
  </page>
</mediawiki>

When I use https://www.freeformatter.com/xpath-tester.html to pull "//title" everything works and I receive the two titles.

But when I use the following PHP:

$xml = simplexml_load_file('articles.xml');
$result = $xml->xpath('//title');
var_dump($result);

the resulting array is empty.

I already checked many of the similar questions and found that it works if I call registerXPathNamespace with the document's namespace URI. However, the XML files I am reading come from several external sources running different software (the above is only one possible example), and they might change at any time. So every time I open an XML file I would need to read out the URI and pass it to registerXPathNamespace. Another option would be to strip the xmlns attribute from the XML. Both options seem pretty complicated if all I want to do is extract the "title" (and some other) tags no matter what the namespace is.

Is there a simple way to tell XPath to ignore the namespace? (And if there is no way to ignore it: what would be the simplest and most durable solution to avoid the problem of changing URIs?)

Up to now I have been using the hard-coded

foreach ($xml->page as $page) {
  $title = $page->title;
  //[... do something ...]
}

which works. But I thought xpath would be handy (more flexible, not hard coded, more durable) and wanted to give it a try.

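You can wildcard the namespace, e.g. //*:title. Note that this is XPath 2.0 syntax. PHP's SimpleXML and DOMXPath are built on libxml2, which implements XPath 1.0 only, so in PHP the equivalent is //*[local-name()='title'].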
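
A minimal sketch of a namespace-agnostic query, written with the XPath 1.0 local-name() function because PHP's libxml-based XPath engine does not accept the XPath 2.0 form *:title. The inline sample document and the variable names $source and $titles are assumptions for illustration, not part of the question's code.

```php
<?php
// Sample document modelled on the question's export (illustrative assumption).
$source = <<<XML
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" version="0.10" xml:lang="de">
  <page>
    <title>Title 1</title>
  </page>
  <page>
    <title>Title 2</title>
  </page>
</mediawiki>
XML;

$xml = simplexml_load_string($source);

// local-name() compares only the part of the element name after any prefix,
// so the match succeeds whatever namespace URI the document declares.
$result = $xml->xpath("//*[local-name()='title']");

// Cast each SimpleXMLElement to its text content.
$titles = array_map('strval', $result);
print_r($titles);
```

The trade-off is that this also matches a title element from any other namespace in the same document, which may or may not be what you want.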

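You can fetch the namespaces from the document and then register the default one under a prefix of your choice. It's a bit of a pain: getDocNamespaces() returns the default namespace under an empty-string key, and an empty prefix cannot be used in an XPath expression, so the URI has to be registered under a made-up prefix (here def). Taking the first value of the array would also work for the document above, but it picks the wrong URI whenever a prefixed namespace is declared first, so it is safer to read the '' key directly.

So the code is something like:

$xml = simplexml_load_file('articles.xml');
// Namespaces declared on the root element, keyed by prefix; the default namespace has the key ''.
$ns = $xml->getDocNamespaces();
// Register the default namespace under the arbitrary prefix 'def'.
$xml->registerXPathNamespace('def', $ns['']);
$result = $xml->xpath('//def:title');
var_dump($result);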
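
The same idea can be sketched with DOMDocument and DOMXPath, reading the namespace URI from the root element instead of from the list of declarations. This is only an illustration under two assumptions: that the elements you want live in the same namespace as the root element, and that the inline sample document and the names $source, $rootNs and $titles (none of which come from the answer above) stand in for your real input.

```php
<?php
// Sample document modelled on the question's export (illustrative assumption).
$source = <<<XML
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" version="0.10" xml:lang="de">
  <page>
    <title>Title 1</title>
  </page>
  <page>
    <title>Title 2</title>
  </page>
</mediawiki>
XML;

$doc = new DOMDocument();
$doc->loadXML($source);
$xpath = new DOMXPath($doc);

// namespaceURI is null when the root element is in no namespace.
$rootNs = $doc->documentElement->namespaceURI;
if ($rootNs !== null && $rootNs !== '') {
    // Bind whatever URI the document uses to the arbitrary prefix 'def'.
    $xpath->registerNamespace('def', $rootNs);
    $query = '//def:title';
} else {
    // No namespace at all: the unprefixed query works as-is.
    $query = '//title';
}

$titles = [];
foreach ($xpath->query($query) as $node) {
    $titles[] = $node->textContent;
}
print_r($titles);
```

Because the URI is read from the document at run time, nothing needs to change when a source switches to a new export version.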

Though the accepted solution of registering a default namespace works, it also requires that I clutter up my XPath queries with a prefix for seemingly no reason. In my particular case, and I suspect many others, it's more helpful to remove the namespace from the document completely. Unfortunately, there doesn't appear to be any way to do this using DOM tools in PHP, so I had to resort to a regex. And let me say, I really hate doing this, since I am one of those people who repeatedly chastises others for manipulating XML and HTML with regex.

Anyway, here's what worked for me:

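$xml = file_get_contents('my_document.xml');
// Drop every double-quoted xmlns, xmlns:* and xsi:* attribute so that elements end up in no namespace.
$xml = preg_replace('/(xmlns|xsi)[^=]*="[^"]*" ?/i', '', $xml);
$doc = simplexml_load_string($xml);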

And voilà, now you can query xpath as desired, without the namespace prefix:

$result = $doc->xpath('//title');

Depending on your document, this may be a really bad idea, especially if there are namespace prefixes on your elements, but in many basic cases it will work just fine.
