Ignore namespace with XPath in PHP

I want to pull some tags from an XML file. The XML file might look like this:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="de">
[... some more tags ...]
  <page>
    <title>Title 1</title>
    [... some more tags ...]
  </page>
  <page>
    <title>Title 2</title>
    [... some more tags ...]
  </page>
</mediawiki>

When I use https://www.freeformatter.com/xpath-tester.html to pull "//title", everything works and I receive the two titles.

But when I use the following PHP:

$xml = simplexml_load_file('articles.xml');
$result = $xml->xpath('//title');
var_dump($result);

the resulting array is empty.

I already checked many of the similar questions and found that it would work if I called registerXPathNamespace with the same URL. However, the XML files I am reading come from several external sources running different software (the above is only one possible example), and they might change at any time. So every time I open an XML file I would need to read out the URL and pass it to registerXPathNamespace. Another option would be to strip the xmlns from the XML. Both options seem pretty complicated if all I want to do is extract the "title" (and some other) tags no matter what the namespace is.

Is there a simple way to tell XPath to ignore the namespace? (And if there is no way to ignore it: what would be the simplest and most durable solution to avoid the problem of changing URLs?)

Up to now I have been using the hard-coded

foreach ($xml->page as $page) {
  $title = $page->title;
  //[... do something ...]
}

which works. But I thought XPath would be handy (more flexible, not hard-coded, more durable) and wanted to give it a try.

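You can wildcard the namespace, e.g. //*:title. Note that the *:name form is XPath 2.0 syntax; PHP's SimpleXML and DOMXPath are built on libxml2, which implements XPath 1.0, so in PHP the equivalent test is //*[local-name()='title'].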
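As a rough illustration of the namespace-agnostic match in PHP, here is a minimal sketch using local-name(). The inline sample document is a shortened stand-in for the file in the question, and the variable names $source and $titles are only illustrative.

```php
<?php
// Shortened stand-in for the export file from the question (assumption:
// the real file has the same default-namespace structure).
$source = <<<XML
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" version="0.10">
  <page><title>Title 1</title></page>
  <page><title>Title 2</title></page>
</mediawiki>
XML;

$xml = simplexml_load_string($source);

// local-name() compares only the element name without prefix or
// namespace URI, so no namespace has to be registered.
$result = $xml->xpath("//*[local-name()='title']");

// Convert the matched elements to plain strings.
$titles = array_map('strval', $result);
print_r($titles);
```

The trade-off is that this matches title elements from any namespace, which is what the question asks for but may be too broad for documents that mix vocabularies.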

You can fetch the namespaces from the document and then register the default one under a prefix of your choice. It's a bit of a pain because the default namespace is returned under an empty-string key, which is why the code fudges it by taking the first value from the array and using that.

So the code is something like:

$xml = simplexml_load_file('articles.xml');
$ns = $xml->getDocNamespaces();
$xml->registerXPathNamespace('def', array_values($ns)[0]);
$result = $xml->xpath('//def:title');
var_dump($result);
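Taking the first array value assumes the default namespace is the first one reported. A slightly more defensive sketch, assuming the default namespace is declared on the root element, looks it up by its empty-string key and falls back to an unprefixed query when the document has no default namespace. The sample document and the $titles variable are illustrative only.

```php
<?php
// Stand-in document; note that a prefixed namespace is declared before
// the default one here.
$source = <<<XML
<mediawiki xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.mediawiki.org/xml/export-0.10/" version="0.10">
  <page><title>Title 1</title></page>
  <page><title>Title 2</title></page>
</mediawiki>
XML;

$xml = simplexml_load_string($source);

// Namespaces declared on the root element, keyed by prefix;
// the default namespace uses the empty string as its key.
$ns = $xml->getDocNamespaces();

if (isset($ns[''])) {
    // Bind the default namespace to the arbitrary prefix "def".
    $xml->registerXPathNamespace('def', $ns['']);
    $result = $xml->xpath('//def:title');
} else {
    // No default namespace: the plain query already matches.
    $result = $xml->xpath('//title');
}

$titles = array_map('strval', $result);
print_r($titles);
```

Because the namespace URI is read from each document at load time, nothing has to be hard-coded when the export format version changes.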

Though the chosen solution of registering a default namespace works, it also requires that I clutter up my XPath queries with a prefix for seemingly no reason. In my particular case, and I suspect many others, it's more helpful to completely remove the namespace from the document. Unfortunately, there doesn't appear to be any way to do this using the DOM tools in PHP, so I had to resort to a regex. And let me say, I really hate doing this, since I am one of those people who repeatedly chastise others for manipulating XML and HTML with regex.

Anyway, here's what worked for me:

$xml = file_get_contents('my_document.xml');
$xml = preg_replace('/(xmlns|xsi)[^=]*="[^"]*" ?/i', '', $xml);
$doc = simplexml_load_string($xml);

And voilà, now you can query XPath as desired, without the namespace prefix:

$result = $doc->xpath('//title');

Depending on your document, this may be a really bad idea, especially if there are namespace prefixes on your elements (stripping the declarations leaves those prefixes undeclared), but in many basic cases it will work just fine.
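One way to soften that caveat, offered only as a sketch, is to strip just the unprefixed xmlns="..." declaration and leave every xmlns:prefix="..." declaration in place, so that prefixed elements and attributes stay valid. The regex below is my own variation rather than the one from the answer, and it is still plain text replacement, so it would also alter a matching string that appears in element content or CDATA.

```php
<?php
// Stand-in for the file contents; the real code would use file_get_contents().
$source = <<<XML
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="de">
  <page><title>Title 1</title></page>
  <page><title>Title 2</title></page>
</mediawiki>
XML;

// Match only xmlns="..." or xmlns='...' with no prefix; "xmlns:xsi" is
// skipped because a colon follows "xmlns" instead of "=".
$stripped = preg_replace('/\sxmlns\s*=\s*("[^"]*"|\'[^\']*\')/', '', $source);

$doc = simplexml_load_string($stripped);

// With the default namespace gone, the unprefixed query matches.
$result = $doc->xpath('//title');
$titles = array_map('strval', $result);
print_r($titles);
```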

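Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please credit this site's URL or the original address. For any questions, contact yoyou2525@163.com.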

 