Pulling out a full node with child nodes using XPath

I'm using XPath to select a section from an HTML page. However, when I use XPath to extract the node, it selects only the text surrounding the HTML tags and not the HTML tags themselves.

Sample HTML

<body>
    <div>
      At first glance you may ask, &#8220;what <i>exactly</i>
      do you mean?&#8221; It means that we want to help <b>you</b> figure...
    </div>
</body>

I have the following XPath

/body/div

I get the following

At first glance you may ask, &#8220;what do you mean?&#8221; It means that we want to help figure...

I want

At first glance you may ask, &#8220;what <i>exactly</i> do you mean?&#8221; It means that we want to help <b>you</b> figure...

Notice that the sample HTML contains <i> and <b> tags in the content. The words within those tags are "lost" when I extract the content.

I'm using SimpleXML in PHP if that makes a difference.
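
A minimal reproduction of the symptom, assuming the node is read by casting the SimpleXML element to a string (the question does not show the actual extraction code, so that cast is a guess). The markup is a shortened stand-in for the sample HTML.

```php
<?php
// Shortened stand-in for the sample HTML above.
$body = simplexml_load_string(
    '<body><div>what <i>exactly</i> do you mean?</div></body>'
);

// Select the <div> with the XPath from the question.
$div = $body->xpath('/body/div')[0];

// Casting to string keeps only the text directly inside <div>;
// the <i> element and the word inside it are dropped.
$text = (string) $div;

var_dump($text);
```

Only the direct text of the div survives the cast, which matches the "lost words" described above.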

Your XPath is fine, though you can remove the final / as that's redundant:

/atom/content

All of the HTML is inside of a <![CDATA[ ]]> section, so in the XML DOM you actually only have text there. The <i> and <b> tags will not be parsed as tags but will just show up as text. Using a CDATA section is exactly the same as if your XML were written like this:

<atom>
    <content>
      At first glance you may ask, &amp;#8220;what &lt;i&gt;exactly&lt;/i&gt;
      do you mean?&amp;#8221; It means that we want to help &lt;b&gt;you&lt;/b&gt; figure...
    </content>
</atom>
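
A small sketch of that point, assuming an <atom>/<content> document like the one this answer describes (the shortened CDATA payload here is made up for illustration). The content element has no child elements at all, only text that happens to contain angle brackets.

```php
<?php
// Hypothetical input modelled on the <atom>/<content> structure above.
$xml = '<atom><content><![CDATA[what <i>exactly</i> do you mean?]]></content></atom>';

$atom = simplexml_load_string($xml);
$content = $atom->xpath('/atom/content')[0];

// The CDATA payload is plain text, so no <i> child element exists.
$childCount = count($content->children());

// The string value still contains the literal "<i>" markup as text.
$text = (string) $content;

var_dump($childCount, $text);
```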

So, it's whatever you're doing with the <content> element afterwards that's dropping those tags. Are you later parsing the text as HTML, or running it through a filter, or something like that?

SimpleXML doesn't handle text nodes mixed with child elements well, so you'll have to use a custom solution instead.

You can use asXML() on each div element and then remove the div tags, or you can convert the div elements to DOMNodes, then loop over $div->childNodes and serialize each child. Note that your HTML entities will most likely be replaced by the actual characters if available.
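
Both approaches could be sketched roughly as follows. This is only an illustration under a few assumptions: the markup is a shortened version of the sample with the entities left out, the regular expression used to strip the outer div is one possible choice, and the variable names are made up.

```php
<?php
$html = '<body><div>what <i>exactly</i> do you mean? We want to help <b>you</b> figure...</div></body>';

$body = simplexml_load_string($html);
$div  = $body->xpath('/body/div')[0];

// Option 1: serialize the whole element, then strip the outer <div> tags.
$viaAsXml = preg_replace('~^<div[^>]*>|</div>$~', '', $div->asXML());

// Option 2: convert to a DOM node and serialize each child in turn.
$node   = dom_import_simplexml($div);
$viaDom = '';
foreach ($node->childNodes as $child) {
    $viaDom .= $node->ownerDocument->saveXML($child);
}

var_dump($viaAsXml, $viaDom);
```

The DOM variant avoids string surgery on the serialized markup, which may be preferable if the div carries attributes or the content itself contains nested div elements.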

Alternatively, you can take a look at the SimpleDOM project and use its innerHTML() method.

// Assumes the SimpleDOM library has already been loaded.
$html =
'<body>
    <div>
      At first glance you may ask, &#8220;what <i>exactly</i>
      do you mean?&#8221; It means that we want to help <b>you</b> figure...
    </div>
</body>';

$body = simpledom_load_string($html);

// Dump the inner markup of every matching <div>, child tags included.
foreach ($body->xpath('/body/div') as $div) {
    var_dump($div->innerHTML());
}

I don't know if SimpleXML is different, but to me it seems you need to make sure you're selecting all node types and not just text. In standard XPath you would do /body/div/node()
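
For illustration, here is a rough sketch of that expression using PHP's DOM extension rather than SimpleXML (swapping in DOMXPath is my assumption, not something the question uses). The markup is a shortened stand-in for the sample HTML.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<body><div>what <i>exactly</i> do you mean?</div></body>');

$xpath = new DOMXPath($doc);

// node() matches every child of <div>: text nodes and elements alike.
$nodes = $xpath->query('/body/div/node()');

// Serialize each child in document order to rebuild the inner markup.
$inner = '';
foreach ($nodes as $node) {
    $inner .= $doc->saveXML($node);
}

var_dump($nodes->length, $inner);
```

Here the query returns three nodes (text, the <i> element, text), so the serialized result keeps the tags.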
