PHP DOMDocument get text between two SETS of tags
Is there a way to use XPath to parse the text between two sets of tags? For example, see this sample:
<div class="par">
<p class="pp">
<span class="dv">1 </span>Blah blah blah blah. <span class="dv">2 </span> Yada
yada yada yada. <span class="dv">3 </span>Foo foo foo foo.
</p>
</div>
<div class="par">
<p class="pp">
<span class="dv">4 </span>Hmm hmm hmm hmm.
</p>
</div>
I would like to parse this into an array like the following by getting the text between the sets of SPAN tags:
array[0] = "Blah blah blah blah.";
array[1] = "Yada yada yada yada.";
array[2] = "Foo foo foo foo.";
array[3] = "Hmm hmm hmm hmm.";
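Can I do this simply with DOMDocument? If not, what is the best way to achieve it? Note that there may be other tags, such as <a>, in the middle of a sentence. For example: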
...<span class="dv">5 </span>Uhh uhh <a href="www.uhh.com">uhh</a> uhh. <span class="dv">6 </span>...
UPDATE
It seems you really do want a flat list, so I have added this specific example to avoid any confusion:
$html = '<div class="par">
<p class="pp">
<span class="dv">1 </span>Blah blah blah blah. <span class="dv">2 </span> Yada
yada yada yada. <span class="dv">3 </span>Foo foo foo foo.
</p>
</div>
<div class="par">
<p class="pp">
<span class="dv">4 </span>Hmm hmm hmm hmm.
</p>
</div>';
// loadHTML() must be called on an instance; the static call fails on PHP 8
$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);
// select THE TEXT NODES of all p elements with the class pp
// - note that this means explicitly class="pp",
// not that "pp" is anywhere in the class list; you may need to change this depending on your markup...
// post additional questions for specific xpath help
$found = $finder->query('//p[@class="pp"]/text()');
$nodes = array();
// simply transform the resulting DOMNodeList into an array
// for easier consumption/manipulation
foreach($found as $textNode) {
    $nodes[] = $textNode->nodeValue;
}
print_r($nodes);
Produces:
Array
(
[0] =>
[1] => Blah blah blah blah.
[2] => Yada
yada yada yada.
[3] => Foo foo foo foo.
[4] =>
[5] => Hmm hmm hmm hmm.
)
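The comments above point out that `@class="pp"` only matches when the attribute is exactly `pp`. If the element may carry several classes, a common XPath 1.0 idiom is to pad the class list with spaces and search for the padded token. The following is a minimal sketch of that idea; the sample markup and the variable names `$exact` and `$token` are assumptions made up for illustration, not part of the original example.

```php
<?php
// Illustrative markup (assumed): the second <p> has two classes, and the
// third has a class that merely contains the substring "pp".
$html = '<p class="pp">one</p><p class="pp lead">two</p><p class="apple">three</p>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);

// Exact attribute match: only class="pp" qualifies.
$exact = $finder->query('//p[@class="pp"]');

// Token match: "pp" may appear anywhere in the whitespace-separated class list.
$token = $finder->query(
    '//p[contains(concat(" ", normalize-space(@class), " "), " pp ")]'
);

echo $exact->length, "\n"; // number of exact matches
echo $token->length, "\n"; // number of token matches
```

Padding with spaces keeps `class="apple"` from matching, which a plain `contains(@class, "pp")` would not.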
If things are always this simple, I think you can use XPath to get the contents of the child DOMText nodes of p.pp.
$html = '<div class="par">
<p class="pp">
<span class="dv">1 </span>Blah blah blah blah. <span class="dv">2 </span> Yada
yada yada yada. <span class="dv">3 </span>Foo foo foo foo.
</p>
</div>
<div class="par">
<p class="pp">
<span class="dv">4 </span>Hmm hmm hmm hmm.
</p>
</div>';
// loadHTML() must be called on an instance; the static call fails on PHP 8
$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);
// select all p elements with the class pp - note that this means explicitly class="pp",
// not that "pp" is anywhere in the class list; you may need to change this depending on your markup...
// post additional questions for specific xpath help
$found = $finder->query('//p[@class="pp"]');
$nodes = array();
foreach($found as $p) {
// for each p element, pull its text nodes.
$textNodes = $finder->query('text()', $p);
$textStr = '';
// loop over the textNodes and concat them into a single string
foreach ($textNodes as $n) {
$textStr .= $n->nodeValue;
}
// push the compiled string onto the array
$nodes[] = $textStr;
}
print_r($nodes);
This produces the following result:
Array
(
[0] =>
Blah blah blah blah. Yada
yada yada yada. Foo foo foo foo.
[1] =>
Hmm hmm hmm hmm.
)
If you really do want each text node separately, just change the loop:
foreach($found as $p) {
// for each p element, pull its text nodes.
$textNodes = $finder->query('text()', $p);
$textArr = array();
// loop over the textNodes and collect each value as a separate array entry
foreach ($textNodes as $n) {
$textArr[] = $n->nodeValue;
}
// push the array of text values onto the result array
$nodes[] = $textArr;
}
This gives you:
Array
(
[0] => Array
(
[0] =>
[1] => Blah blah blah blah.
[2] => Yada
yada yada yada.
[3] => Foo foo foo foo.
)
[1] => Array
(
[0] =>
[1] => Hmm hmm hmm hmm.
)
)
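As you can see, this also picks up the line breaks. If you don't want them, you can easily filter them out with the array filtering method of your choice. Alternatively, you could look into the XPath and DOMDocument settings to adjust this. IIRC there are settings that control how whitespace is (or is not) interpreted, and these might let you avoid the problem, but they may have other consequences if you do further processing elsewhere with the same DOMDocument instance.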
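One possible way to do the filtering mentioned above is to normalize whitespace in every value and then drop the entries that end up empty. This is only a sketch: the `$nodes` array below is typed in by hand to resemble the output shown earlier (its exact whitespace is an assumption), and `$clean` is a name chosen for illustration.

```php
<?php
// Text node values resembling the earlier print_r() output (whitespace assumed).
$nodes = [
    "\n",
    "Blah blah blah blah. ",
    " Yada\nyada yada yada. ",
    "Foo foo foo foo.\n",
    "\n",
    "Hmm hmm hmm hmm.\n",
];

// Collapse every run of whitespace (including newlines) to one space, then trim.
$clean = array_map(function ($text) {
    return trim(preg_replace('~\s+~u', ' ', $text));
}, $nodes);

// Drop entries that were whitespace only and reindex from 0.
$clean = array_values(array_filter($clean, function ($text) {
    return $text !== '';
}));

print_r($clean);
```

Doing the cleanup on the plain PHP array leaves the DOMDocument itself untouched, so other code using the same instance is not affected.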
You want the first text node that is a following sibling of a span element:
//span/following-sibling::text()[1]
This translates 1:1 into PHP:
$doc = new DOMDocument();
$doc->loadHTML($buffer, LIBXML_HTML_NOIMPLIED);
$xpath = new DOMXPath($doc);
$expr = '//span/following-sibling::text()[1]';
$result = $xpath->evaluate($expr);
Then you want to convert the resulting text nodes into an array of strings. While you are at it, I would also normalize the whitespace a little:
$array = array_map(function(DOMText $text) {
return preg_replace(['~\s+~u', '~^ | $~'], [' ', ''], $text->nodeValue);
}, iterator_to_array($result));
The result is:
[
"Blah blah blah blah.",
"Yada yada yada yada.",
"Foo foo foo foo.",
"Hmm hmm hmm hmm."
]
Full code example:
<?php
/**
* http://stackoverflow.com/questions/27674012/php-domdocument-get-text-between-two-sets-of-tags
*/
$buffer = <<<HTML
<div class="par">
<p class="pp">
<span class="dv">1 </span>Blah blah blah blah. <span class="dv">2 </span> Yada
yada yada yada. <span class="dv">3 </span>Foo foo foo foo.
</p>
</div>
<div class="par">
<p class="pp">
<span class="dv">4 </span>Hmm hmm hmm hmm.
</p>
</div>
HTML;
$doc = new DOMDocument();
$doc->loadHTML($buffer, LIBXML_HTML_NOIMPLIED);
$xpath = new DOMXPath($doc);
$expr = '//span/following-sibling::text()[1]';
$result = $xpath->evaluate($expr);
$array = array_map(function(DOMText $text) {
return preg_replace(['~\s+~u', '~^ | $~'], [' ', ''], $text->nodeValue);
}, iterator_to_array($result));
echo json_encode($array, JSON_PRETTY_PRINT);
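The question notes that a sentence may contain inline tags such as <a>. In that case the first following text node covers only the part of the sentence before the inline tag. One way to handle this, sketched below under assumptions, is to walk the child nodes of each p element and start a new segment at every span marker, appending the text content of everything in between. The function name `segmentsBetweenMarkers` and the sample markup are hypothetical and not taken from either answer.

```php
<?php
// Hypothetical helper: returns the text found after each <span class="dv">
// marker, up to the next marker, including text inside inline tags.
function segmentsBetweenMarkers(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $segments = [];
    foreach ($xpath->query('//p[@class="pp"]') as $p) {
        $current = null; // null until the first marker has been seen
        foreach ($p->childNodes as $child) {
            $isMarker = $child instanceof DOMElement
                && $child->tagName === 'span'
                && $child->getAttribute('class') === 'dv';
            if ($isMarker) {
                // Close the previous segment and start a new one.
                if ($current !== null) {
                    $segments[] = $current;
                }
                $current = '';
            } elseif ($current !== null) {
                // Text nodes and inline elements both contribute their text.
                $current .= $child->textContent;
            }
        }
        if ($current !== null) {
            $segments[] = $current;
        }
    }
    // Same kind of whitespace normalization as in the answer above.
    return array_map(function ($s) {
        return trim(preg_replace('~\s+~u', ' ', $s));
    }, $segments);
}

// Sample markup assumed for illustration, modeled on the question's snippet.
$sample = '<p class="pp"><span class="dv">5 </span>Uhh uhh '
    . '<a href="www.uhh.com">uhh</a> uhh. <span class="dv">6 </span>Ahh ahh.</p>';
echo json_encode(segmentsBetweenMarkers($sample), JSON_PRETTY_PRINT);
```

This sketch only looks at direct children of the p element, so markers nested deeper inside other tags would need a different traversal.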