PHP DOMDocument: get text between two sets of tags

Question

Is there a way to use XPath to parse the text between two sets of tags? For example, see the sample:

<div class="par">
  <p class="pp">
    <span class="dv">1 </span>Blah blah blah blah. <span class="dv">2 </span> Yada 
    yada yada yada. <span class="dv">3 </span>Foo foo foo foo.
  </p>
</div>
<div class="par">
  <p class="pp">
    <span class="dv">4 </span>Hmm hmm hmm hmm. 
  </p>
</div>

I would like to get the text between the sets of SPAN tags and parse it into an array like this:

array[0] = "Blah blah blah blah.";
array[1] = "Yada yada yada yada.";
array[2] = "Foo foo foo foo.";
array[3] = "Hmm hmm hmm hmm.";

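Can I do this simply with DOMDocument? If not, what is the best way to achieve it? Note that there may be other tags, such as <a>, in the middle of a sentence. For example: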

...<span class="dv">5 </span>Uhh uhh <a href="www.uhh.com">uhh</a> uhh. <span class="dv">6 </span>...

Answer 1

UPDATE

It seems you really do want a flat list, so I have added this specific example to avoid any confusion:

$html = '<div class="par">
  <p class="pp">
    <span class="dv">1 </span>Blah blah blah blah. <span class="dv">2 </span> Yada 
    yada yada yada. <span class="dv">3 </span>Foo foo foo foo.
  </p>
</div>
<div class="par">
  <p class="pp">
    <span class="dv">4 </span>Hmm hmm hmm hmm. 
  </p>
</div>';

// loadHTML() is an instance method; calling it statically fails on PHP 8
$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);
// select THE TEXT NODES of all p elements with the class pp
// - note that this means exactly class="pp", not that "pp" appears
// anywhere in the class list; you may need to change this depending on your markup
// (post additional questions for specific XPath help)
$found = $finder->query('//p[@class="pp"]/text()');

$nodes = array();
// simply transform the resulting DOMNodeList into an array
// for easier consumption/manipulation
foreach($found as $textNode) {
    $nodes[] = $textNode->nodeValue;
}

print_r($nodes);

This produces:

Array
(
    [0] => 

    [1] => Blah blah blah blah. 
    [2] =>  Yada 
    yada yada yada. 
    [3] => Foo foo foo foo.

    [4] => 

    [5] => Hmm hmm hmm hmm. 

)
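The comment in the code above points out that @class="pp" matches only when the attribute is exactly "pp". If the p elements may carry several classes, one common XPath 1.0 idiom is to match the class as a whole token instead. The following is a minimal sketch of that idiom, not part of the original answer; the sample markup and the variable names $expr and $texts are made up for illustration.

```php
<?php
// Illustrative markup: the first p has two classes, the second has a
// class that merely contains the letters "pp".
$html = '<p class="pp intro"><span class="dv">1 </span>Blah.</p>'
      . '<p class="app"><span class="dv">2 </span>Nope.</p>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);

// Pad the normalized class attribute with spaces so that " pp "
// can only match a complete class token.
$expr = '//p[contains(concat(" ", normalize-space(@class), " "), " pp ")]/text()';

$texts = [];
foreach ($finder->query($expr) as $textNode) {
    $texts[] = $textNode->nodeValue;
}

print_r($texts);
```

Only the text of the first paragraph is selected here, because "app" does not contain "pp" as a separate token.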

If the situation is always this simple, I think you can use XPath to get the contents of the child DOMText nodes within p.pp.

$html = '<div class="par">
  <p class="pp">
    <span class="dv">1 </span>Blah blah blah blah. <span class="dv">2 </span> Yada 
    yada yada yada. <span class="dv">3 </span>Foo foo foo foo.
  </p>
</div>
<div class="par">
  <p class="pp">
    <span class="dv">4 </span>Hmm hmm hmm hmm. 
  </p>
</div>';

// loadHTML() is an instance method; calling it statically fails on PHP 8
$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);
// select all p elements with the class pp - note that this means exactly class="pp",
// not that "pp" appears anywhere in the class list; you may need to change this
// depending on your markup (post additional questions for specific XPath help)
$found = $finder->query('//p[@class="pp"]');

$nodes = array();

foreach($found as $p) {
    // for each p element, pull its text nodes.
    $textNodes = $finder->query('text()', $p);
    $textStr = '';
    // loop over the textNodes and concat them into a single string
    foreach ($textNodes as $n) {
        $textStr .= $n->nodeValue;
    }
    // push the compiled string onto the array
    $nodes[] = $textStr;
}

print_r($nodes);

This yields a result like the following:

Array
(
    [0] => 
    Blah blah blah blah.  Yada 
    yada yada yada. Foo foo foo foo.

    [1] => 
    Hmm hmm hmm hmm. 

)

If you really do want each text node kept separately, just change the loop:

foreach($found as $p) {
    // for each p element, pull its text nodes.
    $textNodes = $finder->query('text()', $p);
    $textArr = array();
    // loop over the textNodes and collect each value as its own entry
    foreach ($textNodes as $n) {
        $textArr[] = $n->nodeValue;
    }
    // push the array of text values onto the result array
    $nodes[] = $textArr;
}

This gives you:

Array
(
    [0] => Array
        (
            [0] => 

            [1] => Blah blah blah blah. 
            [2] =>  Yada 
    yada yada yada. 
            [3] => Foo foo foo foo.

        )

    [1] => Array
        (
            [0] => 

            [1] => Hmm hmm hmm hmm. 

        )

)

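As you can see, it has also picked up the line breaks. If you do not want them, you can easily filter them out with the array filtering method of your choice. Alternatively, you could look into the XPath and DOMDocument settings to adjust this. IIRC there are some settings that control how whitespace is (or is not) interpreted, and they might let you avoid the problem, but they can have other consequences if you do further processing elsewhere with the same DOMDocument instance.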
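As a sketch of the array-filtering route mentioned above, the snippet below collapses whitespace in each entry and then drops the entries that end up empty. It is only one of several possible ways to do this. The $nodes array is hard-coded here to mirror the flat list printed in the first example, and $clean is an illustrative name.

```php
<?php
// Stand-in for the flat list of text node values from the first example.
$nodes = [
    "\n    ",
    "Blah blah blah blah. ",
    " Yada \n    yada yada yada. ",
    "Foo foo foo foo.\n  ",
    "\n    ",
    "Hmm hmm hmm hmm. \n  ",
];

// Collapse every run of whitespace into one space and trim both ends.
$normalized = array_map(function ($value) {
    return trim(preg_replace('~\s+~u', ' ', $value));
}, $nodes);

// Drop entries that are empty after normalization, then reindex from 0.
$clean = array_values(array_filter($normalized, function ($value) {
    return $value !== '';
}));

print_r($clean);
```

The array_values() call is there because array_filter() preserves the original keys, which would otherwise leave gaps in the numbering.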

Answer 2

You want the first text node that is a following sibling of each span element:

//span/following-sibling::text()[1]

In PHP this translates 1:1:

$doc = new DOMDocument();
$doc->loadHTML($buffer, LIBXML_HTML_NOIMPLIED);
$xpath = new DOMXPath($doc);

$expr   = '//span/following-sibling::text()[1]';
$result = $xpath->evaluate($expr);

Then you want to convert the resulting text nodes into an array of strings. While you are at it, I would also apply some whitespace normalization:

$array = array_map(function(DOMText $text) {
    return preg_replace(['~\s+~u', '~^ | $~'], [' ', ''], $text->nodeValue);
}, iterator_to_array($result));

The result is:

[
    "Blah blah blah blah.",
    "Yada yada yada yada.",
    "Foo foo foo foo.",
    "Hmm hmm hmm hmm."
]

Full code example:

<?php
/**
 * http://stackoverflow.com/questions/27674012/php-domdocument-get-text-between-two-sets-of-tags
 */

$buffer = <<<HTML
<div class="par">
  <p class="pp">
    <span class="dv">1 </span>Blah blah blah blah. <span class="dv">2 </span> Yada
    yada yada yada. <span class="dv">3 </span>Foo foo foo foo.
  </p>
</div>
<div class="par">
  <p class="pp">
    <span class="dv">4 </span>Hmm hmm hmm hmm.
  </p>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($buffer, LIBXML_HTML_NOIMPLIED);
$xpath = new DOMXPath($doc);

$expr   = '//span/following-sibling::text()[1]';
$result = $xpath->evaluate($expr);

$array = array_map(function(DOMText $text) {
    return preg_replace(['~\s+~u', '~^ | $~'], [' ', ''], $text->nodeValue);
}, iterator_to_array($result));

echo json_encode($array, JSON_PRETTY_PRINT);
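Note that following-sibling::text()[1] returns only the first text node after each span, so for the <a> case mentioned in the question it would stop at "Uhh uhh ". One possible way to keep the text inside inline tags is to walk the children of each paragraph and start a new segment at every span marker. This is a minimal sketch under that assumption, not part of the original answers; the helper name textBetweenMarkers and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: split each p.pp into segments, where a segment
// starts after a <span class="dv"> marker and runs to the next marker.
function textBetweenMarkers(string $html): array
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $segments = [];
    foreach ($xpath->query('//p[@class="pp"]') as $p) {
        $current = null; // null until the first marker is seen
        foreach ($p->childNodes as $child) {
            $isMarker = $child instanceof DOMElement
                && $child->tagName === 'span'
                && $child->getAttribute('class') === 'dv';
            if ($isMarker) {
                if ($current !== null) {
                    $segments[] = $current;
                }
                $current = '';
            } elseif ($current !== null) {
                // textContent also covers text nested in inline tags like <a>
                $current .= $child->textContent;
            }
        }
        if ($current !== null) {
            $segments[] = $current;
        }
    }

    // Normalize whitespace and drop segments that end up empty.
    $segments = array_map(function ($s) {
        return trim(preg_replace('~\s+~u', ' ', $s));
    }, $segments);

    return array_values(array_filter($segments, function ($s) {
        return $s !== '';
    }));
}

$sample = '<p class="pp"><span class="dv">5 </span>Uhh uhh '
        . '<a href="www.uhh.com">uhh</a> uhh. <span class="dv">6 </span>Next one.</p>';

print_r(textBetweenMarkers($sample));
```

The trade-off is that this loop only looks at direct children of the paragraph, so it assumes the span markers are never nested inside other elements.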


Answer 1: score 3, accepted, 2014-12-28 06:16:35
Answer 2: score 1, 2015-01-01 19:06:10
