PHP - Regex/function for traversing DOM nodes to get a specific tag

Question

I'm using Goutte to crawl a URL with PHP.

I want to save the list <ul>...</ul> that comes just after this tag: Maladies fréquentes :

The DOM has this structure:

<p>....</p>
<p>....</p>
<p>....</p>
<p>....</p>
...
<h2>...</h2>
...
<ul>...</ul>
...
<p><strong>Maladies fréquentes :</strong></p>
<ul>
<li>Text I need</li>
<li>Text I need</li>
</ul>
...
<p></p>
<p></p>
...

Currently, I save to my DB using :first-of-type

$crawler->filter('.desc ul:first-of-type li')->each(function ($node) use (&$out) {

   $li = array();

   if ($node->count() > 0) {
        $li[] = str_replace('"', "'", trim($node->filter('li')->text()));
   }

   // Insert into DB

});

When the content contains 2 or 3 <ul>...</ul>, it always saves the wrong li because all ul are selected.

How can I select only the <ul> after Maladies fréquentes : ?

Thanks!

Answer 1

Don't know much about Goutte, but I believe you can load the crawler's HTML into DOMDocument and then parse it with XPath. Something like:

$doc = new DOMDocument();
// loadHTML() needs a string; Crawler::html() returns the markup of the first node
$doc->loadHTML($crawler->html());
$xpath = new DOMXPath($doc);
// First <ul> sibling following any <p> that contains a <strong>
$targets = $xpath->query('//p[strong]/following-sibling::ul[1]//li');
// Or, more specific: $targets = $xpath->query('//p[contains(strong,"Maladies")]/following-sibling::ul[1]//li');
foreach ($targets as $source) {
    echo $source->nodeValue . "\r\n";
}

The output should be

Text I need
Text I need
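
To try the XPath approach without Goutte, here is a minimal standalone sketch. The sample markup in $html is hypothetical and only mirrors the structure from the question (the list items are numbered so they can be told apart). It matches on the ASCII prefix "Maladies" so the result does not depend on how loadHTML() decodes the accented characters.

```php
<?php
// Hypothetical sample markup mirroring the structure described in the question.
$html = <<<'HTML'
<div class="desc">
<p>Intro</p>
<ul><li>Unrelated item</li></ul>
<p><strong>Maladies fréquentes :</strong></p>
<ul>
<li>Text I need 1</li>
<li>Text I need 2</li>
</ul>
<p></p>
<ul><li>Another unrelated item</li></ul>
</div>
HTML;

// Collect libxml parse warnings internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Find the <p> whose <strong> contains "Maladies", then take only the
// nearest following <ul> sibling and its direct <li> children.
$nodes = $xpath->query('//p[contains(strong, "Maladies")]/following-sibling::ul[1]/li');

$items = [];
foreach ($nodes as $li) {
    $items[] = trim($li->textContent);
}

print_r($items);
```

Here $items ends up holding only the two list items that follow the "Maladies fréquentes :" paragraph; the other lists are ignored. With Goutte, the same expression can probably be passed straight to $crawler->filterXPath(...), which would avoid parsing the page a second time.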
