
How to exclude scraping results depending on children with a specific class using Simple HTML DOM and cURL?

I am scraping a website for specific links, which I save to my $url_results array. However, I want to skip a link if its li cluster (class list-items__item) contains a child->child->child span with the class list-items__item__notice.

The cluster I am scraping:

<li class="list-items__item">
    <a href="" data-lpurl=""> <!--The href I am scraping-->
        <span class="list-items__item__position"></span>
        <div class="list-items__item__title">
            <span class="list-items__item__notice"> <!--I don't want to add to my array if this span is present-->
            </span>
        </div>
    </a>
</li>

My PHP scraping function:

$items = $html->find('li[class=list-items__item]');  
foreach($items as $post) {
    $url_results[] = $url . ($post->children(0)->href);
}

I am using Simple HTML DOM and cURL to scrape.

I solved the problem by adding an if statement that checks whether the plaintext of the nested span is empty. If it is, the href is added to my array; otherwise the item is skipped:

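foreach ($items as $post) {
    // li -> a -> div.list-items__item__title -> its first child (the notice span, if present)
    if (empty($post->children(0)->children(1)->children(0)->plaintext)) {
        // No notice text found, so keep the link
        $url_results[] = $url . $post->children(0)->href;
    }
}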
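
The positional check above depends on the exact nesting and on the notice span containing text. As a rough alternative sketch, the same filter can be written against class names instead of child positions. Note that this swaps in PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM, and that the function name collectLinksWithoutNotice, the sample markup, and the https://example.com base URL are all made up for illustration:

```php
<?php
// Hypothetical helper (not from the original post): returns base URL + href for every
// li.list-items__item that has no span.list-items__item__notice anywhere inside it.
function collectLinksWithoutNotice(string $html, string $baseUrl): array
{
    $doc = new DOMDocument();
    // Scraped markup is rarely valid, so collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Match a whole class token, so "list-items__item" does not also match
    // "list-items__item__notice" by substring.
    $hasClass = static function (string $class): string {
        return "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
    };

    $query = '//li[' . $hasClass('list-items__item') . ']'
           . '[not(.//span[' . $hasClass('list-items__item__notice') . '])]'
           . '/a[@href][1]';

    $results = [];
    foreach ($xpath->query($query) as $anchor) {
        $results[] = $baseUrl . $anchor->getAttribute('href');
    }
    return $results;
}

// Assumed sample input: the second item carries a notice span and should be skipped.
$sample = '<ul>
  <li class="list-items__item"><a href="/a"><div class="list-items__item__title"></div></a></li>
  <li class="list-items__item"><a href="/b"><div class="list-items__item__title"><span class="list-items__item__notice">Ad</span></div></a></li>
</ul>';

$links = collectLinksWithoutNotice($sample, 'https://example.com');
print_r($links);
```

With the sample above, only the first link is kept. Staying within Simple HTML DOM, a similar class-based check would presumably be $post->find('span.list-items__item__notice', 0) === null, but that variant is untested here.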

