
Using simple_html_dom to scrape

I am trying to scrape some content using simple_html_dom without luck.

I am trying to grab the title, image path and the link and display it.

The HTML structure is:

<div class="article_item clearfix">
<h2 class="title"><a href="http://www.demodomain/articleid=1">My amazing Title</a></h2>
<p class="date">September 22 2014</p>
<p class="image_left">
<a href="http://www.demodomain/articleid=1">
<img src="http://www.demodomain/photos/cef78533cd5.jpg" alt="My amazing post ">
</a>
</p>
<p>This is a demo description<strong>of this amazing</strong> article</p>
<p class="more"><a href="http://www.demodomain/articleid=1" class="blued_links">Read more...</a></p>
</div>

My code so far:

foreach($html->find('article_item') as $article) {
    $item['title']   = $article->find('.title, a', 0)->plaintext;
    $item['thumb']  = $article->find('.image_left img', 0)->src;
    $item['details'] = $article->find('p', 0)->plaintext;
    $item['url'] = $article->find('.more, a', 0)->plaintext;
       


    echo 'Title: ' . $item['title'];
    echo "<br/>";
    echo "image url: " . $item['thumb'];
    echo "<br/>";
    echo "Description: " . $item['details'];
    echo "<br/>";
    echo "Read More Url: " . $item['url'];
}



// Clear dom object
$html->clear(); 
unset($html); 

You didn't state what's not working, but consider this example:

// Point to the div tag with the class name article_item
foreach ($html->find('div.article_item') as $div) {
    // Target the anchor inside the h2 tag with class title,
    // the same way you would select it with jQuery
    $title = $div->find('h2.title a', 0)->innertext;
    $thumb = $div->find('p.image_left img', 0)->src;
    // children() is zero-based: h2, p.date, p.image_left, then the description <p>
    $details = $div->children(3)->plaintext;
    // $url = $div->find('p.more', 0)->plaintext; // this would return the link text
    $url = $div->find('p.more a', 0)->href;

    echo $title . '<br/>';
    echo $thumb . '<br/>';
    echo $details . '<br/>';
    echo $url . '<br/>';
}

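Basically, find() takes CSS-style selectors, much like jQuery. In your code, 'article_item' has no leading dot, so it is read as a tag name and matches nothing. The comma in '.title, a' means "either .title or a", not "an a inside .title". Also, find('p', 0) returns the first <p>, which is the date, and the link should be read from href rather than plaintext.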
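Since find() with an index returns null when nothing matches, reading a property straight off the result can fail on pages whose markup differs slightly. Below is a minimal sketch of a more defensive version of the loop above. It assumes simple_html_dom.php sits next to the script (that path is an assumption), and value_or_default() is a hypothetical helper written for this example, not part of the library.

```php
<?php
// Assumption: the library file is in the same directory as this script.
require_once 'simple_html_dom.php';

// Sample markup copied from the question.
$markup = <<<'HTML'
<div class="article_item clearfix">
<h2 class="title"><a href="http://www.demodomain/articleid=1">My amazing Title</a></h2>
<p class="date">September 22 2014</p>
<p class="image_left">
<a href="http://www.demodomain/articleid=1">
<img src="http://www.demodomain/photos/cef78533cd5.jpg" alt="My amazing post ">
</a>
</p>
<p>This is a demo description<strong>of this amazing</strong> article</p>
<p class="more"><a href="http://www.demodomain/articleid=1" class="blued_links">Read more...</a></p>
</div>
HTML;

// Hypothetical helper: read a property from a node, or fall back
// to a default when find()/children() returned null.
function value_or_default($node, $name, $default = '')
{
    return $node ? trim($node->$name) : $default;
}

$html = str_get_html($markup);

foreach ($html->find('div.article_item') as $div) {
    $title   = value_or_default($div->find('h2.title a', 0), 'plaintext');
    $thumb   = value_or_default($div->find('p.image_left img', 0), 'src');
    $details = value_or_default($div->children(3), 'plaintext');
    $url     = value_or_default($div->find('p.more a', 0), 'href');

    echo 'Title: ' . $title . '<br/>';
    echo 'Image url: ' . $thumb . '<br/>';
    echo 'Description: ' . $details . '<br/>';
    echo 'Read more url: ' . $url . '<br/>';
}

// Clear the DOM object to free memory.
$html->clear();
unset($html);
```

With this shape, a missing element just produces an empty string instead of an error on a null value.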

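Can you try it like this? Note that find() returns an array unless you pass an index as the second argument, so pass 0 to get the first match:

$item['title'] = $article->find('h2.title', 0)->plaintext;
$item['thumb'] = $article->find('p.image_left', 0)->find('img', 0)->src;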

