
HTML DOM scraping using Simple HTML DOM PHP class

I am having trouble targeting the plain text (the author name) in this HTML snippet.

I will have many of these rows on a page, and I am using the Simple HTML DOM scraper PHP class, located here: http://sourceforge.net/projects/simplehtmldom/files/

It's pretty nice and fairly easy to use and understand. I'm just a bit stuck on how to target the plain text (the author name in this demo).

<tr>
    <td style="vertical-align: top;">Some Time xx:xx am</td>
    <td><a href="javascript:void(0)" onclick="window.open('link-path-url.ext'); return false;"><strong>Some Title</strong></a>&nbsp;&nbsp;<img alt="VIDEO" border="0" height="12" src="/images/template/video_icon.jpg" width="12" /><br />Author Name<em> - Institute Name</em></td>
</tr>

I need to grab 4 values from each 'block' like this:

link/path - grabbing correctly so far

title - grabbing correctly so far

author name - this is the one I'm having a problem targeting

institute name - grabbing correctly so far

Here is the PHP I have been playing with/testing with so far:

foreach($html->find('tbody td a') as $element){
    echo 'LINK: ' . $parsedLink = substr($element->onclick, 13, -17) . '<br>';
    $title = $element->find('strong',0);
    echo 'TITLE: '. $title . '<br>';
    $institute = $element->parent()->last_child();
    echo 'INSTITUTE: '. $institute . '<br>';
    //$author = $element->parent()->find('text');
    $author = $element->parent()->last_child()->prev_sibling();
    echo 'AUTHOR: '. $author . '<br>';
}

I've tried using innertext, outertext, plaintext, text blocks, etc.

But I can NOT seem to target the plain text (innertext?) that comes before the <em></em> element (the author name text).

How can I target/grab this value/element/text?

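The correct way to target the above is like so. The author name is a bare text node rather than an element, which is presumably why prev_sibling() did not return it; find('text') does: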

foreach($html->find('tbody td a[onclick]') as $element){
    // strip the leading "window.open('" (13 chars) and the trailing "'); return false;" (17 chars)
    $parsedLink = substr($element->onclick, 13, -17);
    $title = $element->find('strong',0);
    // find('text') returns an array of the text nodes inside the parent <td>
    $author = $element->parent()->find('text');
    $institute = $element->parent()->last_child();
    echo 'LINK: ' . $parsedLink . '<br>';
    echo 'TITLE: '. $title . '<br>';
    echo 'AUTHOR: '. $author[2] . '<br>'; // index 2 is the author name in this markup
    echo 'INSTITUTE: '. $institute . '<br>';
}

Hopefully it'll help others!

Thanks!
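
For comparison, here is a rough sketch of the same extraction using PHP's built-in DOM extension instead of Simple HTML DOM. It is not the approach from the question, just an alternative under a few assumptions: the dom extension is enabled, every row looks like the sample above, and the function name extractRows and the $sample variable are made up for this example. The point is that the author name is a text node, so XPath's text() can select it directly instead of relying on a fixed array index.

```php
<?php
// Sample row copied from the question, wrapped in a table so the parser keeps <tr>/<td>.
$sample = <<<'HTML'
<table><tbody><tr>
    <td style="vertical-align: top;">Some Time xx:xx am</td>
    <td><a href="javascript:void(0)" onclick="window.open('link-path-url.ext'); return false;"><strong>Some Title</strong></a>&nbsp;&nbsp;<img alt="VIDEO" border="0" height="12" src="/images/template/video_icon.jpg" width="12" /><br />Author Name<em> - Institute Name</em></td>
</tr></tbody></table>
HTML;

// Hypothetical helper: returns one associative array per <a onclick="..."> cell.
function extractRows(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep warnings about sloppy markup quiet
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query('//td/a[@onclick]') as $a) {
        $td = $a->parentNode;
        // The author is the text node closest before the <em> element.
        $authorNode = $xpath->query('em/preceding-sibling::text()[1]', $td)->item(0);
        $em = $xpath->query('em', $td)->item(0);
        $strong = $xpath->query('strong', $a)->item(0);
        // Pull the URL out of window.open('...') rather than counting characters.
        preg_match("/window\.open\('([^']*)'\)/", $a->getAttribute('onclick'), $m);

        $rows[] = [
            'link'      => $m[1] ?? null,
            'title'     => $strong ? trim($strong->textContent) : null,
            'author'    => $authorNode ? trim($authorNode->nodeValue) : null,
            // Drop the leading " - " separator inside the <em>.
            'institute' => $em ? trim($em->textContent, " -\t\n\r") : null,
        ];
    }
    return $rows;
}

print_r(extractRows($sample));
```

With many rows on the page, extractRows() returns one entry per row, so the same loop covers the whole table.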

Personally, I'm sick of using Simple HTML DOM Parser. Its memory issues gave me too much of a headache, and the problem has existed for too long without a good solution for my taste (I know, I know, you just need to free the memory manually, but try doing that with recursive functions...). The truth is that the simplest solutions are the best, so I started using combinations of explode() calls, which are enough for 98% of my scraping problems (and much faster too, since creating and destroying a DOM object takes some time). Try this:

class Scrap {

    private $link;
    private $title;
    private $institute;
    private $author;
    private $html;

    function __construct($url) {
        $this->html = $this->curlDownload($url);
    }

    private function curlDownload($Url){
        if (!function_exists('curl_init')){
            die('Sorry cURL is not installed!');
        }
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $Url);
        curl_setopt($ch, CURLOPT_REFERER, "http://www.google.com");
        curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        $output = curl_exec($ch);
        curl_close($ch);
        return $output;
    }

    function scrapLink() {
        if(empty($this->link)) {
            $link = explode('<td><a href="javascript:void(0)" onclick="window.open(\'', $this->html);
            $link = explode('\')', $link[1]);
            $link = $link[0];
            $this->link = $link;
        }
        return $this->link;
    }

    function scrapTitle() {
        if(empty($this->title)) {
            $title = explode('<td><a href="javascript:void(0)" onclick="window.open(\'link-path-url.ext\'); return false;"><strong>', $this->html);
            $title = explode('</strong>', $title[1]);
            $title = $title[0];
            $this->title = $title;
        }
        return $this->title;
    }

    function scrapInstitute() {
        if(empty($this->institute)) {
            $institute = explode('<td><a href="javascript:void(0)" onclick="window.open(\'link-path-url.ext\'); return false;"><strong>Some Title</strong></a>&nbsp;&nbsp;<img alt="VIDEO" border="0" height="12" src="/images/template/video_icon.jpg" width="12" /><br />Author Name<em> -', $this->html);
            $institute = explode('</em>', $institute[1]);
            $institute = trim($institute[0]);
            $this->institute = $institute;
        }
        return $this->institute;
    }

    function scrapAuthor() {
        if(empty($this->author)) {
            $author = explode('<td><a href="javascript:void(0)" onclick="window.open(\'link-path-url.ext\'); return false;"><strong>Some Title</strong></a>&nbsp;&nbsp;<img alt="VIDEO" border="0" height="12" src="/images/template/video_icon.jpg" width="12" /><br />', $this->html);
            $author = explode('<em>', $author[1]);
            $author = $author[0];
            $this->author = $author;
        }
        return $this->author;
    }

    function scrapAll() {
        $this->scrapLink();
        $this->scrapTitle();
        $this->scrapInstitute();
        $this->scrapAuthor();
        return array($this->link, $this->title, $this->institute, $this->author);
    }
}
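
Note that the class above only reads the first block on the page, and its title, institute and author needles contain the demo values ('link-path-url.ext', 'Some Title'), so they would need adjusting for real pages. Below is a minimal sketch of the same explode() idea applied to every block, assuming each block starts with the same `<td><a href="javascript:void(0)" onclick="window.open('` prefix as the sample. The names between() and scrapeBlocks() are hypothetical helpers invented for this example, not part of any library.

```php
<?php
// Hypothetical helper: the text between $start and $end, or null if either is missing.
function between(string $haystack, string $start, string $end): ?string
{
    $parts = explode($start, $haystack, 2);
    if (count($parts) < 2) {
        return null; // start marker not found
    }
    $rest = explode($end, $parts[1], 2);
    return count($rest) < 2 ? null : $rest[0];
}

// Hypothetical function: one result per block that starts with the link prefix.
function scrapeBlocks(string $html): array
{
    $marker = '<td><a href="javascript:void(0)" onclick="window.open(\'';
    $blocks = explode($marker, $html);
    array_shift($blocks); // drop everything before the first block

    $results = [];
    foreach ($blocks as $block) {
        $author = between($block, '<br />', '<em>');
        $institute = between($block, '<em>', '</em>');
        $results[] = [
            // Each block begins right after window.open(' so the link ends at ').
            'link'      => explode("')", $block, 2)[0],
            'title'     => between($block, '<strong>', '</strong>'),
            'author'    => $author === null ? null : trim($author),
            // Strip the " - " separator that precedes the institute name.
            'institute' => $institute === null ? null : trim($institute, " -"),
        ];
    }
    return $results;
}
```

Usage would be something like `$rows = scrapeBlocks($pageHtml);` where $pageHtml is the downloaded page. Like any string-splitting approach, it breaks as soon as the markup changes (for example `<br>` instead of `<br />`).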
