
HTML DOM scraping using Simple HTML DOM PHP class

I am having trouble targeting the plain text (the author name) in this HTML snippet.

There will be many of these rows on a page, and I am using the Simple HTML DOM PHP class, available here: http://sourceforge.net/projects/simplehtmldom/files/

It's pretty nice and fairly easy to use and understand. I'm just a bit stuck on how to target the plain text (the author name in this demo):

<tr>
    <td style="vertical-align: top;">Some Time xx:xx am</td>
    <td><a href="javascript:void(0)" onclick="window.open('link-path-url.ext'); return false;"><strong>Some Title</strong></a>&nbsp;&nbsp;<img alt="VIDEO" border="0" height="12" src="/images/template/video_icon.jpg" width="12" /><br />Author Name<em> - Institute Name</em></td>
</tr>

I need to grab 4 values from each 'block' like this:

link/path - grabbing correctly so far

title - grabbing correctly so far

author name - this is the one I'm having a problem targeting

institute name - grabbing correctly so far

Here is the PHP I have been testing with so far:

foreach($html->find('tbody td a') as $element){
    echo 'LINK: ' . $parsedLink = substr($element->onclick, 13, -17) . '<br>';
    $title = $element->find('strong',0);
    echo 'TITLE: '. $title . '<br>';
    $institute = $element->parent()->last_child();
    echo 'INSTITUTE: '. $institute . '<br>';
    //$author = $element->parent()->find('text');
    $author = $element->parent()->last_child()->prev_sibling();
    echo 'AUTHOR: '. $author . '<br>';
}

I've tried using innertext, outertext, plaintext, text blocks, etc.

But I cannot seem to target the plain text (innertext?) that sits before the <em></em> element (the author name text).

How can I target and grab this text?

The correct way to target the above is like so (find('text') returns an array of text nodes, and for this markup the author name is the one at index 2):

foreach($html->find('tbody td a[onclick]') as $element){
    $parsedLink = substr($element->onclick, 13, -17);
    $title = $element->find('strong',0);
    $author = $element->parent()->find('text'); // <-- returns array
    $institute = $element->parent()->last_child();
    echo 'LINK: ' . $parsedLink . '<br>';    
    echo 'TITLE: '. $title . '<br>';    
    echo 'AUTHOR: '. $author[2] . '<br>';
    echo 'INSTITUTE: '. $institute . '<br>';     
}

Hopefully it'll help others!

Thanks!
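
For readers who would rather not rely on a fixed index into the text-node array, here is a minimal sketch of the same extraction using PHP's bundled DOMDocument and DOMXPath classes instead of Simple HTML DOM (a swapped-in technique, not the library used above). The sample row is the one from the question, wrapped in a table so it parses cleanly; the variable names and the $rows array are illustrative assumptions.

```php
<?php
// Sketch only: uses PHP's built-in DOM extension, not Simple HTML DOM.
// The row below is the sample from the question, wrapped in <table>/<tbody>.
$sample = <<<'HTML'
<table><tbody><tr>
    <td style="vertical-align: top;">Some Time xx:xx am</td>
    <td><a href="javascript:void(0)" onclick="window.open('link-path-url.ext'); return false;"><strong>Some Title</strong></a>&nbsp;&nbsp;<img alt="VIDEO" border="0" height="12" src="/images/template/video_icon.jpg" width="12" /><br />Author Name<em> - Institute Name</em></td>
</tr></tbody></table>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($sample);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$rows = [];

foreach ($xpath->query('//td/a[@onclick]') as $a) {
    $cell = $a->parentNode;

    // Pull the URL out of window.open('...') rather than counting characters.
    $link = preg_match("/window\.open\('([^']*)'\)/", $a->getAttribute('onclick'), $m)
        ? $m[1]
        : '';

    $strong = $xpath->query('strong', $a)->item(0);
    $em     = $xpath->query('em', $cell)->item(0);

    // The author is the text node immediately before <em> inside the cell.
    $authorNode = $xpath->query('em/preceding-sibling::text()[1]', $cell)->item(0);

    $rows[] = [
        'link'      => $link,
        'title'     => $strong ? trim($strong->textContent) : '',
        'author'    => $authorNode ? trim($authorNode->nodeValue) : '',
        // Strip the leading " - " separator that sits inside <em>.
        'institute' => $em ? preg_replace('/^\s*-\s*/', '', $em->textContent) : '',
    ];
}

foreach ($rows as $row) {
    echo implode(' | ', $row), PHP_EOL;
}
```

The expression em/preceding-sibling::text()[1] selects the text node closest to the <em>, so it should keep working if additional text nodes show up earlier in the cell.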

Personally, I'm sick of using Simple HTML DOM Parser. Its memory issues gave me too much of a headache, and the problem has existed for too long without a good solution for my taste (I know, I know, you just need to free the memory manually, but try doing that with recursive functions...). The simplest solutions are often the best, so I started using combinations of explode() calls, which are enough for 98% of my scraping problems (and much faster too, since creating and destroying a DOM object takes some time). Try this:

class Scrap {

    private $link;
    private $title;
    private $institute;
    private $author;
    private $html;

    function __construct($url) {
        $this->html = $this->curlDownload($url);
    }

    private function curlDownload($Url){
        if (!function_exists('curl_init')){
            die('Sorry cURL is not installed!');
        }
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $Url);
        curl_setopt($ch, CURLOPT_REFERER, "http://www.google.com");
        curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        $output = curl_exec($ch);
        curl_close($ch);
        return $output;
    }

    function scrapLink() {
        if(empty($this->link)) {
            $link = explode('<td><a href="javascript:void(0)" onclick="window.open(\'', $this->html);
            $link = explode('\')', $link[1]);
            $link = $link[0];
            $this->link = $link;
        }
        return $this->link;
    }

    function scrapTitle() {
        if(empty($this->title)) {
            $title = explode('<td><a href="javascript:void(0)" onclick="window.open(\'link-path-url.ext\'); return false;"><strong>', $this->html);
            $title = explode('</strong>', $title[1]);
            $title = $title[0];
            $this->title = $title;
        }
        return $this->title;
    }

    function scrapInstitute() {
        if(empty($this->institute)) {
            $institute = explode('<td><a href="javascript:void(0)" onclick="window.open(\'link-path-url.ext\'); return false;"><strong>Some Title</strong></a>&nbsp;&nbsp;<img alt="VIDEO" border="0" height="12" src="/images/template/video_icon.jpg" width="12" /><br />Author Name<em> -', $this->html);
            $institute = explode('</em>', $institute[1]);
            $institute = trim($institute[0]);
            $this->institute = $institute;
        }
        return $this->institute;
    }

    function scrapAuthor() {
        if(empty($this->author)) {
            $author = explode('<td><a href="javascript:void(0)" onclick="window.open(\'link-path-url.ext\'); return false;"><strong>Some Title</strong></a>&nbsp;&nbsp;<img alt="VIDEO" border="0" height="12" src="/images/template/video_icon.jpg" width="12" /><br />', $this->html);
            $author = explode('<em>', $author[1]);
            $author = $author[0];
            $this->author = $author;
        }
        return $this->author;
    }

    function scrapAll() {
        $this->scrapLink();
        $this->scrapTitle();
        $this->scrapInstitute();
        $this->scrapAuthor();
        return array($this->link, $this->title, $this->institute, $this->author);
    }
}
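
The class above hardcodes the demo values ('link-path-url.ext', 'Some Title', 'Author Name') in its delimiters, so it only matches that one sample row. A possible way to apply the same explode() idea to every row on a page is sketched below. The helper between() and the function scrapeRows() are hypothetical names introduced for illustration, the two sample rows use made-up values in the question's markup, and the type hints assume PHP 7.1 or later.

```php
<?php
// Hypothetical helper: text between $start and $end, or null if a marker is missing.
function between(string $haystack, string $start, string $end): ?string
{
    $parts = explode($start, $haystack, 2);
    if (count($parts) < 2) {
        return null;
    }
    $rest = explode($end, $parts[1], 2);
    return count($rest) < 2 ? null : $rest[0];
}

// Hypothetical function: split the page at each link opener and parse every chunk.
function scrapeRows(string $html): array
{
    $chunks = explode('<a href="javascript:void(0)" onclick="window.open(\'', $html);
    array_shift($chunks); // discard everything before the first link

    $rows = [];
    foreach ($chunks as $chunk) {
        $author    = between($chunk, '<br />', '<em>');
        $institute = between($chunk, '<em>', '</em>');

        $rows[] = [
            // Each chunk starts with the URL, which ends at the closing "')".
            'link'      => explode("')", $chunk, 2)[0],
            'title'     => between($chunk, '<strong>', '</strong>'),
            'author'    => $author === null ? null : trim($author),
            // Drop the leading " - " separator inside <em>.
            'institute' => $institute === null ? null : preg_replace('/^\s*-\s*/', '', $institute),
        ];
    }
    return $rows;
}

// Made-up sample rows that follow the structure of the question's snippet.
$sample = '<tr><td><a href="javascript:void(0)" onclick="window.open(\'first.ext\'); return false;">'
        . '<strong>First Title</strong></a><br />First Author<em> - First Institute</em></td></tr>'
        . '<tr><td><a href="javascript:void(0)" onclick="window.open(\'second.ext\'); return false;">'
        . '<strong>Second Title</strong></a><br />Second Author<em> - Second Institute</em></td></tr>';

$rows = scrapeRows($sample);
foreach ($rows as $row) {
    echo implode(' | ', $row), PHP_EOL;
}
```

Like any string-splitting approach, this depends on the markup staying byte-for-byte stable (for example '<br />' rather than '<br>'), so a missing marker yields null for that field instead of a wrong value.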
