I have been scraping websites with cURL for a while, and also with Simple HTML DOM. In my experience, cURL is much better at fetching pages, but I really like the simplicity of Simple HTML DOM. So I figured, why not combine the two? I tried:
require_once('simple_html_dom.php');
$url = 'http://news.yahoo.com/';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);
$html = new simple_html_dom();
$html->load($curl_scraped_page);
foreach($html->find('head') as $d) {
$d->innertext = "<base href='$url'>" . $d->innertext;
}
echo $html->save();
I did my best but it doesn't work. What else can I try?
Try changing this:
$html->load($curl_scraped_page);
To this:
$html->load($curl_scraped_page, true, false);
The problem is that simple_html_dom strips all \r and \n characters by default (the third parameter to load() is $stripRN, which defaults to true). In this case that breaks the page's JavaScript, because Yahoo doesn't end every statement with a semicolon; passing false as the third argument keeps the line breaks intact.
You can see the resulting JavaScript error in the browser console, and you can confirm that simple_html_dom removed the line breaks by viewing the generated source.
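Putting it together, here is a minimal sketch of your original code with that one change applied (assuming the stock simple_html_dom.php is in your include path):

require_once('simple_html_dom.php');

$url = 'http://news.yahoo.com/';

// Fetch the page with cURL
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$curl_scraped_page = curl_exec($ch);
curl_close($ch);

// Parse without stripping \r and \n (third argument $stripRN = false)
$html = new simple_html_dom();
$html->load($curl_scraped_page, true, false);

// Prepend a <base> tag so relative URLs resolve against the original site
foreach ($html->find('head') as $d) {
    $d->innertext = "<base href='$url'>" . $d->innertext;
}

echo $html->save();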
Alternatively, I would add a function to the class that never strips line breaks:
function loadWithoutRemovingStuff($str, $lowercase=true, $stripRN=false, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
    // Same as load(), but $stripRN defaults to false so \r and \n are kept
    $this->prepare($str, $lowercase, $stripRN, $defaultBRText, $defaultSpanText);
    while ($this->parse());
    $this->root->_[HDOM_INFO_END] = $this->cursor;
    $this->parse_charset();
    return $this;
}
and then call that function instead of the default load() function.
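For example (loadWithoutRemovingStuff is just the illustrative name chosen above, not part of the stock library):

$html = new simple_html_dom();
$html->loadWithoutRemovingStuff($curl_scraped_page); // \r and \n are preserved
echo $html->save();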
Or, since everything is public in this class,
$html = new simple_html_dom();
// Reproduce load() by hand, passing false for $stripRN
$html->prepare($curl_scraped_page, true, false, DEFAULT_BR_TEXT, DEFAULT_SPAN_TEXT);
while ($html->parse());
$html->root->_[HDOM_INFO_END] = $html->cursor;
$html->parse_charset();
but the first approach is cleaner, since it keeps the parsing steps encapsulated in the class instead of duplicating load()'s internals at every call site.