
Combine CURL and Simple HTML DOM for scraping data

I have been assigned a task to scrape data from a password-protected site. I can log in with cURL, but now I want to extract a link from the HTML that cURL returns, follow that link, and grab data from the page it points to. I passed the cURL response to file_get_contents(), but that does not work. Here is my cURL code:

$ckfile = tempnam("/tmp", "CURLCOOKIE");
$useragent = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.2 (KHTML,    like Gecko) Chrome/5.0.342.3 Safari/533.2';

$username = "XXXXXX";
$password = "XXXXXX";


$f = fopen('log.txt', 'w'); // file to write request header for debug purpose


$ch = curl_init($url);
curl_setopt($ch, CURLOPT_COOKIEJAR, $ckfile);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $useragent);

$html = curl_exec($ch);

curl_close($ch);

preg_match('~<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="(.*?)" />~', $html, $viewstate);
preg_match('~<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION"   value="(.*?)" />~', $html, $eventValidation);

$viewstate = $viewstate[1];
$eventValidation = $eventValidation[1];




$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, false);
curl_setopt($ch, CURLOPT_COOKIEJAR, $ckfile);
curl_setopt($ch, CURLOPT_COOKIEFILE, $ckfile);
curl_setopt($ch, CURLOPT_HEADER, FALSE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_REFERER, $url);
curl_setopt($ch, CURLOPT_VERBOSE, 1);
curl_setopt($ch, CURLOPT_STDERR, $f);
curl_setopt($ch, CURLOPT_USERAGENT, $useragent);

// Collecting all POST fields
$postfields = array();
$postfields['__EVENTTARGET'] = "";
$postfields['__EVENTARGUMENT'] = "";
$postfields['__VIEWSTATE'] = $viewstate;
$postfields['__EVENTVALIDATION'] = $eventValidation;
$postfields['ctl00$LoginPopup1$Login1$UserName'] = $username;
$postfields['ctl00$LoginPopup1$Login1$Password'] = $password;
$postfields['ctl00$LoginPopup1$Login1$LoginButton'] = 'Log In';

curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postfields);
$ret = curl_exec($ch); // Get result after login page.

Here is my Simple HTML DOM code:

$html = file_get_contents($ret);

This is the error I am getting:

Warning: file_get_contents(1): failed to open stream: No such file or directory

Any other suggestions on how to do this would be appreciated. Thanks.

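If you want the HTML output of the page you are sending the request to, set CURLOPT_RETURNTRANSFER to true. With it set to false, curl_exec() prints the response and returns the boolean true, which file_get_contents() casts to the string "1" and then tries to open as a file; that is where the "file_get_contents(1)" warning comes from. With it set to true, $ret contains the HTML of the page as a string.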
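As a minimal sketch of that change, the request can be wrapped in a helper that always returns the body as a string and fails loudly otherwise. The function name `fetch_html` and its parameters are hypothetical, not part of the question's code, and the POST body is URL-encoded here as an assumption about what the login form expects.

```php
<?php
// Hypothetical helper: perform a request and return the response body as a string.
// $cookieFile is the cookie jar path (for example, the $ckfile from the question).
function fetch_html(string $url, string $cookieFile, ?array $postFields = null): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);    // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);  // cookies are written here
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); // cookies are read from here
    if ($postFields !== null) {
        curl_setopt($ch, CURLOPT_POST, true);
        // Assumption: send application/x-www-form-urlencoded.
        // Passing the raw array instead would send multipart/form-data.
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postFields));
    }
    $body = curl_exec($ch);
    if ($body === false) {
        $error = curl_error($ch);
        curl_close($ch);
        throw new RuntimeException("cURL request failed: $error");
    }
    curl_close($ch);
    return $body;
}
```

Because the return value is a string, it can be handed directly to a parser or a regular expression; file_get_contents() is not needed at any point.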

I wouldn't use DOMDocument to parse the response, as the HTML from the page may not be correctly formatted and DOMDocument will complain.

If you are just looking for links you could use preg_match_all on the HTML.
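A rough sketch of that approach follows. The sample markup is made up for illustration and stands in for the string returned by curl_exec(); the pattern only handles quoted href attributes, so treat it as a starting point rather than a general HTML parser.

```php
<?php
// Made-up sample markup standing in for the HTML returned by curl_exec().
$html = '<p><a href="/reports/1">First</a> <a class="x" href=\'/reports/2\'>Second</a></p>';

// Capture the href value of every <a> tag, in single or double quotes.
preg_match_all('~<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1~is', $html, $matches);

// Group 2 holds the URLs; decode entities such as &amp; in query strings.
$links = array_map('html_entity_decode', $matches[2]);

print_r($links);
```

Relative URLs such as the ones above still have to be resolved against the site's base URL before they can be requested.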

As MajorCaiger says, you need to set CURLOPT_RETURNTRANSFER to true, and then load the returned string with str_get_html:

$html = curl_exec($ch);
$doc = str_get_html($html);
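From there, the links can be pulled out with the library's find() method. The sketch below assumes the Simple HTML DOM file is named simple_html_dom.php and sits next to the script; the helper name `collect_links` is hypothetical.

```php
<?php
// Assumed location of the Simple HTML DOM library; adjust to your setup.
$lib = __DIR__ . '/simple_html_dom.php';
if (is_file($lib)) {
    require_once $lib;
}

// Hypothetical helper: return the href of every <a> element in $html.
// Requires Simple HTML DOM to be loaded (it provides str_get_html()).
function collect_links(string $html): array
{
    $doc = str_get_html($html);
    if ($doc === false) { // str_get_html() returns false for empty or oversized input
        return [];
    }
    $links = [];
    foreach ($doc->find('a') as $anchor) {
        $links[] = $anchor->href;
    }
    return $links;
}
```

To follow one of the returned links, request it with a new cURL handle that uses the same cookie file for CURLOPT_COOKIEFILE as the login request, so the session cookie is sent along, and resolve relative hrefs against the site's base URL first.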

Even so, I don't think you have much chance of success with this; those ASP.NET forms are very tricky.


 