How to scrape dynamic data with PHP Simple HTML DOM Parser
First, let me say that I have read numerous "scraping" threads on here and none have helped me. I have also searched the internet for days, and now that I am getting down to the wire I am hoping someone can shed some light on this for me.
I am using PHP Simple HTML DOM Parser to scrape some data from a page. The URL I am working with serves dynamic content, and I cannot get anything to pull that content in. I need to scrape the plain text from
<tr id="0" class="ui-widget-content jqgrow ui-row-ltr" role="row">
to
<tr id="9" class="ui-widget-content jqgrow ui-row-ltr" role="row">
I feel that once I get one row to work, I can get the others. Because this info is not actually in the page when it first loads, but is filled in after the page loads, I am in a rut.
With that said, here is what I have tried:
echo file_get_html('http://sheriffclevelandcounty.com/p2c/jailinmates.aspx')->plaintext;
The above shows me everything BUT the info I need.
I also tried the example from the plugin that uses IMDb, modified to my needs:
// Basic cURL helper: downloads $url and returns the response body
function curl($url) {
    // Assigning cURL options to an array
    $options = array(
        CURLOPT_RETURNTRANSFER => true, // Return the webpage data instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // Follow 'Location' HTTP headers
        CURLOPT_AUTOREFERER    => true, // Set the referer automatically when following 'Location' headers
        CURLOPT_CONNECTTIMEOUT => 120,  // Seconds to wait while trying to connect
        CURLOPT_TIMEOUT        => 120,  // Maximum seconds the whole request may take
        CURLOPT_MAXREDIRS      => 10,   // Maximum number of redirects to follow
        CURLOPT_USERAGENT      => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8", // User agent
        CURLOPT_URL            => $url, // URL passed into the function
    );
    $ch = curl_init();                // Initialising cURL
    curl_setopt_array($ch, $options); // Applying the options assigned above
    $data = curl_exec($ch);           // Executing the request; $data holds the response (or false on failure)
    curl_close($ch);                  // Closing cURL
    return $data;
}
// Basic scraping helper: returns the text between $start and $end (case-insensitive)
function scrape_between($data, $start, $end) {
    $data = stristr($data, $start);          // Stripping all data from before $start
    if ($data === false) {
        return '';                           // $start was not found
    }
    $data = substr($data, strlen($start));   // Stripping $start
    $stop = stripos($data, $end);            // Position of $end in the remaining data
    if ($stop === false) {
        return '';                           // $end was not found
    }
    return substr($data, 0, $stop);          // Stripping $end and everything after it
}
$scraped_page = curl("http://sheriffclevelandcounty.com/p2c/jailinmates.aspx"); // Downloading the jail inmates page into $scraped_page
$scraped_data = scrape_between($scraped_page, '<table id="tblII" class="ui-jqgrid-btable" cellspacing="0" cellpadding="0" border="0" role="grid" aria-multiselectable="false" aria-labelledby="gbox_tblII" style="width: 456px;">', '</table>'); // Scraping the content between the opening <table id="tblII" ...> tag and </table>
echo $scraped_data; // Echoing the scraped data
Of course neither of these works, so my question is: how do I use the PHP Simple HTML DOM Parser to get dynamic content that is loaded after page load? Is it possible, or am I completely on the wrong track here?
I understand that you need the dynamic data that is loaded into the jqGrid. For that, you can send a POST request to the handler URL below, which returns the data in its response.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://sheriffclevelandcounty.com/p2c/jqHandler.ashx?op=s");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // Return the response instead of printing it
curl_setopt($ch, CURLOPT_POST, 1);           // Send the request as a POST
curl_setopt($ch, CURLOPT_POSTFIELDS, array(
    'rows' => 10000, // Here you can specify how many records you want
    't'    => 'ii'
));
$output = curl_exec($ch);
curl_close($ch);
echo "<pre>";
print_r(json_decode($output)); // Decode the JSON response and dump it
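Once the response is decoded, the rows still need to be turned into plain text. The following is only a rough sketch of one way to do that. It assumes the handler answers with jqGrid's default layout (an object with a "rows" array), which is not confirmed for this site. The function name rows_to_text and the sample data are made up for illustration, and the real field names may differ.

```php
<?php
// Hypothetical helper: turns a jqGrid-style JSON response into plain-text lines.
// Assumption: the response is an object with a "rows" array, each row being an
// object of scalar cells. The real handler may use a different shape.
function rows_to_text(string $json): array
{
    $decoded = json_decode($json, true); // true => associative arrays
    if (!is_array($decoded) || !isset($decoded['rows']) || !is_array($decoded['rows'])) {
        return []; // invalid JSON or unexpected shape
    }

    $lines = [];
    foreach ($decoded['rows'] as $row) {
        if (!is_array($row)) {
            continue; // skip anything that is not a row object/array
        }
        // Keep only scalar cells and join them with " | "
        $cells = array_filter($row, 'is_scalar');
        $lines[] = implode(' | ', array_map('strval', $cells));
    }
    return $lines;
}

// Made-up sample standing in for $output from curl_exec()
$sample = '{"records":2,"rows":[{"name":"DOE, JOHN","age":"34"},{"name":"ROE, JANE","age":"29"}]}';
foreach (rows_to_text($sample) as $line) {
    echo $line, PHP_EOL;
}
```

With the real request you would pass $output instead of $sample. If the result comes back empty, dump the decoded response first to see what the handler actually returns and adjust the "rows" key accordingly.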