I need to scrape my (real estate) client's old site so that its data can go into the new one I have created.
I am using curl, and the scrape goes two levels deep: the index page and then the property detail page.
In the index page I need curl to get the number of pages, so the next part of my script can delve into all those pages and get all the property data for each property.
In the first function (parseURL) I need to get the number of pages:
/* This function does the initial parsing to get the number of pages */
public function parseURL($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $curl_scraped_page = curl_exec($ch);
    // bring back all the html of the page
    // echo "csp=$curl_scraped_page";
    $data = str_replace(array("\n", "\r"), "", preg_replace('/(?:(?<=\>)|(?<=\/\>))(\s+)(?=\<\/?)/', "", $curl_scraped_page));
    curl_close($ch);
    $regex = '#<div style="float:right;width:540px"><h3 style="margin-top:0px">(.*)</h3><h4>(.*)</h4>(.*)<div style="padding:5px"><a href="(.*)">(.*)</a></div></div>#siU';
    // $regex = '#<div class="propertyListLinks"><a href="(.*)">(.*)</a></div#siU';
    preg_match_all($regex, $data, $this->details);
    $regex2 = '#[[0-9]{1,4}]#';
    // echo "<br />data=\n$data<br />";
    preg_match_all($regex2, $data, $this->pagination);
    // exit;
}
This was written for me many moons ago. I don't recall what the regex is doing, and I want to understand it so I can develop it for my current needs.
Please advise me on:
1) What are the # characters doing in the $regex and $regex2 strings?
2) What does siU mean at the end of the $regex string?

1) # is a regex pattern delimiter, i.e. it marks the start and end of your pattern. The hash is one of several characters allowed as delimiters under the PCRE flavour of regex that PHP uses.
2) These are flags (pattern modifiers), which tell the pattern how to behave in certain regards. In your case:
s means a dot (.) matches any character, including newlines.
i means the pattern ignores case.
U means quantifiers match in ungreedy (lazy) fashion by default.
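
To see the delimiter and each flag in action, here is a minimal sketch. The sample HTML string and the variable names are made up for illustration and are not taken from the site being scraped. It runs one pattern with progressively more flags, then repeats the last call with a different delimiter.

```php
<?php
// Hypothetical sample input: one upper-case tag pair, one pair spanning a newline.
$html = "<H3>One</H3><h3>Two\nlines</h3>";

// No flags: case-sensitive, dot stops at newlines, so nothing matches.
preg_match_all('#<h3>(.*)</h3>#', $html, $noFlags);

// i: case is ignored, so <H3>One</H3> now matches.
preg_match_all('#<h3>(.*)</h3>#i', $html, $caseless);

// s + i: dot also matches newlines; the greedy .* runs to the LAST </h3>.
preg_match_all('#<h3>(.*)</h3>#si', $html, $dotAll);

// s + i + U: quantifiers become ungreedy, so each heading is matched on its own.
preg_match_all('#<h3>(.*)</h3>#siU', $html, $ungreedy);

// Same pattern with "/" as the delimiter: the "/" inside </h3> must be escaped.
preg_match_all('/<h3>(.*)<\/h3>/siU', $html, $slash);

var_dump($noFlags[1]);   // empty array
var_dump($caseless[1]);  // ['One']
var_dump($dotAll[1]);    // ["One</H3><h3>Two\nlines"]
var_dump($ungreedy[1]);  // ['One', "Two\nlines"]
var_dump($slash === $ungreedy); // true
```

This also suggests why # is a convenient delimiter for HTML patterns: closing tags contain /, which would need escaping if / were the delimiter.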