
curl scraping a single website two levels deep

I need to scrape my (real estate) client's old site so that its data can be loaded into the new site I have created.
I am using curl.
The scrape goes two levels deep: the index page, then each property detail page.
On the index page I need curl to get the number of pages, so that the next part of my script can go through all those pages and collect the data for each property.

In the first function ( parseURL ) I need to get the number of pages:

/* This function does the initial parsing to get the number of pages */
public function parseURL($url) {
  $ch = curl_init($url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  $curl_scraped_page = curl_exec($ch);
  // bring back all the html of the page
  // echo "csp=$curl_scraped_page";
  $data = str_replace(array("\n", "\r"), "", preg_replace('/(?:(?<=\>)|(?<=\/\>))(\s+)(?=\<\/?)/', "", $curl_scraped_page));
  curl_close($ch);
  $regex = '#<div style="float:right;width:540px"><h3 style="margin-top:0px">(.*)</h3><h4>(.*)</h4>(.*)<div style="padding:5px"><a href="(.*)">(.*)</a></div></div>#siU';
  // $regex = '#<div class="propertyListLinks"><a href="(.*)">(.*)</a></div#siU';
  preg_match_all($regex, $data, $this->details);
  $regex2 = '#[[0-9]{1,4}]#';
  // echo "<br />data=\n$data<br />";
  preg_match_all($regex2, $data, $this->pagination);
  // exit;
}

This was written for me many moons ago. I don't recall what the regex is doing, and I want to understand it so I can adapt it to my current needs.

Please advise me on:

  1. What is the # doing in the $regex and $regex2 strings?
  2. What does siU mean at the end of the $regex string?

1) # is a regex pattern delimiter, i.e. it marks the start and end of your pattern. The hash is one of several characters allowed as delimiters in PCRE, the regex flavour that PHP uses.
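To illustrate the delimiter point, here is a minimal sketch. The sample string is made up for the example and is not taken from the site being scraped. It shows the same pattern written once with # and once with /. With / as the delimiter, the slash in the closing tag has to be escaped, which is one reason # is convenient for HTML.

```php
<?php
// Hypothetical sample input, not from the original site.
$html = '<h3>Flat in town</h3>';

// Same pattern, two different delimiters.
$withHash  = '#<h3>(.*)</h3>#';    // "/" needs no escaping here
$withSlash = '/<h3>(.*)<\/h3>/';   // "/" inside the pattern must be escaped

preg_match($withHash, $html, $a);
preg_match($withSlash, $html, $b);

// Both patterns capture the same text in group 1.
var_dump($a[1] === $b[1]);
```

Anything after the closing delimiter (such as siU in your pattern) is read as flags rather than as part of the pattern.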

2) These are flags (pattern modifiers), which tell the pattern how to behave in certain respects. In your case:

  • s means a dot ( . ) matches any character, including newlines.
  • i means the pattern ignores case.
  • U means quantifiers ( * , + , {n,m} ) are ungreedy (lazy) by default; adding ? after one makes it greedy again.
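The effect of each flag is easiest to see by running one pattern against the same input with different flag combinations. The following is only a sketch: the input string and variable names are invented for the example.

```php
<?php
// Hypothetical input: mixed-case tags, and one heading that spans two lines.
$text = "<H3>First</H3>\n<h3>Second\nline</h3>";

// No flags: case-sensitive, "." stops at newlines, quantifiers are greedy.
preg_match_all('#<h3>(.*)</h3>#', $text, $plain);

// i only: the upper-case heading now matches, the two-line one still does not.
preg_match_all('#<h3>(.*)</h3>#i', $text, $caseless);

// s and i: "." crosses newlines, so the greedy (.*) runs from the first
// opening tag to the last closing tag and yields a single match.
preg_match_all('#<h3>(.*)</h3>#si', $text, $greedy);

// s, i and U: (.*) is now lazy, so each heading is matched separately.
preg_match_all('#<h3>(.*)</h3>#siU', $text, $lazy);

print_r($plain[1]);     // no matches
print_r($caseless[1]);  // "First"
print_r($greedy[1]);    // one long match spanning both headings
print_r($lazy[1]);      // "First" and "Second\nline"
```

This is why patterns that pull fields out of HTML, like your $regex, are usually ungreedy: without U (or an explicit .*?), each (.*) would swallow as much of the page as it could.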

Full reference: the "Pattern Modifiers" page in the PCRE section of the PHP manual.
