
How to extract contents from URLs?

I have a script that takes extremely long to run, and this is what it has to do:
I need to collect data from one website, and to do so my script has to visit over 15,000 pages of that site (i.e. www.website.com/item.php?rid=$_id), where $_id is the current iteration of a for loop.
Here are the problems:

  1. The method I am currently using to get the source code of each page is file_get_contents, and, as you can imagine, calling file_get_contents on 15,000+ pages takes a very long time.
  2. Each page contains over 900 lines of code, but all I need to extract is about 5 lines' worth, so it seems the script wastes a lot of time retrieving all 900 lines.
  3. Some of the pages do not exist (e.g. www.website.com/item.php?rid=2 may exist while www.website.com/item.php?rid=3 does not), so I need a way to quickly skip these pages before the script tries to fetch their contents and wastes a lot of time.

In short, I need a method of extracting a small portion of the page from 15,000 webpages in as quick and efficient a manner as possible.
Here is my current code.

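for ($_id = 0; $_id < 15392; $_id++){
    //****************************************************** Locating page
    $_location = "http://www.website.com/item.php?rid=".$_id;
    $_headers = @get_headers($_location);
    // get_headers returns FALSE on failure, so check that before indexing into it
    if($_headers === FALSE || strpos($_headers[0],"200") === FALSE){
        continue;
    } // end if
    $_source = @file_get_contents($_location);
    if($_source === FALSE){
        continue;
    } // end if
    //****************************************************** Extracting price
    $_needle_initial = "<td align=\"center\" colspan=\"4\" style=\"font-weight: bold\">Current Price:";
    $_needle_terminal = "</td>";
    $_position_initial = stripos($_source,$_needle_initial);
    if($_position_initial === FALSE){
        continue; // price cell not found on this page
    } // end if
    $_position_initial += strlen($_needle_initial);
    // Look for the closing tag after the opening needle, not from the top of the page
    $_position_terminal = stripos($_source,$_needle_terminal,$_position_initial);
    if($_position_terminal === FALSE){
        continue;
    } // end if
    $_length = $_position_terminal-$_position_initial;
    $_current_price = trim(strip_tags(substr($_source,$_position_initial,$_length)));
} // end for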

Any help at all is greatly appreciated since I really need a solution to this!
Thank you in advance for your help!

The short of it: don't.

Longer: if you want to do this much work, you shouldn't do it on demand. Do it in the background! You can use the code you have here, or any other method you're comfortable with, but instead of showing the result directly to a user, save it in a database or a local file. Run this script from a cron job every x minutes (depending on the interval you need), and just show the latest content from your local cache (be it a database or a file).
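
A minimal sketch of the file-based variant of this idea, assuming the scraping loop has already collected its results into an array keyed by item id. The names save_price_cache, load_price_cache and CACHE_FILE are hypothetical, chosen only for illustration, and JSON is just one possible storage format:

```php
<?php
// Hypothetical cache location; any path writable by the cron job would do.
const CACHE_FILE = __DIR__ . '/prices.json';

/**
 * Called by the background (cron) script once scraping has finished.
 * Stores the collected prices, keyed by item id, together with a timestamp.
 */
function save_price_cache(array $prices, string $path): bool
{
    $payload = json_encode(['updated' => time(), 'prices' => $prices]);
    if ($payload === false) {
        return false;
    }
    // Write to a temporary file and rename it afterwards, so that a page
    // reading the cache never sees a half-written file.
    $tmp = $path . '.tmp';
    if (file_put_contents($tmp, $payload, LOCK_EX) === false) {
        return false;
    }
    return rename($tmp, $path);
}

/**
 * Called by the user-facing page. Reads the local cache only and never
 * contacts the remote site; returns an empty array if no cache exists yet.
 */
function load_price_cache(string $path): array
{
    if (!is_file($path)) {
        return [];
    }
    $data = json_decode(file_get_contents($path), true);
    return is_array($data) && isset($data['prices']) ? $data['prices'] : [];
}
```

The background script would call save_price_cache($prices, CACHE_FILE) at the end of its run, and the page shown to users would only ever call load_price_cache(CACHE_FILE), so its response time no longer depends on the 15,000 remote requests.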
