How to extract contents from URLs?
I am having a problem. This is what I have to do, and the code is taking extremely long to run:
There is 1 website I need to collect data from, and to do so I need my algorithm to visit over 15,000 subsections of this website (ie www.website.com/item.php?rid=$_id), where $_id will be the current iteration of a for loop.
Here are the problems:
1. The way I am currently fetching the source of each page is via file_get_contents, and, as you can imagine, it takes super long to file_get_contents of 15,000+ pages.
2. Some pages do not exist (ie www.website.com/item.php?rid=2 exists but www.website.com/item.php?rid=3 does not), so I need a method of quickly skipping over these pages before the algorithm tries to fetch their contents and wastes a bunch of time.
In short, I need a method of extracting a small portion of the page from 15,000 webpages in as quick and efficient a manner as possible.
Here is my current code.
for ($_id = 0; $_id < 15392; $_id++){
    //****************************************************** Locating page
    $_location = "http://www.website.com/item.php?rid=".$_id;
    $_headers = @get_headers($_location);
    if ($_headers === FALSE || strpos($_headers[0], "200") === FALSE) {
        continue; // page does not exist, skip it
    } // end if
    $_source = file_get_contents($_location);
    //****************************************************** Extracting price
    $_needle_initial = "<td align=\"center\" colspan=\"4\" style=\"font-weight: bold\">Current Price:";
    $_needle_terminal = "</td>";
    $_position_initial = stripos($_source, $_needle_initial);
    if ($_position_initial === FALSE) {
        continue; // needle not found on this page
    } // end if
    $_position_initial += strlen($_needle_initial);
    // search for the closing tag *after* the needle, not from the start of the page
    $_position_terminal = stripos($_source, $_needle_terminal, $_position_initial);
    $_length = $_position_terminal - $_position_initial;
    $_current_price = strip_tags(trim(substr($_source, $_position_initial, $_length)));
} // end for
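The extraction step can be pulled into a small helper, which makes the offset detail easy to verify against a sample snippet. The needles are the ones from the code above; the sample HTML below is made up purely for illustration:

```php
<?php
// Extract the text between the "Current Price:" cell and its closing </td>.
// Returns NULL when the needle is not present in the page source.
function extract_price($source) {
    $needle_initial  = "<td align=\"center\" colspan=\"4\" style=\"font-weight: bold\">Current Price:";
    $needle_terminal = "</td>";

    $start = stripos($source, $needle_initial);
    if ($start === FALSE) {
        return NULL; // page layout differs, or the page has no price
    }
    $start += strlen($needle_initial);

    // Important: search for </td> starting *after* the needle; otherwise an
    // earlier </td> in the page yields a negative length and an empty result.
    $end = stripos($source, $needle_terminal, $start);
    if ($end === FALSE) {
        return NULL;
    }
    return strip_tags(trim(substr($source, $start, $end - $start)));
}

// Hypothetical sample snippet for a quick sanity check:
$sample = "<table><tr><td>x</td>"
        . "<td align=\"center\" colspan=\"4\" style=\"font-weight: bold\">Current Price:"
        . " <b>\$12.99</b></td></tr></table>";
echo extract_price($sample); // $12.99
?>
```

Note that the sample contains an earlier `</td>` (after the `x` cell); without the third argument to stripos, that is the one the original loop would have found.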
Any help at all is greatly appreciated since I really need a solution to this!
Thank you in advance for your help!
The short of it: don't.

Longer: if you want to do this much work, you shouldn't do it on demand. Do it in the background! You can use the code you have here, or any other method you're comfortable with, but instead of showing the results to a user, save them in a database or a local file. Call that script from a cron job every x minutes (depending on the interval you need), and just show the latest content from your local cache (be it a database or a file).
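A minimal sketch of the background-scraper-plus-cache idea, assuming a local JSON file as the cache (the file name and the cron schedule are assumptions for illustration, not from the original post):

```php
<?php
// Background scraper + cache sketch.
// Assumption: the scraper runs from cron, e.g. every 15 minutes:
//   */15 * * * * php /path/to/scraper.php

const CACHE_FILE = __DIR__ . '/prices.json';

// Persist the scraped prices; LOCK_EX guards against a reader
// seeing a half-written file while cron is mid-update.
function save_cache(array $prices) {
    file_put_contents(CACHE_FILE, json_encode($prices), LOCK_EX);
}

// The user-facing page only reads the cache; it never hits the remote site.
function load_cache() {
    if (!is_file(CACHE_FILE)) {
        return array(); // cron has not produced a cache yet
    }
    $data = json_decode(file_get_contents(CACHE_FILE), TRUE);
    return is_array($data) ? $data : array();
}

// In the cron script: inside the for loop, collect
//     $prices[$_id] = $_current_price;
// and call save_cache($prices); once after the loop finishes.
?>
```

With this split, the 15,000-request loop can take as long as it needs in the background, while the page a visitor sees always responds instantly from the cache.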