
curl multi crawling issues

We have a crawling engine serving around 500,000 visitors per month. We currently use curl to fetch the web pages. We recently switched to curl's multi interface (curl_multi_exec) to crawl pages simultaneously, set to fetch about 20 pages at a time.

Now, while fetching, curl stops completely until all 20 pages have been retrieved, and only then moves on to the next 20. So if one page is slow to fetch, curl waits for that page to load before moving on to the next loop, in which I get the next 20 pages.

Is there any way to overcome this? I hope my question is clear.

Later:

By overcoming I mean: imagine curl is fetching 20 pages simultaneously. Could the ones that finish be instantly replaced by new items to fetch, without having to wait for all 20 to finish? Is that clear?

Sure, just add a new handle with a new URL once one is complete. There's no need to wait for all 20 to complete first; that's just plain inefficient.

And you can of course bump the 20 to 200 or 600 or whatever, if you'd rather do that...

See http://curl.haxx.se/libcurl/c/libcurl-multi.html for an overview of how the multi interface works at the C level. The PHP/CURL API is just a thin layer on top.

In PHP, curl_multi_exec() returns a count of "running handles" that decreases when one or more transfers have completed. You can (and should) also call curl_multi_info_read() to find out exactly which transfer finished and what its individual return code was.
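Putting those two calls together, a "rolling window" fetcher along these lines keeps a fixed number of transfers in flight and replaces each handle the moment it finishes, rather than draining the whole batch. This is only a sketch under assumptions: the function name crawl_rolling, the window size, and the curl options are illustrative, and it targets PHP 8 (where curl handles are objects, so spl_object_id() can key them).

```php
<?php
// Sketch of a rolling-window crawler using PHP's curl_multi API.
// Assumes PHP 8+ (curl_init() returns a CurlHandle object).
function crawl_rolling(array $urls, int $window = 20): array
{
    $results = [];
    $queue   = $urls;   // URLs not yet started
    $active  = [];      // spl_object_id($ch) => URL currently in flight
    $mh      = curl_multi_init();

    $add = function () use (&$queue, &$active, $mh) {
        $url = array_shift($queue);
        $ch  = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $active[spl_object_id($ch)] = $url;
    };

    // Fill the initial window.
    while (count($active) < $window && $queue) {
        $add();
    }

    do {
        curl_multi_exec($mh, $running);
        if ($running && curl_multi_select($mh) === -1) {
            usleep(1000); // select failed; back off briefly
        }

        // Drain finished transfers and immediately top the window back up.
        while ($info = curl_multi_info_read($mh)) {
            $ch  = $info['handle'];
            $url = $active[spl_object_id($ch)];
            $results[$url] = [
                'code' => $info['result'],            // a CURLE_* result code
                'body' => curl_multi_getcontent($ch), // fetched content
            ];
            unset($active[spl_object_id($ch)]);
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
            if ($queue) {
                $add(); // replace the finished handle right away
            }
        }
    } while ($running || $active);

    curl_multi_close($mh);
    return $results;
}
```

With this shape, a slow page only occupies one slot in the window; the other slots keep turning over new URLs.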

