
curl multi crawling issues

We have a crawling engine serving around 500,000 visitors per month. We currently use curl to fetch the web pages. We recently switched to curl_multi_exec to crawl pages in parallel, set to fetch 20 pages simultaneously.

The problem is that while fetching, curl stops completely until all 20 pages have been fetched, and only then moves on to the next 20. If one page is slow to fetch, curl waits for that page to load before moving on to the next iteration of the loop, in which I get the next 20 pages.

Is there any other way to overcome this? I hope my question is clear.
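For context, the batch-style loop described above presumably looks roughly like the following minimal sketch ($urls and the exact curl options are illustrative placeholders, not the actual crawler code):

```php
<?php
// Hypothetical sketch of the batch loop: 20 handles are added,
// then curl_multi_exec() is polled until every one of them has
// finished before the next batch of 20 is started.
$batch = array_slice($urls, 0, 20);   // $urls: illustrative list of pages to crawl

$mh = curl_multi_init();
$handles = [];
foreach ($batch as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}

// The whole batch is driven to completion here, so one slow page
// holds up the other 19 that may already be done.
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);
} while ($running > 0);

foreach ($handles as $ch) {
    $html = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
    // ... process $html ...
}
curl_multi_close($mh);
```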

Update:

By "overcome" I mean: imagine curl is fetching 20 pages simultaneously. The ones that finish should be replaced immediately by new items to fetch, without having to wait for all 20 to complete. Is that clear?

Sure, just add a new handle with a new URL once one completes. There's no need to wait for all 20 to finish first; that's just plain inefficient.

And you can of course bump the 20 to 200 or 600 or whatever, if you'd rather do that...

See http://curl.haxx.se/libcurl/c/libcurl-multi.html for an overview of how the multi interface works at the C level. The PHP/CURL API is just a thin layer on top.

In PHP, curl_multi_exec() returns a counter of "running handles" that decreases when one or more transfers have completed. You can (and should) also call curl_multi_info_read() to find out exactly which transfer finished and what its individual return code was.
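A minimal sketch of that rolling-window approach is below: keep at most $window transfers in flight, and as soon as curl_multi_info_read() reports one finished, top the window back up with the next URL instead of waiting for the whole batch. The names $urls, $window and process_page() are illustrative, not part of any existing crawler:

```php
<?php
$window = 20;
$queue  = $urls;               // remaining URLs to crawl (illustrative)
$mh     = curl_multi_init();

$add = function ($mh, $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_multi_add_handle($mh, $ch);
};

// Fill the initial window.
foreach (array_splice($queue, 0, $window) as $url) {
    $add($mh, $url);
}

do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);

    // Harvest every transfer that has finished so far.
    while ($info = curl_multi_info_read($mh)) {
        $ch   = $info['handle'];
        $code = $info['result'];   // per-transfer CURLE_* return code
        $url  = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
        $html = curl_multi_getcontent($ch);

        // process_page() is a placeholder for whatever the crawler
        // does with the fetched content.
        // process_page($url, $code, $html);

        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);

        // Immediately replace the finished transfer with the next URL.
        if ($queue) {
            $add($mh, array_shift($queue));
            $running++;   // keep the outer loop alive for the new handle
        }
    }
} while ($running > 0);

curl_multi_close($mh);
```

With this structure a slow page only occupies one of the 20 slots; the other 19 keep cycling through the queue.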
