
curl multi crawling issues

We have a crawling engine serving around 500,000 visitors per month. We currently use curl to fetch the web pages. We recently switched to curl_multi_exec to crawl pages in parallel, set to fetch 20 pages simultaneously.

The problem is that while fetching, curl stops completely until all 20 pages have been fetched, and only then moves on to the next 20. If one page is slow to fetch, curl waits for that page to load before moving on to the next iteration of the loop, in which I get the next 20 pages.

Is there any other way to overcome this? I hope my question is clear.
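For context, the batch-style loop described above presumably looks roughly like the following minimal sketch ($urls and the exact curl options are illustrative placeholders, not the actual crawler code):

```php
<?php
// Hypothetical sketch of the batch loop: 20 handles are added,
// then curl_multi_exec() is polled until every one of them has
// finished before the next batch of 20 is started.
$batch = array_slice($urls, 0, 20);   // $urls: illustrative list of pages to crawl

$mh = curl_multi_init();
$handles = [];
foreach ($batch as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}

// The whole batch is driven to completion here, so one slow page
// holds up the other 19 that may already be done.
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);
} while ($running > 0);

foreach ($handles as $ch) {
    $html = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
    // ... process $html ...
}
curl_multi_close($mh);
```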

Update:

By "overcome" I mean: imagine curl is fetching 20 pages simultaneously. The ones that finish should be replaced immediately by new items to fetch, without having to wait for all 20 to complete. Is that clear?

Sure, just add a new handle with a new URL once one completes. There's no need to wait for all 20 to finish first; that's just plain inefficient.

And you can of course bump the 20 to 200 or 600 or whatever, if you'd rather do that...

See http://curl.haxx.se/libcurl/c/libcurl-multi.html for an overview of how the multi interface works at the C level. The PHP/CURL API is just a thin layer on top.

In PHP, curl_multi_exec() returns a counter of "running handles" that decreases when one or more transfers have completed. You can (and should) also call curl_multi_info_read() to find out exactly which transfer finished and what its individual return code was.
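A minimal sketch of that rolling-window approach is below: keep at most $window transfers in flight, and as soon as curl_multi_info_read() reports one finished, top the window back up with the next URL instead of waiting for the whole batch. The names $urls, $window and process_page() are illustrative, not part of any existing crawler:

```php
<?php
$window = 20;
$queue  = $urls;               // remaining URLs to crawl (illustrative)
$mh     = curl_multi_init();

$add = function ($mh, $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_multi_add_handle($mh, $ch);
};

// Fill the initial window.
foreach (array_splice($queue, 0, $window) as $url) {
    $add($mh, $url);
}

do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);

    // Harvest every transfer that has finished so far.
    while ($info = curl_multi_info_read($mh)) {
        $ch   = $info['handle'];
        $code = $info['result'];   // per-transfer CURLE_* return code
        $url  = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
        $html = curl_multi_getcontent($ch);

        // process_page() is a placeholder for whatever the crawler
        // does with the fetched content.
        // process_page($url, $code, $html);

        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);

        // Immediately replace the finished transfer with the next URL.
        if ($queue) {
            $add($mh, array_shift($queue));
            $running++;   // keep the outer loop alive for the new handle
        }
    }
} while ($running > 0);

curl_multi_close($mh);
```

With this structure a slow page only occupies one of the 20 slots; the other 19 keep cycling through the queue.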
