
curl - Scraping large amounts of content from a website

I'm curious if anyone has any recommendations as to the best method to leverage PHP/CURL (or another technology even) to download content from a website. Right now I'm using curl_multi to do 10 requests at a time, which helps some.
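For reference, a minimal sketch of the curl_multi batching described above, assuming a batch size of 10 and a placeholder `$allUrls` list (not the asker's actual code):

```php
<?php
// Fetch one batch of URLs concurrently with curl_multi.
function fetch_batch(array $urls): array
{
    $mh = curl_multi_init();
    $handles = [];

    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until every handle has finished.
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh);
        }
    } while ($active && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $url => $ch) {
        $results[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);

    return $results;
}

// Process the URL list in batches of 10 concurrent requests.
// $allUrls is a placeholder for however the 100K URLs are loaded.
foreach (array_chunk($allUrls, 10) as $batch) {
    $pages = fetch_batch($batch);
    // ... store or parse $pages here
}
```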

I literally need to request about 100K pages daily, which can get a bit tedious (takes 16 hours right now). My initial thoughts are just setting up multiple virtual machines and splitting up the task, but was wondering if there is something else I'm missing besides parallelization. (I know you can always throw more machines at the problem heh)

Thanks in advance!

It depends on what you're doing with the content, but try a queuing system.

I suggest Resque. It uses Redis to handle queues. It's designed for speed and for running multiple jobs at the same time. It also ships with resque-web, which gives you a nice hosted UI.

You could use one machine to queue up new URLs and one or more machines working through the queues, roughly as sketched below.
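A hedged sketch of that setup with php-resque; the queue name `fetch` and the `FetchPage` job class are made up for the example, and workers would be started separately on each machine with the library's worker script (e.g. `QUEUE=fetch php resque.php`):

```php
<?php
require 'vendor/autoload.php';

Resque::setBackend('localhost:6379'); // Redis connection

// Producer: one machine pushes URLs onto the 'fetch' queue.
foreach ($urls as $url) {
    Resque::enqueue('fetch', 'FetchPage', ['url' => $url]);
}

// Job class executed by workers on any number of machines,
// once per queued URL.
class FetchPage
{
    public function perform()
    {
        $ch = curl_init($this->args['url']);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $html = curl_exec($ch);
        curl_close($ch);
        // ... parse or store $html
    }
}
```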

Other options: Kestrel, RabbitMQ, Beanstalkd.

To retrieve web content you can use curl or fsockopen. A comparison of the two methods can be seen in Which is better approach between fsockopen and curl?.
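To illustrate the difference, here are two ways to fetch the same page (the URL example.com is just a placeholder): curl handles redirects, TLS, and timeouts for you, while fsockopen is a raw socket where you write the HTTP request and parse the response yourself.

```php
<?php
// curl: high-level, returns the body directly.
$ch = curl_init('http://example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$body = curl_exec($ch);
curl_close($ch);

// fsockopen: low-level, you send the request line and headers by hand.
$fp = fsockopen('example.com', 80, $errno, $errstr, 30);
if ($fp) {
    fwrite($fp, "GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n");
    $raw = '';
    while (!feof($fp)) {
        $raw .= fgets($fp, 8192);
    }
    fclose($fp);
    // $raw contains headers plus body; you must split them yourself.
}
```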
