
curl - Scraping large amounts of content from a website

I'm curious if anyone has any recommendations as to the best method to leverage PHP/CURL (or another technology even) to download content from a website. Right now I'm using curl_multi to do 10 requests at a time, which helps some.
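For reference, a minimal sketch of the curl_multi batching described above, assuming a batch size of 10 and a placeholder `$allUrls` list (not the asker's actual code):

```php
<?php
// Fetch one batch of URLs concurrently with curl_multi.
function fetch_batch(array $urls): array
{
    $mh = curl_multi_init();
    $handles = [];

    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until every handle has finished.
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh);
        }
    } while ($active && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $url => $ch) {
        $results[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);

    return $results;
}

// Process the URL list in batches of 10 concurrent requests.
// $allUrls is a placeholder for however the 100K URLs are loaded.
foreach (array_chunk($allUrls, 10) as $batch) {
    $pages = fetch_batch($batch);
    // ... store or parse $pages here
}
```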

I literally need to request about 100K pages daily, which can get a bit tedious (takes 16 hours right now). My initial thoughts are just setting up multiple virtual machines and splitting up the task, but was wondering if there is something else I'm missing besides parallelization. (I know you can always throw more machines at the problem heh)

Thanks in advance!

It depends on what you're doing with the content, but try a queuing system.

I suggest Resque. It uses Redis to handle queues. It's designed for speed and for running multiple jobs at the same time. It also ships with resque-web, which gives you a nice hosted UI.

You could use one machine to queue up new URLs and one or more machines working through the queues, roughly as sketched below.
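A hedged sketch of that setup with php-resque; the queue name `fetch` and the `FetchPage` job class are made up for the example, and workers would be started separately on each machine with the library's worker script (e.g. `QUEUE=fetch php resque.php`):

```php
<?php
require 'vendor/autoload.php';

Resque::setBackend('localhost:6379'); // Redis connection

// Producer: one machine pushes URLs onto the 'fetch' queue.
foreach ($urls as $url) {
    Resque::enqueue('fetch', 'FetchPage', ['url' => $url]);
}

// Job class executed by workers on any number of machines,
// once per queued URL.
class FetchPage
{
    public function perform()
    {
        $ch = curl_init($this->args['url']);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $html = curl_exec($ch);
        curl_close($ch);
        // ... parse or store $html
    }
}
```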

Other options: Kestrel, RabbitMQ, Beanstalkd.

To retrieve web content you can use curl or fsockopen. A comparison of the two methods can be seen in Which is better approach between fsockopen and curl?.
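To illustrate the difference, here are two ways to fetch the same page (the URL example.com is just a placeholder): curl handles redirects, TLS, and timeouts for you, while fsockopen is a raw socket where you write the HTTP request and parse the response yourself.

```php
<?php
// curl: high-level, returns the body directly.
$ch = curl_init('http://example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$body = curl_exec($ch);
curl_close($ch);

// fsockopen: low-level, you send the request line and headers by hand.
$fp = fsockopen('example.com', 80, $errno, $errstr, 30);
if ($fp) {
    fwrite($fp, "GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n");
    $raw = '';
    while (!feof($fp)) {
        $raw .= fgets($fp, 8192);
    }
    fclose($fp);
    // $raw contains headers plus body; you must split them yourself.
}
```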
