Scraping with PHP cURL and XPath, how to speed up things?

Currently I'm scraping using PHP cURL and XPath, but it is very slow.

Each website has many URLs, with many subpages that rely on JavaScript.

One website would have, say, 30 categories of products, and each category has about 70 subpages with 10 items on each.

I scrape about 150 webpages in total with the above.

One script takes one website and scrapes all the URLs from that site, one at a time. At the same time, another script is running, doing the same for a different website.

Each script takes one URL and fetches the page into a variable. That variable is then scraped using XPath, and the extracted values are stored in the DB.

Many of the pages use JavaScript with Microsoft ASP.NET ViewState, so many loops need to be executed in order to jump from page 1 to page 2, etc.
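To make the ViewState paging concrete, here is a minimal sketch of the bookkeeping each "jump" needs. It assumes the site is a classic ASP.NET WebForms page that keeps its state in hidden inputs such as `__VIEWSTATE` and pages via `__doPostBack`. The function names and the `$eventTarget` / `$eventArgument` values are hypothetical; the real ones have to be read from the pager links of the site being scraped.

```php
<?php
// Collect every hidden <input> (e.g. __VIEWSTATE, __EVENTVALIDATION) from a fetched page.
function extractHiddenFields(string $html): array
{
    // Real-world markup is rarely well-formed; collect parser warnings silently.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $fields = [];
    foreach ($xpath->query('//input[@type="hidden"][@name]') as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    return $fields;
}

// Build the POST body for the next page request.
// $eventTarget and $eventArgument are assumptions: take the actual values
// from the __doPostBack('...', '...') calls in the page's pager links.
function buildPostbackBody(array $hiddenFields, string $eventTarget, string $eventArgument): string
{
    $hiddenFields['__EVENTTARGET'] = $eventTarget;
    $hiddenFields['__EVENTARGUMENT'] = $eventArgument;
    return http_build_query($hiddenFields);
}
```

The resulting string can be passed to cURL as `CURLOPT_POSTFIELDS` for the next request. Because each page's ViewState comes from the previous response, the pages inside one category still have to be fetched in order.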

One script may run for about 2 hours to get everything from a single website.

What can be done to speed things up?

I have been thinking about doing the same as above, but storing each page locally first, and only scraping them once every page from a single website has been stored.
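A rough sketch of that two-phase idea, assuming the pages are saved as `.html` files in one directory per website. The helper names, the directory layout, and the XPath query are all hypothetical placeholders, not something taken from the existing scripts.

```php
<?php
// Phase 1: write a fetched page to disk, keyed by a hash of its URL.
function savePage(string $dir, string $url, string $html): string
{
    if (!is_dir($dir)) {
        mkdir($dir, 0777, true);
    }
    $path = $dir . '/' . sha1($url) . '.html';
    file_put_contents($path, $html);
    return $path;
}

// Phase 2: run one XPath query over every saved page and collect the text of the matches.
function extractFromSavedPages(string $dir, string $query): array
{
    libxml_use_internal_errors(true); // tolerate malformed HTML
    $results = [];
    foreach (glob($dir . '/*.html') ?: [] as $path) {
        $doc = new DOMDocument();
        $doc->loadHTMLFile($path);
        $xpath = new DOMXPath($doc);
        foreach ($xpath->query($query) as $node) {
            $results[] = trim($node->textContent);
        }
    }
    libxml_clear_errors();
    return $results;
}
```

Splitting the work this way mostly helps when the XPath or DB step needs to be re-run or debugged without hitting the site again. It would not remove the network wait itself, which is probably where most of the 2 hours goes.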

Does anyone have extensive experience with this? JavaScript/ViewState has to be taken into consideration, so I can't just wget everything first.

You can use multi-curl (the `curl_multi_*` functions) to fetch multiple pages at once. If you wanted to, you could request all 30 category pages in a single multi-curl batch. For processing each page, you can use forking (`pcntl_fork`). With those two techniques combined, the bottleneck can become your machine's CPU or network rather than the time spent waiting on one request after another.
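As an illustration of the multi-curl part, a minimal sketch might look like the following. The function name `fetchAll`, the timeout, and the chosen options are assumptions for the example; real use would likely also need cookies, POST bodies for the ViewState postbacks, and a cap on how many requests run at once.

```php
<?php
// Fetch several URLs concurrently and return their bodies keyed like the input array.
function fetchAll(array $urls, int $timeoutSeconds = 30): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // keep the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_multi_add_handle($multi, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            curl_multi_select($multi, 1.0); // wait for activity instead of busy-looping
        }
    } while ($running > 0 && $status === CURLM_OK);

    $bodies = [];
    foreach ($handles as $key => $ch) {
        // A failed transfer yields an empty or null body here;
        // curl_multi_info_read() can be used when per-request error details are needed.
        $bodies[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
    }
    return $bodies;
}
```

Calling `fetchAll($categoryUrls)` with the 30 category URLs would issue them in one batch. Pages that depend on the previous page's ViewState still have to be chained, so the parallelism is across categories rather than within one.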
