
Scale web scraping site with node.js

I'm developing a web scraping website to find available delivery restaurants. The website searches the most popular delivery portals and shows the results aggregated on a single page.

The site is hosted on Heroku with 4 dynos.

http://deliveria.net/#05409-002

When a user makes a request on the website, the app makes around 30 HTTP requests to retrieve the result.

The problem is performance: the requests aren't fast, and each search can make 30 of them, which locks the app while a search is being performed for a single user.

I tried increasing the number of Heroku dynos:

 heroku scale web=10

But I didn't notice any perceptible gain.

What is the best approach to scale this kind of application?

(I can't use a cache, as the searches need to be in real time.)

Current stack:

  • Heroku
  • Node.js
  • express
  • request module
  • EJS
  • Pusher
  • Redis

The important thing here is to have workers, because you must avoid blocking the event loop in your main app.

Try to distribute the 30 HTTP requests among the available workers. Maybe Kue can help you with this (you push new jobs onto the queue and the workers execute them one by one). So, for example, if you have 10 dynos on Heroku, use 9 of them as workers that perform those 30 HTTP searches.
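The queue-and-workers idea can be sketched in a single process using only Node.js built-ins. This is a minimal illustration under stated assumptions, not a Kue setup: `fetchPortal` and `runJobs` are hypothetical names, the portal request is simulated with a timer, and the "workers" are concurrent async loops rather than separate dynos.

```javascript
// Hypothetical stand-in for one portal request. A real app would issue the
// HTTP call here (for example with the request module); the timer only
// simulates network latency.
function fetchPortal(portalId) {
  return new Promise((resolve) => {
    setTimeout(() => resolve({ portalId, restaurants: [`restaurant-${portalId}`] }), 5);
  });
}

// Drain a shared queue with a fixed number of workers. Each worker takes the
// next job as soon as it finishes its current one.
async function runJobs(jobs, workerCount, handler) {
  const queue = jobs.slice(); // copy so the caller's array is left untouched
  const results = [];

  async function worker() {
    while (queue.length > 0) {
      const job = queue.shift();
      try {
        results.push({ job, value: await handler(job) });
      } catch (err) {
        // Record the failure and keep going so one broken portal does not
        // abort the whole search.
        results.push({ job, error: err.message });
      }
    }
  }

  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Usage: 30 portal lookups shared among 9 workers.
const portalIds = Array.from({ length: 30 }, (_, i) => i + 1);
runJobs(portalIds, 9, fetchPortal).then((results) => {
  console.log(results.length); // 30
});
```

With a real job queue such as Kue, the shared array would presumably be replaced by a Redis-backed queue and each worker loop by a separate worker process, but the flow of pushing jobs and letting workers pull them stays the same.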

From the user's point of view, it's important that the application reacts quickly to their search (and doesn't give the impression of a 'freeze'), so you may want to update them as soon as you have preliminary results (for example, when 10 of the 30 pages have been searched). You could do that via WebSockets (Socket.IO) and even show a nice graphical progress bar or something similar.
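One way to picture the progressive updates is an emitter that reports each portal result as soon as it arrives, together with a progress count. The sketch below uses only Node's built-in `EventEmitter`; `searchAll`, `fakeFetch`, and the event names `partial` and `complete` are hypothetical, and the request is again simulated with a timer.

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical stand-in for one portal request, simulated with a timer.
function fakeFetch(portalId) {
  return new Promise((resolve) => {
    setTimeout(() => resolve({ portalId, restaurants: [] }), 5);
  });
}

// Start every lookup and emit a 'partial' event as each one finishes, then a
// single 'complete' event once all of them are done.
function searchAll(portalIds, fetchOne) {
  const emitter = new EventEmitter();
  const total = portalIds.length;
  let done = 0;

  const pending = portalIds.map(async (id) => {
    let payload;
    try {
      payload = { result: await fetchOne(id) };
    } catch (err) {
      // A failed portal still counts toward progress.
      payload = { portalId: id, error: err.message };
    }
    done += 1;
    emitter.emit('partial', { ...payload, done, total });
  });

  Promise.all(pending).then(() => emitter.emit('complete', done));
  return emitter;
}

// Usage: log progress as results come in.
const search = searchAll([1, 2, 3], fakeFetch);
search.on('partial', ({ done, total }) => console.log(`${done}/${total} portals searched`));
search.on('complete', (count) => console.log(`finished ${count} portals`));
```

In the real app, the `partial` listener would forward each payload to the browser, for instance through Pusher (already in the stack) or Socket.IO, and the `done`/`total` pair is enough to drive a progress bar.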
