
Does running a Flask web server preclude web scraping in Node.JS?

Original post: 2017-04-19 02:05:24 | tags: javascript, node.js, python-3.x, web-scraping

I'm interested in trying a web scraping project. The target sites use JavaScript to dynamically load and update content. Most online discussion of scraping such sites indicates that node.js, casper.js, phantom.js, and nightmare.js are all reasonably popular tools for this kind of project. Node.js seems to be used most often.

If I am running a Flask server and wish to display the results of, for example, a node.js scrape in tabular format on my site, is this possible? Will I run into compatibility issues? Or should I stick with a Python-based approach to scraping, like BS4, for the sake of consistency? I ask because node.js is described as a server, so I assume a conflict would arise if I tried to use it and Flask simultaneously.

1 Answer

If you want to write a web scraper that executes JavaScript, node.js (with something like Phantom.js) is a great choice. Another popular choice is Selenium. You would need to simulate user actions to activate event handlers. Let's call this action "scraping". BS4 would not be appropriate because it only parses the HTML it is given and cannot execute JavaScript.
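To make the BS4 point concrete, here is a minimal sketch of what a static parser sees on a JavaScript-driven page. It uses the standard library's `html.parser` as a stand-in for BS4 (an assumption made so the example needs no extra packages), and the `PAGE` markup, `CellCollector`, and `static_cells` are hypothetical names invented for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page: the table body stays empty until the script runs in a browser.
PAGE = """
<html><body>
  <table id="prices"><tbody></tbody></table>
  <script>
    document.querySelector("#prices tbody").innerHTML =
      "<tr><td>widget</td><td>9.99</td></tr>";
  </script>
</body></html>
"""


class CellCollector(HTMLParser):
    """Collects the text of every <td> element present in the static markup."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Script bodies also arrive here as plain text, but never inside a cell.
        if self.in_cell:
            self.cells[-1] += data


def static_cells(html_text):
    """Return the table cells a non-JavaScript parser can see."""
    parser = CellCollector()
    parser.feed(html_text)
    parser.close()
    return parser.cells
```

Calling `static_cells(PAGE)` returns an empty list, because the row only exists as a string inside the script. A tool that drives a real browser engine would run that script first and then see the populated table.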

Once you have your data saved to disk, displaying the results as an HTML table (let's call this action "reporting") is a separate task that needs its own tool. Flask is a suitable choice.
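A rough sketch of the reporting side might look like the following. It assumes the scraper wrote a JSON list of objects to a file; the file name `scrape_results.json`, the helpers `load_rows` and `render_table`, and the `create_app` factory are all assumptions for illustration, not anything prescribed by Flask or by the question.

```python
import json
from html import escape
from pathlib import Path

# Hypothetical location where the scraper saves its results (a JSON list of objects).
RESULTS_PATH = Path("scrape_results.json")


def load_rows(path=RESULTS_PATH):
    """Read the scraper's JSON output; return an empty list if the file is absent."""
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def render_table(rows):
    """Render a list of dicts as an HTML table, escaping every value."""
    if not rows:
        return "<p>No results yet.</p>"
    columns = list(rows[0].keys())
    head = "".join(f"<th>{escape(str(c))}</th>" for c in columns)
    body = "".join(
        "<tr>"
        + "".join(f"<td>{escape(str(row.get(c, '')))}</td>" for c in columns)
        + "</tr>"
        for row in rows
    )
    return f"<table><thead><tr>{head}</tr></thead><tbody>{body}</tbody></table>"


def create_app():
    """Build the Flask app. Flask is imported here so the helpers work without it."""
    from flask import Flask

    app = Flask(__name__)

    @app.route("/")
    def report():
        # Re-read the file on every request so new scrape results show up.
        return render_table(load_rows())

    return app
```

Calling `create_app().run()` starts Flask's development server. In a real project a Jinja template would probably replace the hand-built string, but the shape is the same: Flask only reads what the scraper left on disk and does not care which language produced it.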

Since scraping and reporting are separate concerns, no conflict would arise if you wanted to run the two simultaneously. When using Selenium or node.js as a scraper, you aren't really running a web server: the scraper acts as a client that requests pages. So it's incorrect to think of them as two web servers in possible conflict.
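One way to see that the scraper is just an ordinary program rather than a competing server: Python can launch it as a child process and read what it prints. This is only a sketch under the assumption that the scraper writes JSON to stdout. `run_scraper` and `DEMO_COMMAND` are hypothetical names, and the demo command is a stand-in Python one-liner so the example runs without Node.js installed.

```python
import json
import subprocess
import sys


def run_scraper(command, timeout=120):
    """Run a scraper as a child process and parse the JSON it prints to stdout."""
    completed = subprocess.run(
        command,
        capture_output=True,  # collect stdout/stderr instead of inheriting them
        text=True,            # decode output as str
        timeout=timeout,      # give up if the scrape hangs
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)


# Stand-in for a real scraper: a child Python process that prints JSON.
# With Node.js the command might instead be ["node", "scrape.js"] (hypothetical script).
DEMO_COMMAND = [
    sys.executable,
    "-c",
    "import json; print(json.dumps([{'title': 'example'}]))",
]
```

Here `run_scraper(DEMO_COMMAND)` returns `[{'title': 'example'}]`. No port is opened by the child process, so there is nothing for it to collide with on the Flask side.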