Scrape website after form submit and data is loaded

Original post: 2020-07-12 11:26:26 | Tags: javascript, web-scraping, scrapy, phantomjs, cheerio

I have to scrape a website. After reviewing it, I realised that I don't need to submit any form: I already have the URLs needed to get the data. I'm using Node.js and Phantom.

The source of my problem seems to be related to the session or cookies (I think).

In my web browser I can open https://www.infosubvenciones.es/bdnstrans/GE/es/convocatorias and click the blue form button labeled "Procesar consulta". The table below it is then filled in. In the Network tab of the dev tools you can see an XHR request with a URL similar to https://www.infosubvenciones.es/bdnstrans/busqueda?type=convs&_search=false&nd=1594848133517&rows=50&page=1&sidx=4&sord=desc ; if you open it in a new tab, the data is displayed. But if you open that link in a different web browser, you get 0 results.

That's exactly what is happening to me with Node.js and Phantom, and I don't know how to fix it.

1 Solution

If you want to give Scrapy a try, https://docs.scrapy.org/en/latest/topics/dynamic-content.html explains how to deal with this type of scenario, and I would suggest reading it after completing the tutorial.

The page can also be handy if you use another scraping framework, as not much of it is Scrapy-specific, and for the Python-specific parts I'm sure there are JavaScript counterparts.

As for Cheerio and Phantom, I'm not familiar with them, but it is most likely doable with them as well.

It's doable with any web client; it's just a matter of knowing how to use the tool for this purpose. Most of the work involves using your web browser's developer tools to understand how the website works underneath.
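To illustrate the idea, here is a minimal sketch in plain Node.js without Phantom. It rests on an untested assumption: that the search endpoint returns 0 results because the request carries no session cookie, and that loading the form page first is enough to obtain one. The helper names (`toCookieHeader`, `buildSearchUrl`, `fetchConvocatorias`) are made up for this example, and it assumes Node.js 20 or later for the global `fetch` and `Headers.getSetCookie()`. If the server also expects the "Procesar consulta" request to be replayed, that step would need to be added after checking it in the Network tab.

```javascript
// Sketch only: reuse the session cookie from the form page when calling the
// XHR endpoint. Assumes Node.js 20+ and that a missing cookie is the cause.

const FORM_URL = 'https://www.infosubvenciones.es/bdnstrans/GE/es/convocatorias';
const SEARCH_URL = 'https://www.infosubvenciones.es/bdnstrans/busqueda';

// Turn a list of Set-Cookie header values into one Cookie request header,
// keeping only the leading "name=value" pair of each entry.
function toCookieHeader(setCookieValues) {
  return setCookieValues
    .map((value) => value.split(';')[0].trim())
    .filter((pair) => pair.includes('='))
    .join('; ');
}

// Rebuild the query string seen in the Network tab. "nd" looks like a
// millisecond timestamp, so the current time is used by default.
function buildSearchUrl(page = 1, rows = 50, now = Date.now()) {
  const url = new URL(SEARCH_URL);
  const params = {
    type: 'convs',
    _search: 'false',
    nd: String(now),
    rows: String(rows),
    page: String(page),
    sidx: '4',
    sord: 'desc',
  };
  for (const [key, value] of Object.entries(params)) {
    url.searchParams.set(key, value);
  }
  return url.toString();
}

// Load the form page to get a session, then request the data with the
// same cookies. Returns the raw response body as text.
async function fetchConvocatorias(page = 1) {
  const formResponse = await fetch(FORM_URL);
  const cookie = toCookieHeader(formResponse.headers.getSetCookie());

  const headers = cookie ? { Cookie: cookie } : {};
  const dataResponse = await fetch(buildSearchUrl(page), { headers });
  if (!dataResponse.ok) {
    throw new Error(`Search request failed with status ${dataResponse.status}`);
  }
  return dataResponse.text();
}
```

Calling `fetchConvocatorias(1).then(console.log)` would print the raw response body for the first page. If it still comes back empty, compare the request headers and cookies with what the browser sends in the Network tab and add whatever is missing.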