
How to scrape information from a website that doesn't use POST

I need to get some information from a website that uses an HTML select to filter its content. However, I'm having difficulties doing so, because when the value of the select changes, the website does not 'reload'; it uses some internal function to fetch the new content.

The webpage in question is this, and if I use the Chrome developer tools to see what happens when I change the value of the select, I see a request that looks like this:

index.php?eID=dmmjobcontrol&type=discipline&uid=77&_=1535893178522

Interestingly, the uid is the id of the selected option, so we are getting the correct id. However, when I go to this link directly, I just get a page saying null.
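One common reason an AJAX endpoint returns null when opened directly is that the site's JavaScript sends extra request headers (or a session cookie) that the browser address bar does not. A minimal sketch with Python's standard library, reproducing the call for one option uid — the base URL is a placeholder and the header checks are guesses, not confirmed behavior of this site:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical base URL; substitute the real site's address.
BASE = "https://example.com/index.php"

def build_ajax_request(uid: int) -> Request:
    """Build the request the page's JavaScript makes for one select option."""
    params = urlencode({"eID": "dmmjobcontrol", "type": "discipline", "uid": uid})
    return Request(
        f"{BASE}?{params}",
        headers={
            # Many endpoints distinguish AJAX calls by this header (a guess here).
            "X-Requested-With": "XMLHttpRequest",
            # Some also require a Referer matching the page the call came from.
            "Referer": BASE,
        },
    )

req = build_ajax_request(77)
print(req.full_url)
# urllib.request.urlopen(req) would perform the actual request.
```

If this still returns null, the next thing to check in the Network tab is whether a cookie set by the main page is required.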

Taking a similar website into account, this one: when I change the select form there, I get form data which I can use to get the information I want.
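When the site does send form data (a POST), the request can be replicated directly without a browser. A sketch with the standard library — the endpoint URL and field name below are hypothetical; the real ones should be copied from the "Form Data" section in Chrome's Network tab:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and field name for illustration only.
url = "https://example.com/jobs/filter"
form = {"discipline": "77"}

req = Request(url, data=urlencode(form).encode("utf-8"), method="POST")
req.add_header("Content-Type", "application/x-www-form-urlencoded")

print(req.get_method())
print(req.data)
```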

I'm fairly new to scraping and honestly I don't understand how I can get this information. If it's of any use, I'm using Scrapy in Python to parse the information from the websites.

One solution is to use a client layer which executes both your scraping "script" and all the JavaScript sent by the website, simulating a real browser. I'm successfully using PhantomJS for this together with Selenium, a.k.a. the WebDriver API: https://selenium-python.readthedocs.io/getting-started.html

Note that historically Selenium was the first product doing this, hence the name of the API. In my opinion PhantomJS is better suited: it is headless by default (it doesn't run any GUI process) and faster. Both Selenium and PhantomJS implement a protocol called WebDriver, which your Python program would use.
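For the page in question, that flow might look like the sketch below. The URL and the select's name attribute are placeholders; note also that newer Selenium releases have dropped PhantomJS support, so headless Firefox (or Chrome) is the usual substitute today:

```python
def scrape_filtered_content(url: str, option_value: str) -> str:
    """Open the page, pick an option from the filter select, and return
    the updated page source. The select name 'discipline' is an assumption."""
    # Imported inside the function so this module loads even where
    # Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import Select, WebDriverWait

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")  # no GUI process, like PhantomJS
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        select = Select(driver.find_element(By.NAME, "discipline"))
        select.select_by_value(option_value)  # triggers the site's own JS
        # Wait for the page to settle after the AJAX update.
        WebDriverWait(driver, 10).until(
            lambda d: d.execute_script("return document.readyState") == "complete"
        )
        return driver.page_source
    finally:
        driver.quit()
```

The returned HTML can then be handed to Scrapy's Selector (or whatever parser you already use) to extract the filtered entries.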

It may sound complicated, but please just use the Getting Started documentation cited above and check whether it suits you.

EDIT: this article also contains a simple example of the described setup: https://realpython.com/headless-selenium-testing-with-python-and-phantomjs/

Note that in many articles people do a similar thing for testing, so the term "scraping" is not even mentioned, but technically it's the same: emulating a user clicking in the browser and, at the end, getting data from specific page elements.
