
How to scrape information from a website that doesn't use POST

I need to get some information from a website that uses an HTML select to filter its content. However, I'm having difficulty doing so: when the value of the select changes, the website does not reload. Instead, it uses some internal JavaScript function to fetch the new content.

The webpage in question is this. When I use the Chrome developer tools to see what happens when I change the value of the select, I get a call that looks like this:

index.php?eID=dmmjobcontrol&type=discipline&uid=77&_=1535893178522

Interestingly, the uid is the id of the selected option, so the request carries the correct id. However, when I open this link directly, I just get a page saying null.
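For reference, the query string of that call can be taken apart with Python's standard library. This is only a sketch: the helper name `describe_ajax_call` is made up for illustration, and reading the `_` parameter as the millisecond timestamp that jQuery appends to defeat caching is an assumption about this site, not something the page confirms.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit


def describe_ajax_call(url):
    """Return (parameters, timestamp) for an AJAX URL like the one above.

    `describe_ajax_call` is a hypothetical helper, not part of any library.
    """
    query = parse_qs(urlsplit(url).query)
    # parse_qs returns a list per key; each key appears once here.
    params = {name: values[0] for name, values in query.items()}
    # Assumption: "_" is a cache-busting timestamp in milliseconds since the epoch.
    stamp = params.pop("_", None)
    sent_at = None
    if stamp is not None:
        sent_at = datetime.fromtimestamp(int(stamp) / 1000, tz=timezone.utc)
    return params, sent_at


params, sent_at = describe_ajax_call(
    "index.php?eID=dmmjobcontrol&type=discipline&uid=77&_=1535893178522"
)
print(params)   # the eID, type and uid parameters
print(sent_at)  # when the browser issued the request, if the assumption holds
```

If that reading is right, only `eID`, `type` and `uid` carry meaning for the server, and `uid` is the value to vary per option.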

For comparison, take a similar website, this one. When I change the select there, the request carries form data, which I can use to get the information I want.

I'm fairly new to scraping and honestly I don't understand how I can get this information. In case it's useful: I'm using Scrapy in Python to parse the information from the websites.

One solution is to use a client layer that executes both your scraping script and all the JavaScript sent by the website, simulating a real browser. I'm successfully using PhantomJS for this together with Selenium, a.k.a. the WebDriver API: https://selenium-python.readthedocs.io/getting-started.html

Note that historically Selenium was the first product to do this, hence the name of the API. In my opinion PhantomJS is better suited: it is headless by default (it doesn't run any GUI process) and faster. Both Selenium and PhantomJS implement a protocol called WebDriver, which your Python program would use.

It may sound complicated, but please just follow the Getting Started documentation cited above and check whether it suits you.
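A minimal sketch of what such a script could look like with Selenium's Python bindings. Several things here are assumptions rather than part of the setup described above: the function names, the CSS selectors passed in, and the use of headless Chrome (recent Selenium releases no longer ship a PhantomJS driver, so Chrome stands in for it).

```python
def build_headless_driver():
    """Start a headless Chrome session (assumes Chrome and Selenium 4 are installed)."""
    # Imported lazily so these helpers can be loaded even where Selenium is absent.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def texts_after_select(driver, url, select_css, option_value, result_css, timeout=10):
    """Load `url`, pick an <option> by value, and return the text of the
    elements matching `result_css` once they are present in the page."""
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import Select, WebDriverWait

    driver.get(url)
    # Changing the <select> triggers the site's own JavaScript handler,
    # exactly as if a user had picked the option in a real browser.
    select = Select(driver.find_element(By.CSS_SELECTOR, select_css))
    select.select_by_value(option_value)
    # Wait for result elements instead of sleeping for a fixed time.
    elements = WebDriverWait(driver, timeout).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, result_css))
    )
    return [element.text for element in elements]
```

It would be called along the lines of `driver = build_headless_driver()`, then `texts_after_select(driver, page_url, "select#discipline", "77", ".job-title")`, then `driver.quit()`. The URL and both selectors are placeholders that have to be read off the real page. One limitation of this sketch: if elements matching `result_css` already exist before the select changes, the wait returns immediately with the old content, so a stricter condition would be needed in that case.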

EDIT: this article also contains a simple example of the described setup: https://realpython.com/headless-selenium-testing-with-python-and-phantomjs/

Note that many articles do a similar thing for testing, so the term "scraping" is not even mentioned. Technically it's the same, though: emulating a user clicking in the browser and, at the end, reading data from specific page elements.
