
How to scrape multiple pages of search results using R

I'd like to scrape a website that lists all the qualifications in South Africa (http://allqs.saqa.org.za/search.php).

When you first go to the link you will note that it's a page with search criteria. I want to scrape all the results, so I don't enter anything in the search criteria - I just click "GO", which returns the search results I want to scrape. The results are displayed 20 records per page, and there are 16521 pages of results. At this stage the URL is still the same as above.

Is it possible to scrape these results? From the online searching I've been doing, I've found solutions for cases where the search criteria are defined in the URL. However, for the site I want to scrape this is not an option.

Ideally I'd like to use R to do the scraping, but I'm open to other suggestions if it's not possible in R.

Many thanks, Ria

R's curl package supports the POST method. The following code should get you started:

library(curl)

# Create a handle and attach the form fields as the POST body
h <- new_handle()
handle_setopt(h, copypostfields = "cat=qual&GO=Go")

# Submit the form and print the raw HTML of the first results page
req <- curl_fetch_memory("http://allqs.saqa.org.za/search.php", handle = h)
cat(rawToChar(req$content))

Note that this just spits out the content of the web page after submitting the form. Parsing the data into a data frame is left as an exercise. Type ??curl in R to see a tutorial.
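To go one step further, here is a minimal sketch of the two pieces the answer leaves open: parsing a results page and looping over pages. It assumes the results come back in an HTML table and uses the rvest package for parsing (not part of the original answer); the pagination field name in the loop is purely hypothetical - confirm the site's real form fields with your browser's network inspector before relying on it.

library(curl)
library(rvest)  # assumed here for HTML parsing; not in the original answer

# POST the given form body to search.php and parse the response as HTML
fetch_page <- function(fields) {
  h <- new_handle()
  handle_setopt(h, copypostfields = fields)
  req <- curl_fetch_memory("http://allqs.saqa.org.za/search.php", handle = h)
  read_html(rawToChar(req$content))
}

# First results page: extract every HTML table, then inspect which one
# holds the results grid (assumption: the results live in a <table>)
page1 <- fetch_page("cat=qual&GO=Go")
tables <- html_table(html_nodes(page1, "table"))
str(tables)

# Paging (hypothetical): IF the form accepts a record offset - the
# field name "start" below is a guess - you could step through pages
# 20 records at a time:
# pages <- lapply(seq(0, 80, by = 20), function(off)
#   fetch_page(sprintf("cat=qual&GO=Go&start=%d", off)))

With 16521 pages, you would also want to add a polite delay (e.g. Sys.sleep(1)) between requests and cache each page to disk so a failure partway through doesn't force a full re-run.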
