
R - form web scraping with rvest

First I'd like to take a moment and thank the SO community. You helped me many times in the past without my even needing to create an account.

My current problem involves web scraping with R. Not my strong point.

I would like to scrape http://www.cbs.dtu.dk/services/SignalP/

What I have tried:

    library(rvest)
    url <- "http://www.cbs.dtu.dk/services/SignalP/"
    seq <- "MTSKTCLVFFFSSLILTNFALAQDRAPHGLAYETPVAFSPSAFDFFHTQPENPDPTFNPCSESGCSPLPVAAKVQGASAKAQESDIVSISTGTRSGIEEHGVVGIIFGLAFAVMM"

    session <- rvest::html_session(url)
    form <- rvest::html_form(session)[[2]]
    form <- rvest::set_values(form, `SEQPASTE` = seq)
    form_res_cbs <- rvest::submit_form(session, form)
    #rvest prints out:
    #> Submitting with 'trunc'

    rvest::html_text(rvest::html_nodes(form_res_cbs, "head"))
    #output:
    "Configuration error"

    rvest::html_text(rvest::html_nodes(form_res_cbs, "body"))
    #output:
    "Exception:WebfaceConfigErrorPackage:Webface::service : 358Message:Unhandled parameter 'NULL' in form "

I am unsure what the unhandled parameter is. Is the problem in the submit button? I cannot seem to force:

    form_res_cbs <- rvest::submit_form(session, form, submit = "submit")
    #rvest prints out
    Error: Unknown submission name 'submit'.
    Possible values: trunc

Is the problem that the submit button's name is NULL?

    form[["fields"]][[23]]

I tried defining a fake submit button as suggested here: Submit form with no submit button in rvest, but with no luck.
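For reference, the fake-submit workaround from that linked answer amounts to appending a hand-built submit field to the form's field list. The field structure below follows older rvest internals (a named list with class `"input"`) and is an assumption; newer rvest versions represent forms differently.

```r
# Sketch of the "fake submit button" workaround, assuming the field layout
# used by older rvest versions. The stand-in `form` mimics the list returned
# by rvest::html_form(session)[[2]].
form <- list(fields = list())

fake_submit <- structure(
  list(name = "submit", type = "submit", value = "Submit"),
  class = "input"
)

# append the fabricated button so submit_form() can find a named submit field
form$fields[["submit"]] <- fake_submit
```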

I am open to solutions using rvest or RCurl/httr; I would like to avoid RSelenium.

EDIT: thanks to hrbrmstr's awesome answer I was able to build a function for this task. It is available in the package ragp: https://github.com/missuse/ragp

Well, this is doable. But it's going to require elbow grease.

This part:

library(rvest)
library(httr)
library(tidyverse)

POST(
  url = "http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi",
  encode = "form",
  body=list(
    `configfile` = "/usr/opt/www/pub/CBS/services/SignalP-4.1/SignalP.cf",
    `SEQPASTE` = "MTSKTCLVFFFSSLILTNFALAQDRAPHGLAYETPVAFSPSAFDFFHTQPENPDPTFNPCSESGCSPLPVAAKVQGASAKAQESDIVSISTGTRSGIEEHGVVGIIFGLAFAVMM",
    `orgtype` = "euk",
    `Dcut-type` = "default",
    `Dcut-noTM` = "0.45",
    `Dcut-TM` = "0.50",
    `graphmode` = "png",
    `format` = "summary",
    `minlen` = "",
    `method` = "best",
    `trunc` = ""
  ),
  verbose()
) -> res

Makes the request you made. I left verbose() in so you can watch what happens. It's missing the "filename" field, but you specified the string, so it's a good mimic of what you did.

Now, the tricky part is that the site uses an intermediary redirect page that gives you a chance to enter an e-mail address for notification when the query is done. That page also checks regularly (every ~10 s or so) whether the query has finished and redirects quickly once it has.

That page has the query id which can be extracted via:

content(res, as="parsed") %>% 
  html_nodes("input[name='jobid']") %>% 
  html_attr("value") -> jobid

Now, we can mimic the final request, but I'd add in a Sys.sleep(20) before doing so to ensure the report is done.

GET(
  url = "http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi",
  query = list(
    jobid = jobid,
    wait = "20"
  ),
  verbose()
) -> res2
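Rather than a fixed `Sys.sleep(20)`, you could poll the job page until the wait-page markers disappear. The `"queued|active"` status strings below are an assumption based on typical Webface wait pages; verify them against the actual HTML before relying on this.

```r
library(httr)

# Poll the Webface job page until the wait-page markers are gone, then
# return the final response. The status words checked for ("queued",
# "active") are assumptions; inspect the real wait page to confirm them.
poll_job <- function(jobid, interval = 10, max_tries = 30) {
  for (i in seq_len(max_tries)) {
    res <- GET(
      url = "http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi",
      query = list(jobid = jobid, wait = "20")
    )
    txt <- content(res, as = "text")
    if (!grepl("queued|active", txt, ignore.case = TRUE)) {
      return(res)  # wait page gone -> results are ready
    }
    Sys.sleep(interval)
  }
  stop("job ", jobid, " did not finish in time")
}
```

`res2 <- poll_job(jobid)` would then replace the manual sleep-and-GET above.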

That grabs the final results page:

html_print(HTML(content(res2, as="text")))

[screenshot: the rendered SignalP results page]

You can see images are missing because GET only retrieves the HTML content. You can use functions from rvest / xml2 to parse through the page and scrape out the tables and the URLs that you can then use to get new content.
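Pulling the tables and image URLs out of the results page can look like the sketch below. The inline HTML is a stand-in for `content(res2, as = "text")`, since the real page layout may differ; the selectors are assumptions to adjust after inspecting the actual results page.

```r
library(rvest)
library(xml2)

# Stand-in for the real results page; substitute content(res2, as = "text").
# The table columns and image path here are invented for illustration.
html_txt <- '<html><body>
  <table><tr><th>name</th><th>D</th></tr>
         <tr><td>seq1</td><td>0.45</td></tr></table>
  <img src="/services/SignalP/tmp/plot.png">
</body></html>'

page <- read_html(html_txt)

# result tables parsed into data frames
tables <- html_table(html_nodes(page, "table"))

# image URLs, resolved to absolute so they can be fetched with download.file()
img_urls <- url_absolute(html_attr(html_nodes(page, "img"), "src"),
                         "http://www.cbs.dtu.dk/")
```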

To do all this, I used burpsuite to intercept a browser session and then my burrp R package to inspect the results. You can also visually inspect in burpsuite and build things more manually.
