
R web scraping, download data from a web application

Is there a way to access a password-protected web site through R, like this one: https://www.npddecisionkey.com/sso/#login/applications/decisionkey ? I inspected the source code of the page, but I was not able to find the fields for the user name and password.

How did you inspect the HTML? I ask b/c this:

<input id="textfield-1022-inputEl" data-ref="inputEl" type="text" size="1" name="userName" placeholder="Username" role="textbox" aria-hidden="false" aria-disabled="false" aria-readonly="false" aria-invalid="true" aria-required="true" class="x-form-field x-form-required-field x-form-text x-form-text-field-noborder  x-form-invalid-field x-form-invalid-field-field-noborder x-form-empty-field x-form-empty-field-field-noborder" autocomplete="ON" data-componentid="textfield-1022" aria-describedby="textfield-1022-ariaErrorEl">

is the username input field and this:

<input id="textfield-1023-inputEl" data-ref="inputEl" type="password" size="1" name="password" placeholder="Password" role="textbox" aria-hidden="false" aria-disabled="false" aria-readonly="false" aria-invalid="true" aria-required="true" class="x-form-field x-form-required-field x-form-text x-form-text-field-noborder  x-form-invalid-field x-form-invalid-field-field-noborder x-form-empty-field x-form-empty-field-field-noborder" autocomplete="ON" data-componentid="textfield-1023" aria-describedby="textfield-1023-ariaErrorEl">

is the password input field and this:

<form class="x-panel x-center-layout-item x-panel-indented" style="padding: 30px 0px 0px; width: 315px; right: auto; left: 0px; top: 0px; margin: 0px; height: 373px;" method="post" role="presentation" id="auth-login-1018">

is the start of the form.

You should consider using rvest::html_session() or RSelenium with this site. The former works well if the pages have few dynamic elements, and it preserves the session cookies that are generated after the login. The latter is the better choice if the site has non-XHR dynamic elements. If you attempt the rvest solution, use rvest::submit_form() after establishing the initial session and setting the form parameters.
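The rvest route described above could be sketched roughly as follows. This is a minimal sketch, not a tested login for this site: login_with_rvest is a hypothetical helper name, and it assumes rvest 1.0.0 or later, where html_session(), set_values() and submit_form() were renamed to session(), html_form_set() and session_submit(). It also assumes the login form is present in the static HTML, which may not hold if the page builds the form with JavaScript.

```r
# Sketch only: log in with rvest and keep the session cookies.
# Assumption: rvest >= 1.0.0 is installed and the <form> exists in the static HTML.
login_with_rvest <- function(login_url, username, password) {
  # Open a session; cookies set by the server are kept on this object.
  sess <- rvest::session(login_url)

  # Collect every <form> that is visible without running JavaScript.
  forms <- rvest::html_form(sess)
  if (length(forms) == 0) {
    stop("No <form> found in the static HTML; the page is probably rendered by JavaScript.")
  }

  # Field names come from the name= attributes of the <input> tags quoted above.
  filled <- rvest::html_form_set(forms[[1]], userName = username, password = password)

  # Submit the form; the returned session carries the post-login cookies.
  rvest::session_submit(sess, filled)
}
```

If the call stops with the "No <form> found" message, that is a strong hint that a browser-driving tool is needed instead. Otherwise the returned session can be passed to rvest::session_jump_to() to request the protected pages.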

The verbose markup on the vast majority of tags leads me to believe they are using a JavaScript framework or two to render the page dynamically, which may mean you will be forced to use RSelenium.
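If the form really is rendered by JavaScript, an RSelenium version might look something like the sketch below. login_with_rselenium is a hypothetical helper name, and the sketch assumes Firefox plus a working Selenium setup that RSelenium::rsDriver() can start. The fixed Sys.sleep() wait is a crude placeholder, and submitting with the Enter key is an assumption, since the markup for the submit button is not shown here.

```r
# Sketch only: drive a real browser so the JavaScript-rendered form exists.
# Assumption: Firefox and the Selenium tooling used by rsDriver() are available.
login_with_rselenium <- function(login_url, username, password, wait_secs = 5) {
  driver <- RSelenium::rsDriver(browser = "firefox", verbose = FALSE)
  remote <- driver$client

  remote$navigate(login_url)
  Sys.sleep(wait_secs)  # crude wait for the framework to build the form

  # Locate the fields by the name= attributes of the <input> tags quoted above.
  user_box <- remote$findElement(using = "css selector", value = "input[name='userName']")
  user_box$sendKeysToElement(list(username))

  pass_box <- remote$findElement(using = "css selector", value = "input[name='password']")
  # Assumption: pressing Enter in the password field submits the form.
  pass_box$sendKeysToElement(list(password, key = "enter"))
  Sys.sleep(wait_secs)

  # Return the driver (so the caller can close it) and the rendered HTML.
  list(driver = driver, html = remote$getPageSource()[[1]])
}
```

The returned html string can be parsed with rvest::read_html(). When finished, call driver$client$close() and driver$server$stop() on the returned driver to shut the browser and the Selenium server down.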

Here is an approach based on Internet Explorer (Windows only, via the RDCOMClient package):

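library(RDCOMClient)

# Start Internet Explorer through COM and open the login page.
url <- "https://www.npddecisionkey.com/sso/#login/applications/decisionkey"
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
Sys.sleep(5)  # give the page time to load and render
doc <- IEApp$Document()

# Build a synthetic click event to dispatch on the button.
clickEvent <- doc$createEvent("MouseEvent")
clickEvent$initEvent("click", TRUE, FALSE)

# Fill in the user name; adjust the selector if the page uses a different id.
web_Obj1 <- doc$querySelector("#rawUserInput")
web_Obj1[["Value"]] <- "myusername"

# Click the "continue" button.
web_Obj2 <- doc$querySelector("#continue")
web_Obj2$dispatchEvent(clickEvent)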
