Cannot Web Scrape Text Box with RStudio Using rvest

I am trying to scrape the text box at the bottom of this page, under the "text" tab. I have spent a long time trying to figure out how to do so, with no luck so far. Here is my code:

library(rvest)

link <- "https://exploreuk.uky.edu/catalog/xt7t1g0hx952#page/1/mode/1up"
page <- read_html(link)
text <- page %>% html_elements("#text_frame") %>% html_text()

I used SelectorGadget to pick the selector for the text, but I only get "" as the output. Can anyone please help me with this problem?

TIA

This is dynamically rendered content, so it cannot be scraped with conventional html_elements methods: read_html does not execute JavaScript, so it only sees the page as the server sent it. Here is a way to get the JavaScript-rendered text with RSelenium:

library(wdman)
library(RSelenium)

selServ <- selenium(
  port = 4444L,
  version = 'latest',
  chromever = '103.0.5060.134' # set to a version available on your machine
)

remDr <- remoteDriver(
  remoteServerAddr = 'localhost',
  port = 4444L,
  browserName = 'chrome'
)

remDr$open()

link <- "https://exploreuk.uky.edu/catalog/xt7t1g0hx952#page/1/mode/1up"

remDr$navigate(link)

# Click the "text" tab so the page loads the text frame
text_button <- remDr$findElement("xpath", "/html/body/div[1]/section[2]/div[2]/div[3]/div[1]/ul/li[2]")
remDr$mouseMoveToLocation(webElement = text_button)
text_button$click()

# The text lives inside an iframe, so switch the driver's context into it
iframe <- remDr$findElement("xpath", "/html/body/div[1]/section[2]/div[2]/div[3]/div[3]/iframe")
remDr$switchToFrame(iframe)

# getElementText() returns a list; the text is its first element
all_text <- remDr$findElement("xpath", "html/body/pre")
all_text$getElementText()

# First few rows of the output:

# . • . .·»  ’ ' -· 7.\n4 4.,.- ——  ..\"..`....,...;’:\"·.·»~——··_,__ ·\n4 .  ..,....,·;;`,».—g;C3,:_·:r:;,;;;;;~...~.`.e;:~.g\n. ,`,:.........  · >¢# fF$?Z;;r;:: wi Ti - g; r;r;:::.\"‘·;_~··· : \nmw.- ‘  ..-- :1g?;;::::\".-gg-_;c::§;§;;;z;r:r:9i;fi1:;:;::§§;‘};·—-..-2..•··\n : . -- -—  ~.. j‘_‘&:&'i\"'”\"r:.x:‘:.  r»=r§i?:`k~<·`¥,¥?£21iT3:!2§§3:::::&_—z.;:r§;§55:€iZ;;;;?£é§5;i—.;.·;,;5;>E;;;;>§;:$*/mx\n- · ‘ ··=:¢ J ¤· T7’¥:—r:<:7Z`·€:`???:1<¤`*?Z2€f!f¤T*§::1:1¤1‘?:1:=:fFi*’£:<:f¤fEE·:;:¢F§?é:=::   .-==;;$?k::¢¢¢§;>.;- T¥7;`?§ ;§~·~,;;:;:>;5;;;:;.·:3=¤‘7.;; 1;:::\n.·g · ;;;;;::i-··-~·:;::.(·7·~·;;::--—·;1;::r:··~-1;:.·»77~¢:;:1-·—;;::»·—·gn.-r·~;;;;:,~;·;:;:..·;;;:::-qv;·;r.·:-.-»·-;;;:;....—-··
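For comparison, the empty result from the question can be reproduced offline. This is only a sketch: it assumes the server sends the text frame as an empty iframe that JavaScript fills in later, and the HTML string is a made-up stand-in, not the page's actual markup.

```r
library(rvest)

# Hypothetical stand-in for the static HTML: an iframe with no content yet.
static_page <- minimal_html('<div><iframe id="text_frame"></iframe></div>')

frame <- static_page %>% html_elements("#text_frame")

# The selector matches the element, but there is no text inside it to extract.
n_matches <- length(frame)
frame_text <- html_text(frame)
```

In this stand-in the selector finds one node and html_text() has only an empty string to return, which is the same symptom as in the question. The selector is fine; the content is simply not in the HTML that rvest receives.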

You might have to do a little extra work setting up RSelenium, such as installing a browser driver. Let me know if this works!
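If a lookup fails intermittently, one possible cause is that findElement runs before the page has finished rendering the element. A minimal sketch of a retry loop follows. wait_for_element is a hypothetical helper, not part of RSelenium, and the timeout and poll defaults are arbitrary.

```r
# Hypothetical helper (not part of RSelenium): retry findElement until it
# succeeds or the timeout (in seconds) expires.
wait_for_element <- function(driver, using, value, timeout = 10, poll = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    # findElement signals an error when nothing matches; treat that as "not yet".
    element <- tryCatch(
      suppressMessages(driver$findElement(using, value)),
      error = function(e) NULL
    )
    if (!is.null(element)) return(element)
    if (Sys.time() >= deadline) {
      stop("Timed out waiting for element: ", value)
    }
    Sys.sleep(poll)
  }
}

# Usage with the session from the answer (requires a running Selenium server):
# iframe <- wait_for_element(
#   remDr, "xpath", "/html/body/div[1]/section[2]/div[2]/div[3]/div[3]/iframe"
# )
```

Polling like this is usually more robust than a fixed Sys.sleep() because it returns as soon as the element appears.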

Here is a post that describes some of the logic for switching back to the default content frame:

What does #document mean?
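After switchToFrame, later findElement calls only see the iframe's document, so the driver has to be switched back before it can reach elements on the outer page again. The sketch below wraps both steps in one function. read_frame_text is a hypothetical helper name; it relies on RSelenium's documented behavior that switchToFrame(NULL) selects the page's default content.

```r
# Hypothetical helper: read the text of one node inside an iframe, then
# return the driver to the top-level document.
read_frame_text <- function(driver, frame_xpath, inner_xpath) {
  frame <- driver$findElement("xpath", frame_xpath)
  driver$switchToFrame(frame)
  # Always switch back to the default content, even if the lookup below fails.
  on.exit(driver$switchToFrame(NULL), add = TRUE)

  node <- driver$findElement("xpath", inner_xpath)
  # getElementText() returns a list; take its first element.
  node$getElementText()[[1]]
}

# Usage with the session from the answer (requires a running Selenium server):
# all_text <- read_frame_text(
#   remDr,
#   "/html/body/div[1]/section[2]/div[2]/div[3]/div[3]/iframe",
#   "html/body/pre"
# )
```

Using on.exit keeps the driver in a predictable state, so later lookups against the outer page do not silently run inside the iframe.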
