
R Web scraping coinmarketcap with rvest

I'm trying to get a table from coinmarketcap.com using the rvest package.

The approach shown below used to work, but it no longer does: the resulting table is empty. Apparently the website has changed.

Can anyone provide a solution?

Many thanks in advance!

library(rvest)
library(tidyverse)
library(xml2)

url <- "https://coinmarketcap.com/currencies/bitcoin/historical-data/"

table <- url %>%
  read_html() %>%
  html_table() %>%
  as.data.frame()

The page now loads its content dynamically, so rvest alone no longer sees the table. You need RSelenium to render the page first.

This code works for me:

library(rvest)  # provides html_table() and the %>% pipe

url <- "https://coinmarketcap.com/currencies/bitcoin/historical-data/"

# RSelenium with Firefox
rD <- RSelenium::rsDriver(browser = "firefox", port = 4546L, verbose = FALSE)
remDr <- rD[["client"]]
remDr$navigate(url)
Sys.sleep(4)  # give the page time to render its JavaScript content

# get the rendered page source and parse it
web <- remDr$getPageSource()
web <- xml2::read_html(web[[1]])

table <- html_table(web) %>%
  as.data.frame()

# close RSelenium
remDr$close()
gc()
rD$server$stop()
# Windows only: kill any leftover Java process started by the Selenium server
system("taskkill /im java.exe /f", intern = FALSE, ignore.stdout = FALSE)
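One thing to watch: html_table() on a whole document returns a list with one element per <table>, so as.data.frame() on that list may fail or merge columns when the page contains more than one table. A minimal offline sketch of picking a single table follows. The HTML and its values are made-up placeholders, not the real coinmarketcap markup, and I'm assuming a reasonably recent rvest that provides minimal_html().

```r
library(rvest)

# Placeholder HTML standing in for a rendered page source (values are invented).
page <- minimal_html('
  <table>
    <tr><th>Date</th><th>Close</th></tr>
    <tr><td>Apr 25, 2021</td><td>100.5</td></tr>
  </table>
  <table>
    <tr><th>Other</th></tr>
    <tr><td>x</td></tr>
  </table>
')

# html_table() on a document returns a list, one element per <table>.
tables <- html_table(page)

# Pick the table you want by position instead of converting the whole list.
first <- as.data.frame(tables[[1]])
print(first)
```

With the real page you would inspect length(tables) and the column names to decide which element holds the historical data.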

You don't need the overhead of a browser. You can mimic the API call the page makes and parse the JSON response.

library(jsonlite)
library(tidyverse)

# request the OHLCV history (id=1 with USD conversion) and keep the list of quotes
data <- jsonlite::read_json('https://web-api.coinmarketcap.com/v1/cryptocurrency/ohlcv/historical?id=1&convert=USD&time_start=1614297600&time_end=1619395200')$data$quotes
# flatten each quote into one data frame row
df <- map_df(data, function(x) {data.frame(x$quote)})
print(df)

# 1614297600 is Fri Feb 26 2021 00:00:00 GMT+0000, used for a start date of 2021-02-27
# 1619395200 is Mon Apr 26 2021 00:00:00 GMT+0000, used for an end date of 2021-04-25

The time_start and time_end parameters are Unix timestamps, with what looks like a one-day offset relative to the dates shown. You will need to explore how this works and whether the offset varies across bank holidays and weekends.
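To build these timestamps from calendar dates instead of hard-coding them, something like the following should work in base R. to_unix is a hypothetical helper name, and the sketch assumes the API expects midnight UTC. It does not try to correct for the apparent day offset.

```r
# Hypothetical helper: Unix timestamp for midnight UTC on a given date.
to_unix <- function(date) {
  as.numeric(as.POSIXct(date, tz = "UTC"))
}

time_start <- to_unix("2021-02-26")
time_end <- to_unix("2021-04-26")

# Same endpoint and query parameters as in the answer above.
url <- sprintf(
  "https://web-api.coinmarketcap.com/v1/cryptocurrency/ohlcv/historical?id=1&convert=USD&time_start=%.0f&time_end=%.0f",
  time_start, time_end
)
print(url)

# Going the other way: check which moment a timestamp refers to.
print(as.POSIXct(1614297600, origin = "1970-01-01", tz = "UTC"))
```

The resulting url can be passed to jsonlite::read_json() exactly as before.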
