
Web scraping an embedded table using R

I am currently working on a project to scrape the content of the Performance Characteristics table on this website:

https://www.ishares.com/uk/individual/en/products/251795/ishares-ftse-100-ucits-etf-inc-fund

The value I want to extract from this table is the 12m trailing yield of 3.43%.

The code I wrote to do this is:

library(rvest)

url <- "https://www.ishares.com/uk/individual/en/products/251795/ishares-ftse-100-ucits-etf-inc-fund"
etf_Data <- url %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="fundamentalsAndRisk"]/div') %>%
  html_table()
etf_Data <- etf_Data[[1]]

This returned an empty list, and indexing it gave the error 'Error in etf_Data[[1]]: subscript out of bounds'.

Using Chrome's Inspect tool, I have tried various XPaths, including reading the value with html_text:

url <- "https://www.ishares.com/uk/individual/en/products/251795/ishares-ftse-100-ucits-etf-inc-fund"
etf_Data <- url %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="fundamentalsAndRisk"]/div/div[4]/span[2]') %>%
  html_text()
etf_Data <- etf_Data[[1]]

However, this also had no success.

Having gone through other Stack Overflow responses, I have not been able to solve my issue.

Would someone be able to assist?

Thank you C

A couple of things:

  1. The content you want is served from a different URI, which you only reach after manually accepting certain conditions on the page.
  2. The data you want is not within a table
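A quick way to see why the original attempt failed is to count how many nodes the XPath matches before indexing the result. The sketch below uses a made-up HTML string as a stand-in for the page returned before the conditions are accepted (the real markup is an assumption here and will differ); the point is only that an empty node set leads to an empty list, which is what triggers 'subscript out of bounds'.

```r
library(rvest)

# Hypothetical stand-in for the pre-acceptance page: the
# #fundamentalsAndRisk container is assumed to be absent from it.
landing <- read_html(
  '<html><body><div id="siteEntry">Please accept the terms</div></body></html>'
)

# Same XPath as in the question
nodes <- html_elements(landing, xpath = '//*[@id="fundamentalsAndRisk"]/div')
n_matched <- length(nodes)

# html_table() on an empty node set returns an empty list,
# so a later [[1]] would fail with "subscript out of bounds"
tables <- html_table(nodes)

if (n_matched == 0) {
  message("Selector matched nothing; check the URL and the downloaded HTML.")
}
```

Running the same length check against the real URL should show whether the HTML that read_html() downloads actually contains the element you see in the browser.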

You can add a query string with the siteEntryPassthrough parameter set to true to reach the right URI, and then use the :contains pseudo-class with an adjacent sibling combinator (+) to get the desired value:

library(rvest)
library(magrittr)

url <- "https://www.ishares.com/uk/individual/en/products/251795/ishares-ftse-100-ucits-etf-inc-fund?switchLocale=y&siteEntryPassthrough=true"
trailing_12m_yield <- url %>%
  read_html() %>%
  html_element('.caption:contains("12m Trailing Yield") + .data') %>%
  html_text2()
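To see how the selector works without hitting the network, here is a minimal sketch run against a small inline HTML fragment. The markup is hypothetical, loosely modelled on a caption/value pair with the class names used above, so treat it as an illustration of the technique rather than a copy of the real page. It also converts the extracted text to a number, which is an optional extra step.

```r
library(rvest)

# Hypothetical markup: each caption span is immediately followed by its data span
page <- read_html('
  <div id="fundamentalsAndRisk">
    <div><span class="caption">Number of Holdings</span><span class="data">100</span></div>
    <div><span class="caption">12m Trailing Yield</span><span class="data">3.43%</span></div>
  </div>')

# :contains() picks the caption by its text; "+" then selects the
# element that immediately follows it, provided it has class "data"
yield_text <- page %>%
  html_element('.caption:contains("12m Trailing Yield") + .data') %>%
  html_text2()

# Optional: strip the percent sign and convert to a number
yield_pct <- as.numeric(sub("%", "", yield_text, fixed = TRUE))
```

Matching on the caption text rather than on a positional XPath such as div[4]/span[2] should keep working if rows are added or reordered, though it will break if the caption wording changes.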

