
How to scrape id from an element in rvest?

Each div.grpl-grp clearfix (each club element) on this page has its own id:

https://uws-community.symplicity.com/index.php?s=student_group

I am trying to scrape each of these ids, but my current method, shown below, does not work. What am I doing wrong?

url <- 'https://uws-community.symplicity.com/index.php?s=student_group'
page <- html_session(url)

id_nodes <- html_nodes(page, "div.grpl-grp clearfix") %>% html_attrs("id")

I need to use html_session() because the other data I'm scraping requires a session.

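You need to make two changes to the code.

  1. Write the selector as "div.grpl-grp.clearfix". A space in a CSS selector means "descendant", so "div.grpl-grp clearfix" looks for <clearfix> elements inside div.grpl-grp and matches nothing.
  2. Use html_attr("id"), which returns the named attribute for each node. html_attrs() takes no attribute name and returns all attributes of each node.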

     library(rvest)

     url <- 'https://uws-community.symplicity.com/index.php?s=student_group'
     page <- html_session(url)

     html_nodes(page, "div.grpl-grp.clearfix") %>% html_attr("id")
     #[1] "grpl_5bf9ea61bc46eaeff075cf8043c27c92"
     #[2] "grpl_17e4ea613be85fe019efcf728fb6361d"
     #[3] "grpl_d593eb48fe26d58f616515366a1e677b"
     #[4] "grpl_5b445690da34b7cff962ee2bf254db9e"
     #[5] "grpl_cd1ebcef22852bdb5301a243803a2909"
     ....

Or, if you don't need the session and want to do everything in one chain:

url %>%
   read_html() %>%
   html_nodes("div.grpl-grp.clearfix") %>%
   html_attr("id")

#[1] "grpl_5bf9ea61bc46eaeff075cf8043c27c92" "grpl_17e4ea613be85fe019efcf728fb6361d"
#[3] "grpl_d593eb48fe26d58f616515366a1e677b" "grpl_5b445690da34b7cff962ee2bf254db9e"
#[5] "grpl_cd1ebcef22852bdb5301a243803a2909" "grpl_0a7da33f968a919ecfa06486f0787bc7"
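
To see why both changes matter without hitting the live site, here is a minimal sketch against an inline HTML snippet. The markup and the ids (grpl_a, grpl_b, skip_me) are made up for illustration and only mimic the class names from the question; the real page may be structured differently.

```r
library(rvest)

# Hypothetical stand-in for the page: two club divs carrying both classes,
# plus one unrelated div that should not be matched.
doc <- read_html('
<div class="grpl-grp clearfix" id="grpl_a"></div>
<div class="grpl-grp clearfix" id="grpl_b"></div>
<div class="other" id="skip_me"></div>')

# Original selector: the space is a descendant combinator, so this looks for
# <clearfix> elements nested inside div.grpl-grp. There are none.
n_wrong <- length(html_nodes(doc, "div.grpl-grp clearfix"))

# Corrected selector: chained classes match a div that has both classes.
club_nodes <- html_nodes(doc, "div.grpl-grp.clearfix")

# html_attr() (singular) extracts one named attribute per node.
ids <- html_attr(club_nodes, "id")

# html_attrs() (plural) takes only the nodes and returns every attribute
# of each node as a list of named character vectors.
all_attrs <- html_attrs(club_nodes)

print(n_wrong)
print(ids)
```

In this sketch the original selector matches no nodes, while the corrected one yields the two made-up ids. Note that in rvest 1.0.0 and later, html_session() and html_nodes() still work but have been renamed to session() and html_elements().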

