
Rvest: Headlines returning empty list

I'm trying to replicate this rvest tutorial, but I'm already running into issues at the very start. This is the code I'm using:

library(rvest)
# Specify the URL of the website to be scraped
url <- 'https://www.nytimes.com/section/politics'

#Reading the HTML code from the website - headlines
webpage <- read_html(url)
headline_data <- html_nodes(webpage,'.story-link a, .story-body a')

When I inspect headline_data, I get:

{xml_nodeset (0)}

But in the tutorial it returns a node set of length 48:

{xml_nodeset (48)}

Any reason for the discrepancy?

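As mentioned in the comments, the page currently contains no elements with the classes you are searching for (story-link and story-body), so the selector matches nothing. The site's markup has presumably changed since the tutorial was written.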
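One way to confirm this is to check how many nodes a selector matches and then list the classes that actually occur in the document. The following is a minimal sketch that uses a small inline HTML string as a hypothetical stand-in for the downloaded page, so it runs without network access; the markup and the example href are assumptions, not the real NYT source.

```r
library(rvest)

# Hypothetical stand-in for the downloaded page: the headline sits in an
# <h2> whose class is neither 'story-link' nor 'story-body'.
page <- read_html('
  <html><body>
    <h2 class="css-l2vidh"><a href="/2019/05/20/us/politics/example.html">Example headline</a></h2>
  </body></html>')

# The tutorial's selector matches nothing here, so the node set is empty.
old_nodes <- html_nodes(page, ".story-link a, .story-body a")
length(old_nodes)

# List the class attributes that do occur, to help find a selector that matches.
classes <- html_attr(html_nodes(page, "[class]"), "class")
unique(classes)
```

On the real page, the same two steps can be applied to webpage from the question instead of the inline string.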

Based on the page's current tags, you can get the headlines with:

library(rvest)
library(dplyr)
url <- 'https://www.nytimes.com/section/politics'

url %>%
  read_html() %>%
  html_nodes("h2.css-l2vidh a") %>%
  html_text()

#[1] "Trump’s Secrecy Fight Escalates as Judge Rules for Congress in Early Test"                    
#[2] "A Would-Be Trump Aide’s Demands: A Jet on Call, a Future Cabinet Post and More"               
#[3] "He’s One of the Biggest Backers of Trump’s Push to Protect American Steel. And He’s Canadian."
#[4] "Accountants Must Turn Over Trump’s Financial Records, Lower-Court Judge Rules"             
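Note that a class name such as css-l2vidh looks auto-generated and may change whenever the site is redeployed, at which point the selector above would return an empty result again. A possible alternative, sketched below, is to select links by their href pattern rather than by class. This assumes that article links sit inside an h2 and contain "/us/politics/" in their path, which matches the URLs shown further down but is not guaranteed for the live page. The inline HTML and the class css-abc123 are hypothetical.

```r
library(rvest)

# Hypothetical markup: two article links plus one unrelated navigation link.
page <- read_html('
  <html><body>
    <h2 class="css-abc123"><a href="/2019/05/20/us/politics/first.html">First headline</a></h2>
    <h2 class="css-abc123"><a href="/2019/05/20/us/politics/second.html">Second headline</a></h2>
    <a href="/section/world">World</a>
  </body></html>')

# Match links inside an <h2> whose href contains '/us/politics/',
# independent of whatever class the <h2> carries.
headlines <- html_text(html_nodes(page, "h2 a[href*='/us/politics/']"))
headlines
```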

To get the individual URLs of those headlines, you could do:

url %>%
  read_html() %>%
  html_nodes("h2.css-l2vidh a") %>%
  html_attr("href") %>%
  paste0("https://www.nytimes.com", .)

#[1] "https://www.nytimes.com/2019/05/20/us/politics/mcgahn-trump-congress.html"                                                                   
#[2] "https://www.nytimes.com/2019/05/20/us/politics/kris-kobach-trump.html"                                                                       
#[3] "https://www.nytimes.com/2019/05/20/us/politics/hes-one-of-the-biggest-backers-of-trumps-push-to-protect-american-steel-and-hes-canadian.html"
#[4] "https://www.nytimes.com/2019/05/20/us/politics/trump-financial-records.html"      
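The paste0() call above works as long as every href is a site-relative path. If some links were already absolute, it would prepend the host a second time. A slightly more defensive sketch uses xml2::url_absolute(), which resolves each href against a base URL and leaves absolute URLs unchanged. The hrefs vector below is illustrative input, not scraped output.

```r
library(xml2)

base <- "https://www.nytimes.com/section/politics"

# Illustrative input: one site-relative href and one that is already absolute.
hrefs <- c(
  "/2019/05/20/us/politics/kris-kobach-trump.html",
  "https://www.nytimes.com/2019/05/20/us/politics/trump-financial-records.html"
)

# Resolve every href against the base URL of the page it came from.
full_urls <- url_absolute(hrefs, base)
full_urls
```

In the pipeline above, this would replace the final paste0() step with url_absolute(url).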

