I'm trying to replicate a tutorial on rvest. However, I'm already running into issues at the start. This is the code I'm using:
library(rvest)
# Specifying the url of the website to be scraped
url <- 'https://www.nytimes.com/section/politics'
# Reading the HTML code from the website and selecting the headlines
webpage <- read_html(url)
headline_data <- html_nodes(webpage,'.story-link a, .story-body a')
When I look at headline_data, I get an empty nodeset:
{xml_nodeset (0)}
But in the tutorial it returns a nodeset of length 48:
{xml_nodeset (48)}
Any reason for the discrepancy?
As mentioned in the comments, the page no longer contains any elements matching the classes you are searching for; the NYT has changed its markup since the tutorial was written, so the tutorial's selectors now match nothing.
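You can see the same behavior offline with a tiny stand-in page (the HTML below is hypothetical, just to illustrate how a stale selector yields an empty nodeset):

```r
library(rvest)

# A minimal stand-in for the NYT page
html <- minimal_html('<h2 class="headline"><a href="/a">Story A</a></h2>')

# A selector that matches nothing returns {xml_nodeset (0)}, as in the question
html_nodes(html, ".story-link a, .story-body a")

# A selector matching the actual markup returns the text
html_text(html_nodes(html, "h2.headline a"))
```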
To begin, based on the page's current tags, you can get the headlines with:
library(rvest)
library(dplyr)
url <- 'https://www.nytimes.com/section/politics'
url %>%
read_html() %>%
html_nodes("h2.css-l2vidh a") %>%
html_text()
#[1] "Trump’s Secrecy Fight Escalates as Judge Rules for Congress in Early Test"
#[2] "A Would-Be Trump Aide’s Demands: A Jet on Call, a Future Cabinet Post and More"
#[3] "He’s One of the Biggest Backers of Trump’s Push to Protect American Steel. And He’s Canadian."
#[4] "Accountants Must Turn Over Trump’s Financial Records, Lower-Court Judge Rules"
and to get the individual URLs of those headlines you could do:
url %>%
read_html() %>%
html_nodes("h2.css-l2vidh a") %>%
html_attr("href") %>%
paste0("https://www.nytimes.com", .)
#[1] "https://www.nytimes.com/2019/05/20/us/politics/mcgahn-trump-congress.html"
#[2] "https://www.nytimes.com/2019/05/20/us/politics/kris-kobach-trump.html"
#[3] "https://www.nytimes.com/2019/05/20/us/politics/hes-one-of-the-biggest-backers-of-trumps-push-to-protect-american-steel-and-hes-canadian.html"
#[4] "https://www.nytimes.com/2019/05/20/us/politics/trump-financial-records.html"
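If you want both together, you can collect headlines and links into a data frame in one pass. The sketch below uses a tiny stand-in page via `minimal_html()` so it runs without hitting the live site; the `h2.css-l2vidh` class is the one from the live page above and will change whenever the NYT updates its markup:

```r
library(rvest)
library(dplyr)

# Hypothetical stand-in for read_html('https://www.nytimes.com/section/politics')
page <- minimal_html('
  <h2 class="css-l2vidh"><a href="/2019/05/20/us/politics/a.html">Story A</a></h2>
  <h2 class="css-l2vidh"><a href="/2019/05/20/us/politics/b.html">Story B</a></h2>')

nodes <- html_nodes(page, "h2.css-l2vidh a")
tibble(
  headline = html_text(nodes),
  url      = paste0("https://www.nytimes.com", html_attr(nodes, "href"))
)
```

Extracting both attributes from the same nodeset avoids parsing the page twice and keeps each headline aligned with its URL.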