Web Scraping Columns from Web with R

I have this website from which I want to extract the first 4 columns, but it's not working. I'm a beginner in web scraping, so any help would be amazing:

https://projects.fivethirtyeight.com/2017-nba-predictions/

I want to extract each column: ELO, CARM-ELO, and so on.

This is what I've done so far:

url_nba <- 'https://projects.fivethirtyeight.com/2017-nba-predictions/'

webpage_nba <- read_html(url_nba)

data_nba.1 <- html_nodes(webpage_nba,'.num elo original desktop')
data_nba.2 <- html_nodes(webpage_nba,'.num elo carmelo')

After this, I would like to put the results in a data frame.

Any help?

Looking at the HTML, the table is somewhat irregular. One approach is to grab the entire table and then collect the Elo scores.

Searching for the CSS selector "table" finds three tables. Inspecting each one manually shows that table 3 is the one of interest.

library(rvest)
url_nba <- 'https://projects.fivethirtyeight.com/2017-nba-predictions/'
webpage_nba <- read_html(url_nba)

#collect the tables from the page
tables <- html_nodes(webpage_nba,'table')

#Process the table of interest (returns a list of 1)
resultdf <- tables[3] %>% html_table(fill=TRUE)
resultdf <- resultdf[[1]]

The variable "resultdf" is a data frame holding the table of interest. Because the table contains hidden fields and other non-standard content, some cleanup is needed: remove the first couple of rows, rename the columns, and remove the blank columns.
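
The cleanup steps could look roughly like the sketch below. It runs on a small made-up data frame rather than the real page, so the column layout, the number of header rows, and the values are all assumptions; adjust the row and column indices after inspecting the actual "resultdf".

```r
# Hypothetical stand-in for the raw result of html_table(); the real table's
# layout and values will differ.
raw <- data.frame(
  X1 = c("", "Elo", "1770", "1650"),
  X2 = c("", "CARM-Elo", "1850", "1700"),
  X3 = c("", "", "", ""),
  X4 = c("", "Team", "Team A", "Team B"),
  stringsAsFactors = FALSE
)

# Drop the leading rows that hold header text rather than data
clean <- raw[-(1:2), ]

# Drop columns that are entirely blank or NA
blank <- vapply(clean, function(col) all(is.na(col) | col == ""), logical(1))
clean <- clean[, !blank, drop = FALSE]

# Rename the remaining columns and convert the rating columns to numeric
names(clean) <- c("elo", "carmelo", "team")
clean$elo <- as.numeric(clean$elo)
clean$carmelo <- as.numeric(clean$carmelo)
rownames(clean) <- NULL
```

Detecting blank columns programmatically, instead of hard-coding their positions, keeps the sketch from breaking if the page adds or removes a spacer column.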

Alternatively, because each selector returns a node list of the same length, you can use CSS selectors to target whichever columns you want, cbind the extracted text, and convert the result to a data frame. The selectors below produce a clean data frame.

library(rvest)

page <- read_html('https://projects.fivethirtyeight.com/2017-nba-predictions/')

# Each selector returns one node per team, so the text vectors have equal length
df <- setNames(data.frame(cbind(
  html_text(html_nodes(page, 'td.original')),  # Elo rating
  html_text(html_nodes(page, 'td.carmelo')),   # CARM-Elo rating
  html_text(html_nodes(page, '.change')),      # one-week change
  html_text(html_nodes(page, '.team a'))       # team name
)), c('elo', 'carmelo', '1wkchange', 'team'))

print(df)
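
As a side note on why the selectors in the question return nothing: in CSS, a space is a descendant combinator, so '.num elo original desktop' searches for a <desktop> element nested inside <original> and <elo> elements, not for one cell carrying all four classes. The sketch below illustrates the difference on hypothetical markup built with rvest's minimal_html(); the class names are taken from the question's selectors, and the values are made up.

```r
library(rvest)

# Hypothetical markup: the class names mirror the selectors used in the
# question, and the values are made up for illustration.
page <- minimal_html('
  <table>
    <tr>
      <td class="num elo original desktop">1770</td>
      <td class="num elo carmelo">1850</td>
      <td class="team"><a href="#">Team A</a></td>
    </tr>
    <tr>
      <td class="num elo original desktop">1650</td>
      <td class="num elo carmelo">1700</td>
      <td class="team"><a href="#">Team B</a></td>
    </tr>
  </table>')

# Spaces are descendant combinators, so this looks for a <desktop> element
# nested inside <original> and <elo> elements below an element of class "num".
spaced <- html_nodes(page, ".num elo original desktop")

# Chained classes match a single element that carries all of them.
chained <- html_nodes(page, "td.num.elo.original.desktop")

elo <- html_text(chained)
carmelo <- html_text(html_nodes(page, "td.carmelo"))
team <- html_text(html_nodes(page, ".team a"))

# Guard against selectors that match different numbers of nodes.
stopifnot(length(elo) == length(carmelo), length(elo) == length(team))

df <- data.frame(
  elo = as.numeric(elo),
  carmelo = as.numeric(carmelo),
  team = team,
  stringsAsFactors = FALSE
)
print(df)
```

Building the data frame from named vectors, rather than cbind followed by setNames, keeps the numeric columns numeric and avoids a non-syntactic column name such as '1wkchange'.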
