
web scraping - No records found

I'm trying to rbind a series of HTML tables (from different pages with the same column names), but some pages show "no records". I want to skip such pages or assign NULL to the data frame.

Example Dataframe 1

url="http://stats.espncricinfo.com/ci/engine/player/28081.html?class=2;filter=advanced;floodlit=1;innings_number=1;orderby=start;result=1;template=results;type=batting;view=match"

Batting=readHTMLTable(url)

Batting$"Match by match list"

Batting<-Batting$"Match by match list"

Dataframe 2

url="http://stats.espncricinfo.com/ci/engine/player/625383.html?class=2;filter=advanced;floodlit=1;innings_number=1;orderby=start;result=2;template=results;type=batting;view=match"

Batting=readHTMLTable(url)

Batting$"Match by match list"

Batting<-Batting$"Match by match list"

There are several such data frames: some have records in tabular form and some have no records.

When I rbind them, the one with no records causes an error in the final data frame.

final_DF<-rbind(Dataframe1,Dataframe2)

How do I resolve this?

PS: For each URL query I also add a set of columns (say 5 additional columns, using cbind) to the data frame, based on my requirements.

You can do the following:

library(rvest)      # read_html, html_node, html_table
library(tidyverse)  # purrr's map functions, tibble, %>%

urls <- c(
  "http://stats.espncricinfo.com/ci/engine/player/28081.html?class=2;filter=advanced;floodlit=1;innings_number=1;orderby=start;result=1;template=results;type=batting;view=match",
  "http://stats.espncricinfo.com/ci/engine/player/625383.html?class=2;filter=advanced;floodlit=1;innings_number=1;orderby=start;result=2;template=results;type=batting;view=match"
)

extra_cols <- list(
  tibble("Team"="IND","Player"="MS.Dhoni","won"=1,"lost"=0,"D"=1,"D/N"=0,"innings"=1,"Format"="ODI"),
  tibble("Team"="IND","Player"="MS.Dhoni","won"=1,"lost"=0,"D"=1,"D/N"=0,"innings"=1,"Format"="ODI")
)

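# Read each page and pick the node that should hold the results table
doc <- map(urls, read_html) %>%
  map(html_node, ".engineTable:nth-child(5)")

# TRUE where the selector matched, FALSE for pages with no records
keep <- map_lgl(doc, ~ !inherits(., "xml_missing"))

# Parse the kept tables, attach the extra columns, and row-bind the results
map(doc[keep], html_table, fill = TRUE) %>%
  map2_df(extra_cols[keep], cbind)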

The critical part is the keep vector, which is used to drop all list elements of class "xml_missing", i.e. the empty ones.
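To see what that filter reacts to, here is a minimal offline sketch. The two inline HTML snippets are made up for illustration and are not the real ESPNcricinfo markup, and the shorter selector `.engineTable` is an assumption chosen to keep the example small. It relies on `html_node()` returning an object of class "xml_missing" when nothing matches.

```r
library(rvest)

# Hypothetical pages: one with a results table, one without any records.
page_with_table <- read_html(
  "<html><body><table class='engineTable'>
     <tr><th>Bat1</th><th>Runs</th></tr>
     <tr><td>148</td><td>148</td></tr>
   </table></body></html>"
)
page_without_table <- read_html(
  "<html><body><p>No records available</p></body></html>"
)

# Look for the table node on each page.
nodes <- lapply(list(page_with_table, page_without_table),
                html_node, ".engineTable")

# TRUE where a table node was found, FALSE for the empty page.
keep <- !vapply(nodes, inherits, logical(1), what = "xml_missing")
keep
```

Indexing with `nodes[keep]` then leaves only the pages that can safely be passed to `html_table()`.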

In comparison to your code, I use a CSS selector to specify the html_node that should contain the table. See http://selectorgadget.com/

Also, your rbind is done internally by map2_df (the last line).

This results in the following (using %>% {head(.[,c("Bat1", "Runs", "Team")])}):

  Bat1 Runs Team
1    0    0  IND
2    3    3  IND
3  148  148  IND
4   56   56  IND
5   38   38  IND
6   20   20  IND
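If you prefer base R, the same skip, cbind, and rbind steps can be sketched without purrr. The data frames below are invented stand-ins for scraped tables, with NULL marking a page that had no records. The names `tables` and `extras` are hypothetical, and none of the values come from the real site.

```r
# Invented stand-ins for per-page results; NULL marks a page with no records.
tables <- list(
  data.frame(Bat1 = c(0, 3), Runs = c(0, 3)),
  NULL,
  data.frame(Bat1 = 148, Runs = 148)
)

# One row of extra columns per page, in the same order as `tables`.
extras <- list(
  data.frame(Team = "IND", Format = "ODI"),
  data.frame(Team = "IND", Format = "ODI"),
  data.frame(Team = "IND", Format = "ODI")
)

# Flag the pages that returned a table; the rest are dropped from both lists.
keep <- !vapply(tables, is.null, logical(1))

# cbind recycles the single row of extra columns across each table's rows.
combined <- Map(cbind, tables[keep], extras[keep])

# Stack the per-page data frames into one.
final_DF <- do.call(rbind, combined)
final_DF
```

Filtering both lists with the same `keep` vector keeps each table aligned with its own extra columns, which is the same idea as `extra_cols[keep]` above.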
