
Web Scraping Image URL for a series of events in ESPN Play-By-Play

I am trying to build a play-by-play dataset by scraping ESPN. I have figured out most of it, but I cannot tell which team each event belongs to, because ESPN only encodes this as an image. The best approach I have come up with is to get the URL of the logo for each entry and compare it to the URL of each team's logo at the top of the page. However, I have been unable to figure out how to get an attribute such as the URL from the image.

I am doing this in R with the rvest package. The URL I am scraping is https://www.espn.com/mens-college-basketball/playbyplay?gameId=400587906, and I am finding the CSS selectors with the SelectorGadget Chrome extension. I have also tried matching the player's name against the box score, which lists all of the players, but each team has a player with the last name Jones, so I would prefer to get the team from the image, as this will always be right.

library(rvest)
url <- "https://www.espn.com/mens-college-basketball/playbyplay?gameId=400587906"
webpage <- read_html(url)


# have been able to successfully scrape game_details and score
game_details_html <- html_nodes(webpage,'.game-details')
game_details <- html_text(game_details_html) %>% as.character()

score_html <- html_nodes(webpage,'.combined-score')
score <- html_text(score_html)


# have not been able to scrape image    
ImgNode <- html_nodes(webpage, css = "#gp-quarter-1 .team-logo")
link <- html_attr(ImgNode, "src")

For each event, I want it to be labeled "Duke" or "Wake Forest".

Is there a way to generate the URL for each image? Any help would be greatly appreciated.

"https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/150.png&h=100&w=100"
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/154.png&h=100&w=100"

Your code returns these. 500/150 is Duke and 500/154 is Wake Forest. You can create a small reference data frame that maps each URL to a team name and then join it to the scraped links.

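# one row per scraped logo; the column is named "link"
link_df <- as.data.frame(link)

# reference table mapping each logo URL to its team
link_ref_df <- data.frame(link = c("https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/150.png&h=100&w=100", "https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/154.png&h=100&w=100"),
                             team_name = c("Duke", "Wake Forest"))

# left join: keep every scraped link and attach its team name
link_merged <- merge(link_df, 
                    link_ref_df,
                    by = 'link',
                    all.x = TRUE)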
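
One thing to watch: merge() sorts the result by the key column by default, so the rows no longer follow play order. Here is a minimal sketch of an order-preserving lookup with match(). The sample link vector below is made up for illustration and stands in for the one returned by html_attr().

```r
# Illustrative stand-in for the scraped vector `link` (order = play order)
duke_url <- "https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/150.png&h=100&w=100"
wake_url <- "https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/154.png&h=100&w=100"
link <- c(wake_url, duke_url, duke_url, wake_url)

# Same reference table as above
link_ref_df <- data.frame(
  link = c(duke_url, wake_url),
  team_name = c("Duke", "Wake Forest"),
  stringsAsFactors = FALSE
)

# match() gives, for each scraped URL, its row in the reference table,
# so the result has one team name per event in the original order
team_name <- link_ref_df$team_name[match(link, link_ref_df$link)]
team_name
```

Any URL that is missing from the reference table comes back as NA, which makes unmapped logos easy to spot.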

This does not scale if you are doing hundreds of games with other teams, because every logo URL has to be mapped by hand, but it works for this specific game.
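
To cut down on the hand-typed URLs, one option is to pull the numeric team id out of each logo URL and look it up in a table keyed by id. This is only a sketch: it assumes the URLs keep the ".../500/<id>.png" pattern seen above, and team_lookup is a hypothetical table you would have to build or source yourself.

```r
# Illustrative stand-in for the vector returned by html_attr(ImgNode, "src")
link <- c(
  "https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/150.png&h=100&w=100",
  "https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/154.png&h=100&w=100",
  "https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/150.png&h=100&w=100"
)

# Capture the digits between the last "/" and ".png"
team_id <- sub(".*/([0-9]+)\\.png.*", "\\1", link)

# Hypothetical id-to-name table; only these two ids come from the answer above
team_lookup <- c("150" = "Duke", "154" = "Wake Forest")

# Index the named vector by id; ids not in the table give NA
team_name <- unname(team_lookup[team_id])
team_name
```

With the ids as keys, the lookup table no longer depends on the image size parameters (h and w) in the URL.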


 