R, web scraping, rvest, transfermarkt data

I'm trying to scrape transfermarkt data for private purposes (no commercial use).

In particular, I need information about all transfers for a given time period. It is possible to search for transfers by day, so my plan is to query each day using this page: https://www.transfermarkt.co.uk/transfers/transfertagedetail/statistik/top/plus/0?land_id_ab=&land_id_zu=&leihe=&datum=2000-07-02

I need the table at the bottom of this page.

I'm using rvest to do it. Here's the code:

library(dplyr)
library(rvest)

url = "http://www.transfermarkt.co.uk/transfers/transfertagedetail/statistik/top/plus/0?land_id_ab=&land_id_zu=&leihe=&datum=2000-07-02"
site =  read_html(url)
site %>% html_node("#yw1 td") %>% html_table() %>% View()

I'm getting an error:

Error in open.connection(x, "rb") : HTTP error 404.

This code worked about a year ago, but it no longer does. I've tried adding html_session, but the result is the same.

Could you please help me?

404 is the HTTP status code for "Not Found": the server has no page at the URL you requested.

If you open this URL in your browser (as I did), you'll see that it can't be reached there either, which is why you can't scrape it.

More on HTTP status codes: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

You can check for errors programmatically with httr:

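library(httr)  # provides GET() and status_code()
GET(url) %>% status_code()  # %>% is available because dplyr is loaded in the question's code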
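
As a rough sketch of how that check could guard the scraping step, the request can be made once with httr and the response parsed only if it succeeded. The helper name fetch_page is hypothetical, not part of httr or rvest, and this is only one way to arrange it:

```r
library(httr)
library(rvest)

# Hypothetical helper (not part of httr or rvest): request a page and
# parse it only when the server reports success.
fetch_page <- function(url) {
  resp <- GET(url)
  # http_error() is TRUE for status codes of 400 and above, such as 404
  if (http_error(resp)) {
    stop("Request failed with HTTP status ", status_code(resp), call. = FALSE)
  }
  # Parse the body that was already downloaded instead of requesting it again
  read_html(content(resp, as = "text", encoding = "UTF-8"))
}
```

With this in place, site <- fetch_page(url) would stop with a message naming the status code instead of failing inside read_html().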

Colin

@Petr

I get a different error than you when I run your code. Instead of the 404 error, I get: Error: html_name(x) == "table" is not TRUE

I was able to set the variable site properly.

I was also able to use tbls <- html_nodes(site, "table") to get a list of all tables on the page.

Then, using html_table(tbls[6]), I was able to see the last table on the HTML page:

  X1            X2
1 NA José Bosingwa
2 NA    Right-Back

Is this what you are looking for? Can you run each of these commands in sequence? If not, where does it fall apart on you?
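
To illustrate where the html_name(x) == "table" error comes from, here is a small offline sketch. The HTML snippet is a made-up stand-in for the real page (it assumes #yw1 is a container wrapping a table, which may not match the live site), and the player row is placeholder data:

```r
library(rvest)

# Made-up stand-in for the real page: a container with id "yw1" around one table.
page <- read_html('
  <div id="yw1">
    <table>
      <tr><th>Player</th><th>Position</th></tr>
      <tr><td>Example Player</td><td>Right-Back</td></tr>
    </table>
  </div>')

# "#yw1 td" matches a single cell, not a table, so html_table() has
# nothing table-shaped to parse.
cell <- html_node(page, "#yw1 td")
cell_tag <- html_name(cell)

# "#yw1 table" matches the <table> element itself.
tbl <- html_node(page, "#yw1 table")
transfers <- html_table(tbl)
```

Here cell_tag is "td", while transfers is a one-row table with the columns Player and Position. On the real page, the same idea would mean pointing the selector at the table element rather than at its cells.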
