I'm trying to scrape Transfermarkt data for private purposes (no commercial use).
In particular, I need information about all transfers for a given time period. It is possible to search for transfers by day, so my plan is to query each day using this page: https://www.transfermarkt.co.uk/transfers/transfertagedetail/statistik/top/plus/0?land_id_ab=&land_id_zu=&leihe=&datum=2000-07-02
I need the table at the bottom of this page.
I'm using rvest to do it. Here's the code:
library(dplyr)
library(rvest)
url = "http://www.transfermarkt.co.uk/transfers/transfertagedetail/statistik/top/plus/0?land_id_ab=&land_id_zu=&leihe=&datum=2000-07-02"
site = read_html(url)
site %>% html_node("#yw1 td") %>% html_table() %>% View()
I'm getting an error:
Error in open.connection(x, "rb") : HTTP error 404.
This code worked about a year ago, but it doesn't now. I've tried adding html_session(), but the result is the same.
Could you please help me?
404 is the HTTP status code saying you can't access this page.
If you try this URL in your browser (as I did), you'll notice you can't access it — this is why you can't scrape it.
More on HTTP status codes: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
You can programmatically check for the error with httr:
GET(url) %>% status_code()
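If the page really is gone there is nothing to scrape, but some sites answer 404 or 403 only to non-browser clients, so sending a browser-like User-Agent header is worth trying. A minimal sketch with httr — the User-Agent string and the fallback message are illustrative, and this is not guaranteed to unblock the site:

```r
library(httr)
library(rvest)

url <- "https://www.transfermarkt.co.uk/transfers/transfertagedetail/statistik/top/plus/0?land_id_ab=&land_id_zu=&leihe=&datum=2000-07-02"

# Pretend to be a regular browser; some sites reject default HTTP clients
resp <- GET(url, user_agent("Mozilla/5.0 (X11; Linux x86_64)"))

if (status_code(resp) == 200) {
  site <- read_html(content(resp, "text"))
  # Grab every table, then pick the one you need afterwards
  tables <- site %>% html_nodes("table") %>% html_table(fill = TRUE)
} else {
  message("Request failed with HTTP status ", status_code(resp))
}
```

If the status is still 404 with a browser User-Agent, the URL itself has likely changed and you'll need to find the page's new address on the site.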
Colin
@Petr
I get a different error than you when I run your code. Instead of the 404 error I get: Error: html_name(x) == "table" is not TRUE
I was able to set the variable site properly.
I was also able to use tbls <- html_nodes(site, "table") to get a list of all tables on the page.
Then using html_table(tbls[6]) I was able to see the last table on the page:
X1 X2
1 NA José Bosingwa
2 NA Right-Back
Is this what you are looking for? Can you run each of these commands in sequence? If not, where does it fall apart on you?
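Putting the steps above together, the full sequence might look like this (assuming the page loads for you, and that the transfer table is still the 6th table on the page — that index may change if the site's layout changes):

```r
library(dplyr)
library(rvest)

url <- "https://www.transfermarkt.co.uk/transfers/transfertagedetail/statistik/top/plus/0?land_id_ab=&land_id_zu=&leihe=&datum=2000-07-02"

site <- read_html(url)             # fails with an HTTP error if the page is blocked
tbls <- html_nodes(site, "table")  # all tables on the page
length(tbls)                       # inspect how many tables there are

# html_table() on a single node ([[6]]) returns a data frame;
# fill = TRUE tolerates rows with missing cells
transfers <- html_table(tbls[[6]], fill = TRUE)
head(transfers)
```

Note that tbls[[6]] extracts a single node (giving one data frame), while tbls[6] keeps a one-element node list (giving a list containing a data frame) — either works, but the [[ ]] form is usually what you want.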