
How do I scrape data from this specific website using R?

I want to download the data from this website.

http://asphaltoilmarket.com/index.php/state-index-tracker/

But the request keeps timing out.

I have already tried the following methods, but each of them times out.

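# Attempt 1: rvest
library(rvest)
IndexData <- read_html("http://asphaltoilmarket.com/index.php/state-index-tracker/")
# Attempt 2: RCurl
library(RCurl)
IndexData <- getURL("http://asphaltoilmarket.com/index.php/state-index-tracker/")
# Attempt 3: httr + XML (parse the response body as text, not the response object)
library(httr)
library(XML)
url <- "http://asphaltoilmarket.com/index.php/state-index-tracker/"
IndexData <- htmlParse(content(GET(url), as = "text"), asText = TRUE)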

This website opens in the browser without any problem, and I am able to download the data using Excel and Alteryx.

If by "get the data" you mean "scrape the table on that page", then you just need to go a little further.

First, check the site's robots.txt to see whether scraping is allowed. In this case, it contains nothing that prohibits scraping.
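As a rough illustration of that check, here is a minimal sketch in base R that lists the Disallow rules applying to all user agents. The helper name disallowed_paths and the sample rules are made up for illustration (they are not the site's actual robots.txt), and the parser ignores Allow rules and wildcards.

```r
# Hypothetical helper: collect Disallow paths from the groups that apply to "*".
# Simplified on purpose: Allow rules and wildcard patterns are ignored.
disallowed_paths <- function(robots_lines) {
  lines <- trimws(sub("#.*$", "", robots_lines))  # strip comments and padding
  lines <- lines[nzchar(lines)]                   # drop empty lines
  applies <- FALSE         # does the current group apply to every crawler?
  prev_was_agent <- FALSE  # consecutive User-agent lines share one group
  paths <- character(0)
  for (line in lines) {
    key   <- tolower(trimws(sub(":.*$", "", line)))
    value <- trimws(sub("^[^:]*:", "", line))
    if (key == "user-agent") {
      if (!prev_was_agent) applies <- FALSE  # a new group starts here
      if (value == "*") applies <- TRUE
      prev_was_agent <- TRUE
    } else {
      prev_was_agent <- FALSE
      if (key == "disallow" && applies && nzchar(value)) {
        paths <- c(paths, value)
      }
    }
  }
  paths
}

# Made-up sample rules, NOT the real robots.txt of the site in the question.
example <- c(
  "User-agent: *",
  "Disallow: /wp-admin/",
  "Allow: /wp-admin/admin-ajax.php",
  "",
  "User-agent: BadBot",
  "Disallow: /"
)
disallowed_paths(example)

# Against a live site (assumes the file sits at the conventional /robots.txt path):
# disallowed_paths(readLines("http://asphaltoilmarket.com/robots.txt"))
```

An empty result means that no path is disallowed for generic crawlers. The site's terms of use may still apply.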

You've got the HTML for the site; you just need to find the CSS selector for what you want. You can use the browser's developer tools or something like SelectorGadget to find the table and get its CSS selector.
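If you would rather find the selector from R than from the browser, one option is to list the id attributes of every table on the page. The sketch below runs on a small made-up HTML string, so it works offline. Only the id tablepress-5 is taken from this answer; the rest of the markup is invented for illustration.

```r
library(rvest)

# Made-up stand-in for the downloaded page; in practice this would be
# the result of read_html() on the real URL.
page <- read_html('
  <html><body>
    <table id="tablepress-5">
      <tr><th>State</th><th>Jan</th></tr>
      <tr><td>Alabama</td><td>$496.27</td></tr>
    </table>
    <table class="layout"><tr><td>not the data</td></tr></table>
  </body></html>')

tables    <- html_nodes(page, "table")   # every <table> element on the page
table_ids <- html_attr(tables, "id")     # NA for tables without an id

# Turn the ids that exist into CSS id selectors such as "#tablepress-5".
selectors <- paste0("#", table_ids[!is.na(table_ids)])
selectors
```

Each string in selectors can be passed straight to html_node().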

After that, take the HTML, extract the node you're interested in with html_node(), then extract the table with html_table().

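library(magrittr)  # provides the %>% pipe
library(rvest)

# Download and parse the page
html <- read_html("http://asphaltoilmarket.com/index.php/state-index-tracker/")

# Select the table by its id, then convert it to a data frame
html %>%
  html_node("#tablepress-5") %>%
  html_table()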
#>             State     Jan     Feb     Mar     Apr     May     Jun     Jul
#> 1         Alabama $496.27 $486.86 $482.16 $498.62 $517.44 $529.20 $536.26
#> 2          Alaska $513.33 $513.33 $513.33 $513.33 $513.33 $525.84 $535.00
#> 3         Arizona $476.00 $469.00 $466.00 $463.00 $470.00 $478.00 $480.00
#> 4        Arkansas $503.50 $500.50 $494.00 $503.00 $516.50 $521.20 $525.00
#> 5      California $305.80 $321.00 $346.20 $365.50 $390.10 $380.50 $345.50
#> 6        Colorado $228.10 $301.45 $320.58 $354.12 $348.70 $277.55 $297.23
#> 7     Connecticut $495.00 $495.00 $495.00 $495.00 $502.50 $502.50 $500.56
#> 8        Delaware $493.33 $458.33 $481.67 $496.67 $513.33 $510.00 $498.33
#> 9         Florida $507.30 $484.32 $487.12 $503.38 $518.52 $517.68 $514.03
#> 10        Georgia $515.00 $503.00 $503.00 $517.00 $534.00 $545.00 $550.00 
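
The prices in that output are strings with a leading "$", so they have to be converted before any arithmetic. Here is a minimal base-R sketch, using a few values copied from the output above. The helper name parse_dollars is hypothetical, and the sketch assumes that every column other than State holds a price.

```r
# A few rows copied from the output above; the real table has more rows and columns.
prices <- data.frame(
  State = c("Alabama", "Alaska", "Arizona"),
  Jan   = c("$496.27", "$513.33", "$476.00"),
  Feb   = c("$486.86", "$513.33", "$469.00"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip "$" and thousands separators, then convert to numeric.
parse_dollars <- function(x) as.numeric(gsub("[$,]", "", x))

# Assumption: every column except State is a price column.
month_cols <- setdiff(names(prices), "State")
prices[month_cols] <- lapply(prices[month_cols], parse_dollars)

str(prices)        # the month columns are now numeric
mean(prices$Jan)   # arithmetic works after conversion
```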
