Web scraping data table in R not working, XML or getURL
Normally I don't have any issues getting table data from sites, but this one is throwing me for a loop.
I've tried various suggestions from the site: "R: Scraping Site, Incrementing Loop by Date in URL, Saving To CSV", "Scraping from aspx website using R", and "web scraping in R".
I've tried two methods to get something from the site, and both end in errors.
The first approach:
#####Reading in data
library(RCurl)
library(XML)
library(xts)
#pulling rainfall data csv
direct_rainfall <- read.csv(url(getURL("http://cdec.water.ca.gov/cgi-progs/getMonthlyCSV?station_id=CVT&dur_code=M&sensor_num=2&start_date=1/1/2000&end_date=now")))
This ends with the following error: Error in function (type, msg, asError = TRUE) : Failed to connect to cdec.water.ca.gov port 80: Timed out
The second method:
#xml data pull method
require(XML)
url = "http://cdec.water.ca.gov/cgi-progs/getMonthlyCSV?station_id=CVT&dur_code=M&sensor_num=2&start_date=1/1/2000&end_date=now"
doc = htmlParse(url)
Which ends with the following error: Error: failed to load external entity "http://cdec.water.ca.gov/cgi-progs/getMonthlyCSV?station_id=CVT&dur_code=M&sensor_num=2&start_date=1/1/2000&end_date=now"
Any guidance would be appreciated. I just can't figure out why I'm getting nothing when I try to pull from the URL.
Thanks!
If you look at the website, it's a reasonably nicely formatted CSV. Happily, if you pass read.csv a URL, it will automatically handle the connection for you, so all you really need is:
url <- 'http://cdec.water.ca.gov/cgi-progs/getMonthlyCSV?station_id=CVT&dur_code=M&sensor_num=2&start_date=1/1/2000&end_date=now'
df <- read.csv(url, skip = 3, nrows = 17, na.strings = 'm')
df[1:5,1:10]
##   X.station. X.sensor. X.year. X.month.   X01   X02  X03   X04  X05  X06
## 1        CVT         2    2000       NA 20.90 19.44 3.74  3.31 5.02 0.85
## 2        CVT         2    2001       NA  7.23  9.53 3.86  7.47 0.00 0.15
## 3        CVT         2    2002       NA  3.60  4.43 8.71  2.76 2.78 0.00
## 4        CVT         2    2003       NA  1.71  4.34 4.45 13.45 2.95 0.00
## 5        CVT         2    2004       NA  3.41 10.57 1.80  0.87 0.90 0.00
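A side note on the first attempt: getURL() already returns the page body as a character string, so wrapping it in url() asks R to treat the CSV text itself as an address. If you would rather download first and parse second, read.csv() accepts that string through its text argument. The following is only a minimal sketch of that idea. The sample string stands in for the real download, its rows and simplified header are made up for illustration, and the helper name read_csv_or_null is hypothetical. The helper simply turns a failed read (for example a time-out) into NULL with a message instead of stopping the script.

```r
# Made-up sample standing in for the downloaded body: three preamble lines,
# then a header row, then data, with 'm' marking a missing value.
# (Assumption: the real file's header and rows differ from this.)
body <- paste(
  "Title line",
  "Station line",
  "Units line",
  "station,sensor,year,month,01,02,03",
  "CVT,2,2000,,20.90,19.44,3.74",
  "CVT,2,2001,,7.23,m,3.86",
  sep = "\n"
)

# Parse text that is already in memory, e.g. the string RCurl::getURL() returns.
# skip drops the preamble; na.strings maps 'm' to NA.
df <- read.csv(text = body, skip = 3, na.strings = "m")

# Hypothetical helper: return NULL (with a message) instead of stopping
# when the file or URL cannot be read.
read_csv_or_null <- function(src, ...) {
  tryCatch(
    read.csv(src, ...),
    error = function(e) {
      message("Could not read ", src, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Possible usage with the URL from the answer above (not run here):
# df <- read_csv_or_null(url, skip = 3, nrows = 17, na.strings = "m")
```

If the helper returns NULL for the real URL, the problem is the connection (proxy, firewall, or the server being unreachable) and no parsing option will fix it.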