rvest get html hyperlink in table
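I'm trying to scrape the geocodes embedded in the hyperlinks, and I'd like to build one table that contains all the tables together with the geocodes.
So far I've fetched the tables with the following code: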
library(rvest)
url<-"http://www.city-data.com/accidents/acc-Nashua-New-Hampshire.html"
citidata <- read_html(url)
ta<- citidata %>%
html_nodes("table") %>%
.[1:29] %>%
html_table()
dat<-do.call(rbind, lapply(ta, data.frame, stringsAsFactors=FALSE))
citystate <- citidata %>%
html_node("h1 span") %>%
html_text()
citystate <- gsub("Fatal car crashes and road traffic accidents in ",
"", citystate)
loc<-data.frame(matrix(unlist(strsplit(citystate, ",", fixed = TRUE)), ncol=2, byrow=TRUE))
dat$City<-loc$X1
dat$State<-loc$X2
This gives me:
Date,Location,Vehicles,Drunken.persons,Fatalites,Persons,Pedestrians,City,State
1 Jun 26, 2013 87:99 PM, Temple Street, 1, -, 1, 1, -, Nashua, New Hampshire
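Next I tried to add the geocodes to the data frame, but I don't know how to do that.
Below is the code I use to scrape the geocodes from the hyperlinks: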
pg <- read_html("http://www.city-data.com/accidents/acc-Nashua-New-Hampshire.html")
geo <- data.frame(gsub("javascript:showGoogleSView","",pg %>% html_nodes("a") %>% html_attr("href") %>% .[31:60]))
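Not all incidents have an associated latitude/longitude pair. The code below relies on the fact that the incident dates are (apparently) unique and merges the coordinates with the main dat you built earlier.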
library(rvest)
library(stringr)
library(dplyr)
url <- "http://www.city-data.com/accidents/acc-Nashua-New-Hampshire.html"
# Get all incident tables -------------------------------------------------
citidata <- read_html(url)
ta <- citidata %>%
html_nodes("table") %>%
.[1:29] %>%
html_table()
# rbind them together -----------------------------------------------------
dat <- do.call(rbind, lapply(ta, data.frame, stringsAsFactors=FALSE))
citystate <- citidata %>%
html_node("h1 span") %>%
html_text()
# Get city/state and add it to the data.frame -------------------------------
citystate <- gsub("Fatal car crashes and road traffic accidents in ",
"", citystate)
loc <- data.frame(matrix(unlist(strsplit(citystate, ",", fixed=TRUE)),
ncol=2, byrow=TRUE))
dat$City <- loc$X1
dat$State <- loc$X2
# Get GPS coords where available ------------------------------------------
coords <- citidata %>%
html_nodes(xpath="//a[@class='showStreetViewLink']") %>%
html_attr("href") %>%
str_extract("([[:digit:]-,\\.]+)") %>%
str_split(",") %>%
unlist() %>%
matrix(ncol=2, byrow=TRUE) %>%
data.frame(stringsAsFactors=FALSE) %>%
rename(lat=X1, lon=X2) %>%
mutate(lat=as.numeric(lat), lon=as.numeric(lon))
# Get the incident time associated with each coordinate pair, for the merge --
coord_time <- citidata %>%
html_nodes(xpath="//a[@class='showStreetViewLink']/../preceding-sibling::td[1]") %>%
html_text() %>%
data.frame(Date=., stringsAsFactors=FALSE)
# Merge the coordinates with the data.frame we built earlier --------------
left_join(dat, bind_cols(coords, coord_time), by="Date")
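To see the extract-then-merge idea without hitting the site, here is a minimal sketch on toy data. It uses base R only (regmatches and merge instead of stringr and dplyr) so that it runs without extra packages. The href strings, dates, and street names are made up for illustration; the href shape is an assumption based on the "javascript:showGoogleSView" prefix in the question, and the real page may differ.

```r
# Hypothetical hrefs in the shape the question implies; the real page may differ.
hrefs <- c("javascript:showGoogleSView(42.7654,-71.4676)",
           "javascript:showGoogleSView(42.7512,-71.4910)")

# Made-up dates standing in for the table cell that precedes each link.
link_dates <- c("Jan 01, 2013 01:00 PM", "Feb 02, 2013 02:00 AM")

# Pull the "lat,lon" text out of each href, then split it into two columns.
pairs <- regmatches(hrefs, regexpr("-?[0-9.]+,-?[0-9.]+", hrefs))
parts <- do.call(rbind, strsplit(pairs, ",", fixed = TRUE))

coords <- data.frame(Date = link_dates,
                     lat = as.numeric(parts[, 1]),
                     lon = as.numeric(parts[, 2]),
                     stringsAsFactors = FALSE)

# Toy incident table: the third incident has no Street View link.
dat <- data.frame(Date = c(link_dates, "Mar 03, 2013 03:00 PM"),
                  Location = c("A Street", "B Street", "C Street"),
                  stringsAsFactors = FALSE)

# all.x = TRUE keeps every incident (a left join); rows without a match get NA.
merged <- merge(dat, coords, by = "Date", all.x = TRUE)
print(merged)
```

The incident without a link stays in the result with NA for lat and lon, which is the behavior you want when only some rows have coordinates. Note that merge sorts the result by the key column, so the row order can differ from dat.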