rvest get html hyperlink in table

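I am trying to scrape the geocodes embedded in the hyperlinks, and I would like to build a single table that combines all of the tables on the page with those geocodes.

What I have done so far is get the tables with the following code: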

library(rvest)

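url <- "http://www.city-data.com/accidents/acc-Nashua-New-Hampshire.html"

# read_html() replaces html(), which newer rvest releases no longer provide
citidata <- read_html(url)

# Parse the first 29 tables on the page
ta <- citidata %>%
  html_nodes("table") %>%
  .[1:29] %>%
  html_table()

# Stack the tables into one data frame
dat <- do.call(rbind, lapply(ta, data.frame, stringsAsFactors = FALSE))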

citystate <- citidata %>%
 html_node("h1 span") %>%
 html_text()

citystate <- gsub("Fatal car crashes and road traffic accidents in ",
                  "", citystate)

loc<-data.frame(matrix(unlist(strsplit(citystate, ",", fixed = TRUE)), ncol=2, byrow=TRUE))
dat$City<-loc$X1
dat$State<-loc$X2

This gives me:

Date,Location,Vehicles,Drunken.persons,Fatalites,Persons,Pedestrians,City,State
1 Jun 26, 2013 87:99 PM, Temple Street, 1, -, 1, 1, -, Nashua, New Hampshire

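I then tried to add the geocodes to the data frame, but I don't know how to do that.

Below is the code I use to scrape the geocodes from the hyperlinks: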

pg <- read_html("http://www.city-data.com/accidents/acc-Nashua-New-Hampshire.html")
geo <- data.frame(gsub("javascript:showGoogleSView", "",
                       pg %>% html_nodes("a") %>% html_attr("href") %>% .[31:60]))

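Not all of the incidents have an associated lat/lon pair. The following code relies on the fact that the incident date is (apparently) unique, and uses it to merge the coordinates with the main dat you built previously.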

library(rvest)
library(stringr)
library(dplyr)

url <- "http://www.city-data.com/accidents/acc-Nashua-New-Hampshire.html"

# Get all incident tables -------------------------------------------------

citidata <- read_html(url)  # read_html() replaces the older html()

ta <- citidata %>%
  html_nodes("table") %>%
  .[1:29] %>%
  html_table()

# rbind them together -----------------------------------------------------

dat <- do.call(rbind, lapply(ta, data.frame, stringsAsFactors=FALSE))

citystate <- citidata %>%
  html_node("h1 span") %>%
  html_text()

# Get city/state and add it to the data.frame -------------------------------

citystate <- gsub("Fatal car crashes and road traffic accidents in ", 
                  "", citystate)

loc <- data.frame(matrix(unlist(strsplit(citystate, ",", fixed=TRUE)), 
                         ncol=2, byrow=TRUE))

dat$City <- loc$X1
dat$State <- loc$X2

# Get GPS coords where available ------------------------------------------

coords <- citidata %>% 
  html_nodes(xpath="//a[@class='showStreetViewLink']") %>% 
  html_attr("href") %>% 
  str_extract("([[:digit:]-,\\.]+)") %>% 
  str_split(",") %>% 
  unlist() %>% 
  matrix(ncol=2, byrow=TRUE) %>% 
  data.frame(stringsAsFactors=FALSE) %>% 
  rename(lat=X1, lon=X2) %>% 
  mutate(lat=as.numeric(lat), lon=as.numeric(lon))

# Get the incident time that goes with each GPS coordinate pair, for the merge

# Uses citidata (the page parsed above); pg was never defined in this script
coord_time <- citidata %>% 
  html_nodes(xpath="//a[@class='showStreetViewLink']/../preceding-sibling::td[1]") %>%
  html_text() %>% 
  tibble(Date=.)

# Merge the coordinates with the data.frame we built earlier --------------

left_join(dat, bind_cols(coords, coord_time), by="Date")
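
To see what the extract-and-merge step does without hitting the site, here is a minimal offline sketch. It is not the code above: it uses base R (regmatches() and merge() with all.x = TRUE) in place of stringr and dplyr, and the href strings, dates, and street names are invented for illustration only.

```r
# Hypothetical hrefs shaped like the page's street-view links (values invented)
hrefs <- c("javascript:showGoogleSView(42.7501,-71.4902)",
           "javascript:showGoogleSView(42.7654,-71.4678)")

# Pull the "lat,lon" text out of each href, then split it into two columns
pairs <- regmatches(hrefs, regexpr("-?[0-9.]+,-?[0-9.]+", hrefs))
parts <- do.call(rbind, strsplit(pairs, ",", fixed = TRUE))

# Coordinates keyed by the (assumed unique) incident date
coords <- data.frame(Date = c("Jan 1, 2013 01:00 PM", "Mar 3, 2013 09:30 AM"),
                     lat  = as.numeric(parts[, 1]),
                     lon  = as.numeric(parts[, 2]),
                     stringsAsFactors = FALSE)

# Toy incident table: the second incident has no street-view link
dat <- data.frame(Date = c("Jan 1, 2013 01:00 PM",
                           "Feb 2, 2013 11:15 PM",
                           "Mar 3, 2013 09:30 AM"),
                  Location = c("Street A", "Street B", "Street C"),
                  stringsAsFactors = FALSE)

# Left join on Date: every incident is kept, missing coordinates become NA
merged <- merge(dat, coords, by = "Date", all.x = TRUE)
print(merged)
```

The incident without a link keeps its row and gets NA for lat and lon, which is the same behavior left_join() gives in the full script. Note that merge() sorts the result by the key column by default, so the row order may differ from dat.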
