
Automate web scraping with R

I have managed to scrape content for a single URL, but am struggling to automate it for multiple URLs.

Here is how it is done for a single page:

library(XML); library(data.table)
# `url` holds the page-specific part of the address
theurl <- paste("http://google.com/", url, "/ul", sep = "")
convertUTF <- htmlParse(theurl, encoding = "UTF-8")
tables <- readHTMLTable(convertUTF)
# Keep the table with the most rows
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
table <- tables[[which.max(n.rows)]]
TableData <- data.table(table)

Now I have a vector of URLs and want to scrape the corresponding table from each one.

Here, I read in data comprising multiple http links:

ur.l <- data.frame(read.csv(file.choose(), header=TRUE, fill=TRUE))

theurl <- matrix(NA, nrow=nrow(ur.l), ncol=1)
for (i in 1:nrow(ur.l)) {
  url <- as.character(ur.l[i, 2])
}

Each of the three additional URLs that you provided refers to a page that contains no tables, so it's not a particularly useful example dataset. However, a simple way to handle errors is with tryCatch. Below I've defined a function that reads in tables from url u, calculates the number of rows for each table at that url, then returns the table with the most rows as a data.table.

You can then use sapply to apply this function to each URL (or, in your case, each org ID, e.g. 36245119) in a vector.

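library(XML); library(data.table)

# Return the largest table found on the page for ID `u`,
# or the error object if reading or parsing fails.
scrape <- function(u) {
  tryCatch({
    tabs <- readHTMLTable(file.path("http://finstat.sk", u, "suvaha"),
                          encoding='utf-8')
    # Pick the table with the most rows
    tab <- tabs[[which.max(sapply(tabs, function(x) nrow(x)))]]
    data.table(tab)
  }, error=function(e) e)
}

urls <- c('36245119', '46894853', '46892460', '46888721')
# simplify = FALSE keeps the result as a list named by ID
res <- sapply(urls, scrape, simplify = FALSE)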
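
Because scrape hands back the error object when something goes wrong, the result mixes tables and errors. The following is a minimal sketch of one way to separate the two afterwards. To keep it runnable offline it uses a stand-in function, fake_scrape, and a made-up has_table vector; both are hypothetical and not part of the code above, and simplify = FALSE is my own choice so that the result stays a named list.

```r
# Hypothetical stand-in for scrape(): only IDs listed in `has_table` succeed
has_table <- c("46892460")
fake_scrape <- function(u) {
  tryCatch({
    if (!u %in% has_table) stop("no tables found for ", u)
    data.frame(id = u)
  }, error = function(e) e)
}

urls <- c("36245119", "46894853", "46892460", "46888721")
res <- sapply(urls, fake_scrape, simplify = FALSE)

# TRUE where the element is a captured error condition
failed <- vapply(res, inherits, logical(1), what = "error")

tables <- res[!failed]                                          # successful results
errors <- vapply(res[failed], conditionMessage, character(1))   # messages, named by ID
```

After this, names(tables) tells you which IDs produced a table, and errors gives the reason for each ID that did not.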

Take a look at ?tryCatch if you want to improve the error handling. Presently the function simply returns the errors themselves.
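
As one possible refinement, the error handler can report the problem and return NULL, so that failed URLs are easy to drop. This is only a sketch: null_on_error and demo_scraper are hypothetical names introduced here for illustration, and demo_scraper stands in for a real scraping function so the example runs without network access.

```r
# Wrap any one-argument scraping function so that an error is reported
# with message() and NULL is returned instead of the error object.
null_on_error <- function(scraper) {
  function(u) {
    tryCatch(
      scraper(u),
      error = function(e) {
        message("Skipping ", u, ": ", conditionMessage(e))
        NULL
      }
    )
  }
}

# Offline stand-in used only to make this example runnable
demo_scraper <- function(u) {
  if (u == "bad") stop("no tables on page")
  data.frame(id = u)
}

res <- lapply(c(a = "good", b = "bad"), null_on_error(demo_scraper))
res <- Filter(Negate(is.null), res)   # keep only the URLs that succeeded
```

In practice you would pass your own scraping function in place of demo_scraper; the wrapper itself does not depend on how the page is read.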
