Why do I get an error in matrix when I try to web scrape a table?

Here is a sample of my code. The problem is with the second link (for Cedar Realty Trust).

library(rvest)
library(stringr)
library(plyr)
library(dplyr)
library(lubridate)
library(readr)
library(stringi)
library(tidyverse)
library(purrr)

urls <- list(c("CEDAR FAIR L P ", "https://www.sec.gov/Archives/edgar/data/811532/000081153219000037/exhibit212018subsidiaries.htm"),
             c("CEDAR REALTY TRUST, INC.    ", "https://www.sec.gov/Archives/edgar/data/761648/000156459020004590/cdr-ex211_8.htm"),
             c("Celanese Corp ", "https://www.sec.gov/Archives/edgar/data/1306830/000130683020000018/ex211-10k123119.htm"))

List.Of.Tabs <- map(urls, ~ {

  name <- .x[1]
  link <- .x[2]
  Sys.sleep(2)
  webpage <- read_html(link)
  tbls <- html_nodes(webpage, "table")
  tbls_ls <- html_table(tbls, fill = TRUE)
  pos1 <- possibly(function(tbls) bind_rows(tbls) %>% 
                     filter_all(any_vars(. %in% c("Singapore", "SGP"))) %>%
                     mutate(name = name) 
                   , otherwise = NA)

  pos1(tbls_ls)
})

The error message I get:

Error in matrix(NA_character_, nrow = n, ncol = maxp) : 
  invalid 'ncol' value (too large or NA)
In addition: Warning messages:
1: In max(p) : no non-missing arguments to max; returning -Inf
2: In matrix(NA_character_, nrow = n, ncol = maxp) :
  NAs introduced by coercion to integer range

How can I modify my code to resolve this error?
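The two warnings in the message hint at the cause. The sketch below uses base R only and rests on an assumption that the thread does not confirm: that one table on the Cedar Realty Trust page yielded no cells, so the vector of per-row cell counts (called `p` in the warning) was empty. The names `p` and `maxp` are borrowed from the error message; everything else is illustrative.

```r
# Assumed situation: a table with no rows, so the per-row cell counts are empty
p <- integer(0)

# max() of an empty vector is -Inf, with the
# "no non-missing arguments to max" warning seen in the question
maxp <- suppressWarnings(max(p))
print(maxp)  # -Inf

# -Inf cannot be coerced to an integer column count, so matrix() stops
result <- suppressWarnings(tryCatch(
  matrix(NA_character_, nrow = 0, ncol = maxp),
  error = function(e) conditionMessage(e)
))
print(result)
```

If that assumption holds, the failure happens while a table is being converted, before any filtering runs, so the error handler has to wrap the table-parsing step as well.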

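Here is one way to do this using tryCatch. The matrix() and max(p) calls in the message appear to come from html_table(), which in the question's code runs outside the possibly() wrapper, so the error is never caught. Here read_html(), html_nodes() and html_table() all run inside tryCatch(), so a page that fails to parse returns NA.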

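library(tidyverse)
library(rvest)

map(urls, ~ {
  name <- .x[1]
  link <- .x[2]
  Sys.sleep(2)  # pause between requests
  tryCatch({
    # Parse every table on the page
    temp <- link %>%
      read_html() %>%
      html_nodes("table") %>%
      html_table(fill = TRUE)
    # Keep rows where any column is "Singapore" or "SGP", then add the company name
    map_df(temp, ~ filter_all(.x, any_vars(. %in% c("Singapore", "SGP")))) %>%
      mutate(name = name)
  }, error = function(e) NA)  # return NA if anything fails for this page
})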


#[[1]]
#[1] X1   X2   name
#<0 rows> (or 0-length row.names)

#[[2]]
#[1] NA

#[[3]]
#                                             X1 X2        X3 X4           name
#1                            Celanese PTE. LTD. NA Singapore NA Celanese Corp 
#2  Celanese Singapore Acetyls Holding PTE. LTD. NA Singapore NA Celanese Corp 
#3 Celanese Singapore Chemical Holding PTE. LTD. NA Singapore NA Celanese Corp 
#4                  Celanese Singapore PTE. LTD. NA Singapore NA Celanese Corp 
#5              Celanese Singapore VAM PTE. LTD. NA Singapore NA Celanese Corp 
#6        Celanese Singapore Emulsions PTE. LTD. NA Singapore NA Celanese Corp 

This still gives a warning, but it runs without an error.
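One consequence of wrapping the whole page in tryCatch is that a single bad table discards every table on that page, as the NA for the second link shows. A possible variation, sketched here under assumptions, is to catch errors per table instead. The inline HTML is a made-up stand-in for a filing page, and `safe_table` and `keep_matches` are hypothetical helper names, not part of rvest or purrr. Whether the empty table actually raises an error depends on the installed rvest version; the sketch is meant to behave the same either way.

```r
library(rvest)
library(purrr)
library(dplyr)

# Made-up stand-in for a filing page: an empty table followed by a normal one
page <- read_html("<html><body>
  <table></table>
  <table>
    <tr><td>Example PTE. LTD.</td><td>Singapore</td></tr>
    <tr><td>Example GmbH</td><td>Germany</td></tr>
  </table>
</body></html>")

# Parse one table; return NULL instead of stopping if html_table() fails
safe_table <- possibly(function(node) html_table(node, fill = TRUE),
                       otherwise = NULL)

# Keep rows in which any column equals one of the target values
keep_matches <- function(tbl) {
  filter_all(tbl, any_vars(. %in% c("Singapore", "SGP")))
}

result <- page %>%
  html_nodes("table") %>%
  map(safe_table) %>%
  compact() %>%          # drop tables that failed or came back empty
  map_df(keep_matches)

print(result)
```

Inside the original loop, the same pipeline could replace the body of tryCatch, with mutate(name = name) added at the end, so that pages with one unparseable table still contribute their remaining rows.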
