Trouble using the extract_tables() function in the tabulizer package

I am trying to scrape tables from a PDF stored in my local directory rather than from a web browser (the file does not open directly in a browser). I have downloaded the PDF to my local directory and am trying to read only the tables from there.

When I run my code:

PATH <-"C:\\Users\\gabrielburcea\\Rprojects\\Reports_scraping\\data_scraped\\icnarc_29052020\\icnarc_200529.pdf"  
test <- extract_tables(PATH, output = "data.frame", pages = c(10, 11))

I get the following error, which I cannot find anywhere on the internet:

Error in .jcall("RJavaTools", "Ljava/lang/Object;", "newInstance", .jfindClass(class),  : 
java.io.FileNotFoundException: /Library/Frameworks/R.framework/Versions/3.6/Resources/library (Is a directory)

Is there a way to solve this issue?

The .pdf I am trying to scrape was downloaded to my computer from this website.
The report is titled ICNARC COVID-19 report 2020-05-29.pdf and can be downloaded using the link on the right-hand side of the page.

Below is the output of traceback() after I received the error message.

   8: stop(list(message = "java.io.FileNotFoundException: /Library/Frameworks/R.framework/Versions/3.6/Resources/library (Is a directory)", 
           call = .jcall("RJavaTools", "Ljava/lang/Object;", "newInstance", 
               .jfindClass(class), .jarray(p, "java/lang/Object", dispatch = FALSE), 
               .jarray(pc, "java/lang/Class", dispatch = FALSE), evalString = FALSE, 
               evalArray = FALSE, use.true.class = TRUE), jobj = new("jobjRef", 
               jobj = <pointer: 0x7fd1ba0972b0>, jclass = "java/io/FileNotFoundException")))
    7: .jcheck(silent = FALSE)
    6: .jcall("RJavaTools", "Ljava/lang/Object;", "newInstance", .jfindClass(class), 
           .jarray(p, "java/lang/Object", dispatch = FALSE), .jarray(pc, 
               "java/lang/Class", dispatch = FALSE), evalString = FALSE, 
           evalArray = FALSE, use.true.class = TRUE)
    5: .J(Class@name, ...)
    4: new(J("java.io.FileInputStream"), name <- localfile)
    3: new(J("java.io.FileInputStream"), name <- localfile)
    2: load_doc(file, password = password, copy = copy)
    1: extract_tables(PATH, output = "data.frame", pages = c(10, 11))

And sessionInfo() returns this:

    R version 3.6.1 (2019-07-05)
    Platform: x86_64-apple-darwin15.6.0 (64-bit)
    Running under: macOS Mojave 10.14.6
    
    Matrix products: default
    BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
    LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
    
    locale:
    [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
    
    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base     
    
    other attached packages:
     [1] forcats_0.4.0   stringr_1.4.0   dplyr_0.8.5     purrr_0.3.3     readr_1.3.1     tidyr_1.0.3     tibble_3.0.1   
     [8] ggplot2_3.3.0   tidyverse_1.3.0 tabulizer_0.2.2
    
    loaded via a namespace (and not attached):
    
         [1] Rcpp_1.0.1          cellranger_1.1.0    pillar_1.4.3        compiler_3.6.1      dbplyr_1.4.2       
         [6] tools_3.6.1         lubridate_1.7.4     jsonlite_1.6        lifecycle_0.2.0     gtable_0.3.0       
        [11] nlme_3.1-140        lattice_0.20-38     pkgconfig_2.0.2     png_0.1-7           rlang_0.4.6        
        [16] reprex_0.3.0        cli_2.0.2           DBI_1.0.0           rstudioapi_0.11     haven_2.2.0        
        [21] rJava_0.9-12        withr_2.1.2         xml2_1.3.2          httr_1.4.1          fs_1.3.1           
        [26] hms_0.5.3           generics_0.0.2      vctrs_0.3.0         grid_3.6.1          tidyselect_1.1.0   
        [31] glue_1.3.1          R6_2.4.0            fansi_0.4.0         readxl_1.3.1        modelr_0.1.8       
        [36] magrittr_1.5        scales_1.0.0        tabulizerjars_1.0.1 backports_1.1.4     ellipsis_0.3.0     
        [41] rvest_0.3.5         assertthat_0.2.1    colorspace_1.4-1    stringi_1.4.6       munsell_0.5.0      
        [46] broom_0.5.6         crayon_1.3.4      

Thanks in advance for any help!

As discussed in the comments, the code works fine on Windows.

library(tabulizer)
link <- "https://www.icnarc.org/DataServices/Attachments/Download/8419d345-c7a1-ea11-9126-00505601089b"
dfr.list <- extract_tables(link, output="data.frame", pages=10:11)
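If you would rather work from a local copy, as in the question, it may help to confirm that the path points to an existing regular file before handing it to extract_tables(). The traceback shows java.io.FileInputStream failing on a directory, and sessionInfo() reports macOS while the path in the question is written Windows-style; whether that is the actual cause is not confirmed here. The sketch below uses only base R. check_pdf_path() is a hypothetical helper, not part of tabulizer, and the file name is a placeholder.

```r
# Hypothetical helper (not part of tabulizer): verify that a path points to
# an existing regular file before passing it to extract_tables().
check_pdf_path <- function(path) {
  path <- path.expand(path)  # resolve a leading "~"
  if (!file.exists(path)) {
    stop("No such file: ", path)
  }
  if (dir.exists(path)) {
    stop("Path is a directory, not a file: ", path)
  }
  normalizePath(path)
}

# Build paths with file.path() so the separators suit the current OS.
# The directory and file name below are placeholders.
pdf_path <- file.path(tempdir(), "example_report.pdf")
file.create(pdf_path)  # empty stand-in for the downloaded PDF
checked <- check_pdf_path(pdf_path)

# With tabulizer installed, the checked path could then be used as:
# tables <- extract_tables(checked, output = "data.frame", pages = 10:11)
```

If the helper stops with an error, the problem lies in the path itself rather than in tabulizer or Java.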

To get every table out of the list, use list2env with envir= set to .GlobalEnv, which is your workspace (the global environment). Beforehand, you need to give names to the elements of the unnamed list.

names(dfr.list) <- paste0("dfr", seq_along(dfr.list))  ## give names
list2env(dfr.list, envir=.GlobalEnv)  ## put to environment

ls()
# [1] "dfr.list"    "dfr1"        "dfr2"        "dfr3"        "link"  
# [2] "tables.list"
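As a variation on the same idea, here is a minimal sketch that unpacks the list into a dedicated environment instead of .GlobalEnv, which keeps the workspace uncluttered. The three small data frames are toy stand-ins for the output of extract_tables(), and tables_env is a name chosen for illustration.

```r
# Toy stand-ins for the data frames returned by extract_tables()
dfr.list <- list(
  data.frame(x = 1:2),
  data.frame(y = c("a", "b")),
  data.frame(z = c(TRUE, FALSE))
)

# list2env() requires a named list, so name the elements first
names(dfr.list) <- paste0("dfr", seq_along(dfr.list))

# Unpack into a new environment rather than the global one
tables_env <- list2env(dfr.list, envir = new.env())

ls(tables_env)      # names of the unpacked objects
tables_env$dfr2     # access one table through the environment
dfr.list[["dfr2"]]  # the same table, taken directly from the named list
```

Once the list is named, indexing it with dfr.list[["dfr2"]] is often enough, so unpacking is a matter of convenience.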

.pdf extraction is often imperfect, and the data have to be cleaned afterwards. To improve the result, try experimenting with the area=, columns=, and other options of extract_tables; read the help page ?extract_tables and consult the documentation.
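A rough sketch of restricting extraction to a fixed region with area= follows. The coordinates are placeholders invented for illustration, not measured from the ICNARC report, and the file is assumed to sit in the working directory. The call is guarded so the snippet does nothing if tabulizer or the file is unavailable.

```r
# Placeholder regions for illustration only. Each vector gives page
# coordinates in the order top, left, bottom, right, one per requested page.
table_area <- list(
  c(100, 50, 500, 550),  # page 10 (hypothetical region)
  c(100, 50, 400, 550)   # page 11 (hypothetical region)
)

pdf_path <- "icnarc_200529.pdf"  # assumed to be in the working directory

if (requireNamespace("tabulizer", quietly = TRUE) && file.exists(pdf_path)) {
  tables <- tabulizer::extract_tables(
    pdf_path,
    pages  = c(10, 11),
    area   = table_area,
    guess  = FALSE,  # use the supplied area rather than auto-detection
    output = "data.frame"
  )
  str(tables, max.level = 1)
}
```

Real coordinates can be found interactively with tabulizer's locate_areas() or by trial and error.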
