Ignoring errors in readtext (R)

Question

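I am trying to use readtext to extract text from a large number of docx files (1,500) stored in one folder, after building the list of files with list.files.

You can find a similar example here: https://cran.r-project.org/web/packages/readtext/vignettes/readtext_vignette.html

Some files produce errors (example below), and the problem is that the extraction stops as soon as one occurs. I can identify the offending file by setting verbosity = 3, but then I have to restart the extraction to find the next problematic file.

My question: is there a way to keep the process from being interrupted when an error is encountered?

I set ignore_missing_files = TRUE, but that did not solve the problem.

Example of the error encountered: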

write error in extracting from zip file
Error: 'C:\Users--- c/word/document.xml' does not exist.

Sorry for not posting a reproducible example, but I don't know how to post one with large docx files. Here is the code, though:

library(readtext)
 
data_files <- list.files(path = "PATH", full.names = T, recursive = T)   # PATH = the path to the folder where the documents are located
extracted_texts <- readtext(data_files, docvarsfrom = "filepaths", dvsep = "/", verbosity = 3, ignore_missing_files = TRUE) # this is to extract the text in the files

 
write.csv2(extracted_texts, file = "data/text_extracts.csv", fileEncoding = "UTF-8") # this is to export the files into csv

Answer 1

Let's first put together a reproducible example:

# mode = "wb" downloads in binary mode, so the docx is not corrupted on Windows
download.file("https://file-examples-com.github.io/uploads/2017/02/file-sample_1MB.docx", "test1.docx", mode = "wb")
writeLines("", "test2.docx")

The first file I generate here should be a proper docx file; the second is garbage.

I would wrap readtext in a small function that handles errors and warnings:

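readtext_safe <- function(f) {
  # Return the marker "fail" instead of stopping when readtext() errors or warns
  out <- tryCatch(readtext::readtext(f), 
                  error = function(e) "fail",
                  warning = function(w) "fail")
  if (identical(out, "fail")) {
    # Log the problematic file and return NULL for it
    write(f, "errored_files.txt", append = TRUE)
    NULL
  } else {
    out
  }
}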

Note that I treat errors and warnings the same, which might not be what you actually want. We can use this function to loop over your files:

files <- list.files(pattern = "\\.docx$", ignore.case = TRUE, full.names = TRUE)  # escape the dot so it matches a literal "."

x <- lapply(files, readtext_safe)
x
#> [[1]]
#> readtext object consisting of 1 document and 0 docvars.
#> # Description: df[,2] [1 × 2]
#>   doc_id     text               
#>   <chr>      <chr>              
#> 1 test1.docx "\"Lorem ipsu\"..."
#> 
#> [[2]]
#> NULL

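In the resulting list, failed files simply have a NULL entry, because nothing was returned. I like to write out a list of these errored files, and the function above creates a txt file that looks like this: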

readLines("errored_files.txt")
#> [1] "./test2.docx"
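
If treating warnings like errors is too strict, one possible variant keeps the result when readtext only warns and skips the file only on a real error. This is a minimal sketch, not part of the original answer: the name readtext_safe2, the reader argument (there so the wrapper can be tried without real docx files) and the log file name are all assumptions made for illustration.

```r
# Sketch: log warnings but keep the result; log errors and return NULL.
# `readtext_safe2`, `reader` and `log_file` are illustrative names,
# not part of the readtext package.
readtext_safe2 <- function(f, reader = readtext::readtext,
                           log_file = "problem_files.txt") {
  # Append one tab-separated line: kind of problem, file, message
  log_problem <- function(kind, cond) {
    write(paste(kind, f, conditionMessage(cond), sep = "\t"),
          log_file, append = TRUE)
  }
  tryCatch(
    withCallingHandlers(
      reader(f),
      warning = function(w) {
        log_problem("warning", w)
        invokeRestart("muffleWarning")  # continue reading after logging
      }
    ),
    error = function(e) {
      log_problem("error", e)
      NULL                              # failed files yield NULL
    }
  )
}
```

It can be used in the same way as the wrapper above, for example x <- lapply(files, readtext_safe2). The log then records whether each listed file caused a warning or an error.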

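To get back to a single table that can be exported, as in the question, the NULL entries can be dropped and the remaining results stacked. The sketch below assumes that every successful result is a data frame with the same columns (readtext objects are data frames, but with docvarsfrom = "filepaths" the columns could differ between files). The helper name combine_results is hypothetical, and the demo list only stands in for the list x returned by lapply.

```r
# Sketch: collapse the list returned by lapply() into one data frame.
# `combine_results` is a hypothetical helper, not a readtext function.
combine_results <- function(results) {
  ok <- Filter(Negate(is.null), results)   # drop entries for failed files
  if (length(ok) == 0) {
    return(NULL)                           # nothing could be read
  }
  do.call(rbind, ok)                       # stack the remaining data frames
}

# Stand-in for the list `x`: two readable files and one failure
demo <- list(
  data.frame(doc_id = "a.docx", text = "first"),
  NULL,
  data.frame(doc_id = "b.docx", text = "second")
)
combined <- combine_results(demo)
nrow(combined)
```

The combined object could then be passed to write.csv2 as in the question, provided at least one file was read successfully.
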
Solution 1
2 Accepted 2020-08-07 16:31:21
