Ignoring errors in readtext (R)

Question

I am trying to extract the text from a large number of docx files (1500) placed in one folder, using readtext (after creating a list of the files with list.files).

You can find similar examples here: https://cran.r-project.org/web/packages/readtext/vignettes/readtext_vignette.html

I am getting errors with some files (examples below). The problem is that when such an error occurs, the extraction process stops. I can identify the problematic file by setting verbosity = 3, but then I have to restart the extraction process to find the next problematic file(s).

My question: is there a way to keep the process from being interrupted when an error is encountered?

I set ignore_missing_files = TRUE, but this did not fix the problem.

Examples of the errors encountered:

write error in extracting from zip file
Error: 'C:\Users--- c/word/document.xml' does not exist.

Sorry for not posting a reproducible example, but I do not know how to post an example with large docx files. This is the code:

library(readtext)

# PATH = the path to the folder where the documents are located
data_files <- list.files(path = "PATH", full.names = TRUE, recursive = TRUE)

# Extract the text from the files
extracted_texts <- readtext(data_files, docvarsfrom = "filepaths", dvsep = "/", verbosity = 3, ignore_missing_files = TRUE)

# Export the extracted text to csv
write.csv2(extracted_texts, file = "data/text_extracts.csv", fileEncoding = "UTF-8")

Answer 1

Let's first put together a reproducible example:

download.file("https://file-examples-com.github.io/uploads/2017/02/file-sample_1MB.docx", "test1.docx")
writeLines("", "test2.docx")

The first file I produced here should be a proper docx file; the second one is rubbish (a text file containing a single empty line, with a .docx extension).

I would wrap readtext in a small function that deals with the errors and warnings:

readtext_safe <- function(f) {
  out <- tryCatch(readtext::readtext(f), 
                  error = function(e) "fail",
                  warning = function(e) "fail")
  if (isTRUE("fail" == out)) {
    write(f, "errored_files.txt", append = TRUE)
  } else {
    return(out)
  }
}
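
If files that only trigger a warning should still be kept, one possible variant handles the two conditions separately. This is a minimal sketch in base R, not part of readtext: the names read_safe, reader and log_file are hypothetical, and the reader is passed in as an argument so the wrapper can be tried without any docx files.

```r
# Hypothetical variant: `read_safe`, `reader` and `log_file` are illustrative names.
read_safe <- function(f, reader, log_file = "errored_files.txt") {
  tryCatch(
    withCallingHandlers(
      reader(f),
      warning = function(w) {
        # Report the warning, then resume the read instead of aborting it
        message("warning for ", f, ": ", conditionMessage(w))
        invokeRestart("muffleWarning")
      }
    ),
    error = function(e) {
      # Record the failing file and return NULL for it
      write(f, log_file, append = TRUE)
      NULL
    }
  )
}
```

With readtext installed, it could be called as lapply(files, read_safe, reader = readtext::readtext). Whether a file that produced a warning yields usable text is something to check against your own data.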

Note that in readtext_safe I treat errors and warnings the same, which might not be what you actually want. We can use readtext_safe to loop through your files:

files <- list.files(pattern = "\\.docx$", ignore.case = TRUE, full.names = TRUE)

x <- lapply(files, readtext_safe)
x
#> [[1]]
#> readtext object consisting of 1 document and 0 docvars.
#> # Description: df[,2] [1 × 2]
#>   doc_id     text               
#>   <chr>      <chr>              
#> 1 test1.docx "\"Lorem ipsu\"..."
#> 
#> [[2]]
#> NULL

In the resulting list, failed files simply have a NULL entry, as nothing is returned for them. I like to write out a list of these errored files, and the function above creates a txt file that looks like this:

readLines("errored_files.txt")
#> [1] "./test2.docx"
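
To get from the list back to a single object like extracted_texts in the question, the NULL entries can be dropped and the remaining results row-bound. The sketch below uses plain data frames as stand-ins for readtext results, so it runs without the package. It assumes every successful result has the same columns, which may not hold if docvars are derived from file paths of differing depth.

```r
# Stand-in for the list returned by lapply(): two successful reads and one failure
x <- list(
  data.frame(doc_id = "a.docx", text = "first", stringsAsFactors = FALSE),
  NULL,
  data.frame(doc_id = "c.docx", text = "third", stringsAsFactors = FALSE)
)

# Drop the failed (NULL) entries, then stack the remaining data frames
ok <- Filter(Negate(is.null), x)
extracted_texts <- do.call(rbind, ok)
```

The combined data frame can then be exported with write.csv2 as in the question.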

Answer 1: 2, accepted, 2020-08-07 16:31:21
