R script - PDF error: Illegal character in hex string when searching for keywords

Question

I am trying to count how often a set of keywords occurs in multiple PDF files.

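library(tm)
library(pdftools)

# Collect every file in the working directory whose name ends in "pdf"
files <- list.files(pattern = "pdf$")
# Create a PDF reader for tm and build a corpus from the files
Rpdf <- readPDF(control = list(text = "-layout"))
corp <- Corpus(URISource(files), readerControl = list(reader = Rpdf))

# Count only the keywords of interest in each document
words <- c("example", "keyword", "test")
dt <- DocumentTermMatrix(corp, control = list(dictionary = words))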

When I run the code, I always get these errors:

PDF error: May not be a PDF file (continuing anyway)
PDF error (3): Illegal character <21> in hex string
PDF error (5): Illegal character <4f> in hex string
PDF error (7): Illegal character <54> in hex string
PDF error (8): Illegal character <59> in hex string
PDF error (9): Illegal character <50> in hex string
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't read xref table
Error in poppler_pdf_text(loadfile(pdf), opw, upw) : PDF parsing failure.
In addition: There were 12 warnings (use warnings() to see them)

If you have any suggestions, please let me know. Thank you!

Answer 1

I suspect your PDFs are binary files and therefore need to be downloaded/read in binary mode. I had a similar issue when downloading PDF files with download.file: after downloading them, I could not extract any information from them with pdftools. It turned out that the files were corrupted because I had not downloaded them in binary mode (try opening one in any PDF reader; it should report that the file is damaged). On Windows, I added mode="wb" to download.file so that the files are stored in the right format. After that, I could run the pdftools functions on them without that error message. Hope that helps. I got the idea from this SO question: Problems with Downloading pdf file using R
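As a minimal sketch of the fix described above, the helper below wraps download.file with mode = "wb" so the bytes are written unchanged. The function name download_pdf and the URL in the usage comment are hypothetical and only serve as illustration.

```r
# Hypothetical helper: download a file in binary mode so that the bytes
# are written to disk exactly as received.
download_pdf <- function(url, destfile) {
  # mode = "wb" writes in binary mode; on Windows a text-mode write
  # can alter the bytes and corrupt a PDF.
  status <- download.file(url, destfile = destfile, mode = "wb", quiet = TRUE)
  if (status != 0) {
    stop("download failed for ", url)
  }
  invisible(destfile)
}

# Usage (placeholder URL, not a real document):
# download_pdf("https://example.com/report.pdf", "report.pdf")
```

Files that were already saved in text mode stay corrupted, so they would need to be downloaded again rather than repaired in place.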

I got the same error message as yours:

pdf_toc(example_path)
PDF error (1151926): Illegal character <3a> in hex string
PDF error (1151929): Illegal character <73> in hex string
[...omitted for brevity...]
PDF error (1152006): Illegal character <22> in hex string
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't read xref table
Error in poppler_pdf_toc(loadfile(pdf), opw, upw) : PDF parsing failure.
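One way to find out which files are affected before building the corpus is to check whether each file starts with the PDF header bytes "%PDF-". The hex codes in the question's log (21, 4f, 54, 59, 50) are the ASCII codes for "!", "O", "T", "Y" and "P", which would be consistent with a file that begins with <!DOCTYPE, that is, an HTML page saved under a .pdf name; this is an inference from the log, not something confirmed in the thread. The sketch below assumes the header sits at the very start of the file, and the helper name is_pdf is hypothetical.

```r
# Hypothetical helper: TRUE if the file begins with the PDF header "%PDF-".
is_pdf <- function(path) {
  if (dir.exists(path)) {
    return(FALSE)
  }
  header <- readBin(path, what = "raw", n = 5L)
  identical(header, charToRaw("%PDF-"))
}

# Same file listing as in the question
files <- list.files(pattern = "pdf$")
ok <- vapply(files, is_pdf, logical(1))

# Files that fail the check and are worth re-downloading or inspecting
suspect <- files[!ok]
print(suspect)

# Only pass the files that look like real PDFs on to the corpus
files <- files[ok]
```

Some PDF readers tolerate a few stray bytes before the header, so a failed check is a strong hint rather than proof; a passed check also does not guarantee that the rest of the file is intact.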

Solution 1
1 2020-04-15 12:39:01