R script - PDF error: Illegal character in hex string when searching for keywords
I am trying to count how often a set of keywords occurs in multiple PDF files.
library(tm)
library(pdftools)
files <- list.files(pattern = "pdf$")
Rpdf <- readPDF(control = list(text = "-layout"))
corp <- Corpus(URISource(files), readerControl = list(reader = Rpdf))
words <- c("example", "keyword", "test")
dt <- DocumentTermMatrix(corp, control=list(dictionary=words))
When I run the code, I always get these errors:
PDF error: May not be a PDF file (continuing anyway)
PDF error (3): Illegal character <21> in hex string
PDF error (5): Illegal character <4f> in hex string
PDF error (7): Illegal character <54> in hex string
PDF error (8): Illegal character <59> in hex string
PDF error (9): Illegal character <50> in hex string
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't read xref table
Error in poppler_pdf_text(loadfile(pdf), opw, upw) : PDF parsing failure.
In addition: There were 12 warnings (use warnings() to see them)
If you have any suggestions, please let me know. Thank you!
I guess your PDFs are binary files and should therefore be downloaded/read as binary files. I had a similar issue when downloading PDF files with download.file: after downloading them, I couldn't extract any information from the PDFs using pdftools. I discovered that my PDFs were binary files and were broken because I hadn't downloaded them in the proper format (try opening your PDF in any PDF reader; it should report that the file is broken).
Since I use Windows as my OS, I added mode="wb" to download.file to make sure it stores the files in the right format. I could then run the pdftools functions on them without that error message. Hope that helps somehow. I got the idea from this SO question: Problems with Downloading pdf file using R
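A minimal sketch of the binary download described above. The helper name download_binary and the URL in the usage comment are hypothetical, chosen only for illustration; download.file and its mode argument are base R.

```r
# Download a file without altering its bytes.
# download.file defaults to mode = "w" (text mode); on Windows that can
# rewrite line endings and corrupt binary files such as PDFs.
download_binary <- function(url, dest) {
  status <- download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
  if (status != 0) stop("download failed for ", url)
  invisible(dest)
}

# Usage (hypothetical URL and file name):
# download_binary("https://example.com/report.pdf", "report.pdf")
# txt <- pdftools::pdf_text("report.pdf")
```

If the files were already downloaded in text mode, they would need to be downloaded again; changing the mode does not repair an existing file.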
Same error message as yours:
pdf_toc(example_path)
PDF error (1151926): Illegal character <3a> in hex string
PDF error (1151929): Illegal character <73> in hex string
[...omitted for brevity...]
PDF error (1152006): Illegal character <22> in hex string
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't read xref table
Error in poppler_pdf_toc(loadfile(pdf), opw, upw) : PDF parsing failure.
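To find out which files are affected before passing them to tm or pdftools, one option is to check whether each file starts with the PDF signature. The sketch below assumes the signature "%PDF-" sits at the very first byte, which is the usual case; looks_like_pdf is a hypothetical helper name. As a side note, the bytes flagged in the question's log (21, 4f, 54, 59, 50) are ASCII for "!", "O", "T", "Y", "P", which would be consistent with an HTML page having been saved under a .pdf name, although that is only a guess.

```r
# Return TRUE if the file begins with the PDF signature "%PDF-".
looks_like_pdf <- function(path) {
  header <- readBin(path, what = "raw", n = 5L)
  identical(header, charToRaw("%PDF-"))
}

# Check every PDF in the working directory, as in the question's script.
files <- list.files(pattern = "pdf$")
ok <- vapply(files, looks_like_pdf, logical(1))
files[!ok]  # candidates for re-downloading or closer inspection
```

Files that fail this check will not parse regardless of the reader settings, so they are worth fixing or excluding before building the corpus.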