
R script - PDF error: Illegal character in hex string; when I am searching for keywords

I am trying to count the number of keywords in multiple PDF files.

library(tm)
library(pdftools)

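# Collect every file in the working directory whose name ends in "pdf"
files <- list.files(pattern = "pdf$")
# Build a PDF reader and load the files into a corpus
Rpdf <- readPDF(control = list(text = "-layout"))
corp <- Corpus(URISource(files), readerControl = list(reader = Rpdf))

# Count only the listed keywords in each document
words <- c("example", "keyword", "test")
dt <- DocumentTermMatrix(corp, control=list(dictionary=words))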

When I run the code I always get these errors:

PDF error: May not be a PDF file (continuing anyway)
PDF error (3): Illegal character <21> in hex string
PDF error (5): Illegal character <4f> in hex string
PDF error (7): Illegal character <54> in hex string
PDF error (8): Illegal character <59> in hex string
PDF error (9): Illegal character <50> in hex string
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't read xref table
Error in poppler_pdf_text(loadfile(pdf), opw, upw) : PDF parsing failure.
In addition: There were 12 warnings (use warnings() to see them)

If you have any suggestions, please let me know. Thank you!

I guess your PDFs are binary files and should therefore be downloaded and read as binary files. I had a similar issue when downloading PDF files with download.file: after downloading them, I couldn't extract any information from them with pdftools. It turned out that the files were broken because I hadn't downloaded them in binary mode (try opening one in any PDF reader; it should report that the file is damaged).

On Windows, I added mode="wb" to the download.file call to make sure the files are stored in the right format. After that, I could run the pdftools functions on them without that error message. Hope that helps. I got the idea from this SO question: Problems with Downloading pdf file using R
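A minimal sketch of the binary-mode download described above. The helper name download_pdf and the URL in the usage note are hypothetical and not from the original post; only download.file and its mode argument come from the answer.

```r
# Hypothetical helper: download a file in binary mode so that its bytes
# are written unchanged (relevant on Windows, where text mode can alter them).
download_pdf <- function(url, destfile) {
  status <- download.file(url, destfile, mode = "wb", quiet = TRUE)
  # download.file returns 0 on success
  if (status != 0) {
    stop("Download failed for ", url)
  }
  invisible(destfile)
}
```

It could be called as download_pdf("https://example.com/report.pdf", "report.pdf"), where both arguments are placeholders. Files that were already saved in the wrong mode need to be downloaded again; they cannot be repaired by re-reading them.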

Same error message as yours:

pdf_toc(example_path)
PDF error (1151926): Illegal character <3a> in hex string
PDF error (1151929): Illegal character <73> in hex string
[...omitted for brevity...]
PDF error (1152006): Illegal character <22> in hex string
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't read xref table
Error in poppler_pdf_toc(loadfile(pdf), opw, upw) : PDF parsing failure.
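Before parsing, it can help to check which files are actually PDFs. The sketch below assumes a strict test: the file must begin with the bytes "%PDF-". The helper name is_pdf is hypothetical. As a side note, the hex codes in the question's log (21, 4f, 54, 59, 50) are the ASCII characters !, O, T, Y and P, which would be consistent with a file that starts with "<!DOCTYPE", that is, an HTML page saved under a .pdf name; that is only a guess from the log.

```r
# Hypothetical helper: TRUE if the file starts with the PDF signature "%PDF-".
# This is a strict check at byte 0; some readers tolerate leading junk.
is_pdf <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  header <- readBin(con, what = "raw", n = 5L)
  identical(header, charToRaw("%PDF-"))
}

# Check every *.pdf file in the working directory and list the suspicious ones
files <- list.files(pattern = "\\.pdf$")
ok <- vapply(files, is_pdf, logical(1))
bad_files <- files[!ok]
print(bad_files)
```

Any file listed in bad_files is worth opening in a text editor or PDF reader, and the corpus can then be built from files[ok] only.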
