
PDF File Import R

I have multiple .pdf files (stored in a local folder) that contain text. I would like to import the .pdf files (i.e., their texts) into R. I applied the function 'read_dir' (R package: textreadr):

library("textreadr")
Data <- read_dir("<MY PATH>")

The function works well, BUT for several files that include special characters (i.e., letters) in their names (such as 'ć'; e.g., 'filenameć.pdf'), it did not work (error message: 'The following files failed to read in and were removed: ' …).

What can I do?

I tried to rename the files via R, but that did not work (probably for the same reason). Renaming might otherwise be a workaround.

I do not want to rename the files manually :)

Follow-up (only for experts): For several files, I got one of the following error messages (and I have no idea why):

PDF error: Mismatch between font type and embedded font file

or

PDF error: Couldn't find trailer dictionary

Any suggestions or hints on how to solve this issue?

The issue most likely concerns the encoding of the file names. If you really want R to rename the files for you, the function to use is iconv: determine the encoding of the file names, then convert them to UTF-8.
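As a rough illustration of the iconv route, here is a minimal sketch. Note that it goes one step further than converting to UTF-8: it transliterates the names to plain ASCII so that letters such as 'ć' disappear from the file names altogether. The helper names `sanitize_names` and `rename_pdfs` are made up for this example, and `from = "UTF-8"` is only an assumption; inspect `Encoding()` of your file names or your locale to pick the right source encoding. The result of `//TRANSLIT` differs between platforms, which is why leftover odd characters are replaced with `_` afterwards.

```r
# Hypothetical helper (not part of textreadr): map file names to plain ASCII.
# `from` is an assumption about how the names are currently encoded.
sanitize_names <- function(x, from = "UTF-8") {
  # Try transliteration first; what a letter maps to is platform dependent
  out <- iconv(x, from = from, to = "ASCII//TRANSLIT")
  # Where transliteration fails (NA), replace each non-convertible byte
  failed <- is.na(out)
  out[failed] <- iconv(x[failed], from = from, to = "ASCII", sub = "_")
  # Keep only letters, digits, dot, underscore, space and hyphen
  gsub("[^A-Za-z0-9._ -]", "_", out)
}

# Hypothetical wrapper: rename every PDF in `dir` whose name would change.
rename_pdfs <- function(dir, from = "UTF-8") {
  old <- list.files(dir, pattern = "\\.pdf$", ignore.case = TRUE)
  new <- sanitize_names(old, from = from)
  # Refuse to continue if two files would end up with the same name
  stopifnot(!anyDuplicated(new))
  changed <- old != new
  file.rename(file.path(dir, old[changed]), file.path(dir, new[changed]))
}
```

Calling `rename_pdfs("<MY PATH>")` would then rename the affected files in place, so it is worth trying it on a copy of the folder first.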

However, a much better approach would be to rename them using bash from the command line. Can you provide a more complete set of examples?
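For the follow-up, it may help to first find out exactly which files are affected. The sketch below reads the files one at a time and records the ones that raise an R error instead of stopping. `read_each` is a hypothetical helper, and the reader is passed in as an argument; `pdftools::pdf_text` in the comment is an assumption about which reader you have installed. Messages that the PDF backend only prints to the console, without raising an R error, will not be captured this way.

```r
# Hypothetical helper: apply `reader` to each path and keep going on failure.
read_each <- function(paths, reader) {
  out <- lapply(paths, function(p) {
    tryCatch(
      list(ok = TRUE, value = reader(p)),
      # On error, store the message instead of the text
      error = function(e) list(ok = FALSE, value = conditionMessage(e))
    )
  })
  ok <- vapply(out, function(r) r$ok, logical(1))
  values <- lapply(out, function(r) r$value)
  names(values) <- basename(paths)
  # `text`: successfully read files; `problems`: error message per failed file
  list(text = values[ok], problems = unlist(values[!ok]))
}

# Possible usage, assuming the pdftools package is installed:
# files <- list.files("<MY PATH>", pattern = "\\.pdf$", full.names = TRUE)
# res <- read_each(files, pdftools::pdf_text)
# res$problems
```

The names of `res$problems` then tell you which files to inspect or re-export individually.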
