Using readPDF in R (tm package)

I'm a beginner at R and having a bit of trouble using the tm package. I need to extract specific data from pages 55 through 300 of this PDF and thought that R might be a good way to do so. (If anyone has a better idea, please let me know!) I did some searching and, after installing the tm package and the xpdf package, I tried reading this and tried zx8754's solution, with no luck. I suspect it has something to do with the readPDF command. I get the following:

Error in readPDF(PdftotextOptions = "-layout") : unused argument (PdftotextOptions = "-layout")

I think it has to do with trying to use the tm package and the xpdf package together, so I read Tony Breyal's solution (I can't post more than 2 links), added pdfinfo and pdftotext as environment variables (I'm on Windows 8), and restarted. I'm sure I'm missing something. Right now I have pdftotext.exe in my R working directory. Can anyone help me configure this correctly so that the tm package calls the xpdf executables and readPDF works as it should?

Again, I'm very new to this, so apologies if I'm way off. All help would be very much appreciated.

Thanks in advance,

Justin

To get you started, here is an example of a complete readPDF command for reading a PDF file. readPDF threw an error when I tried to retrieve the PDF file directly from the link you provided, so I downloaded the PDF file to my working directory first.

library(tm)

# File name
filename = "ea0607.pdf"

# Read the PDF file
doc <- readPDF(control = list(text = "-layout"))(elem = list(uri = filename),
                                                 language = "en",
                                                 id = "id1")

The code above converted the PDF file to text and stored the result in doc. doc is actually a list, as can be seen with the following code:

str(doc)

List of 2
 $ content: chr [1:23551] "  STATE UNIVERSITY SYSTEM OF FLORIDA" "" "EXPENDITURE ANALYSIS" "      2006-2007" ...
 $ meta   :List of 7
  ..$ author       : chr "greg.jacques"
  ..$ datetimestamp: POSIXlt[1:1], format: "2007-12-10 11:33:48"
  ..$ description  : NULL
  ..$ heading      : chr " PGM=EASUSI-V01                                        STATE UNIVERSITY SYSTEM                                                 "| __truncated__
  ..$ id           : chr "ea0607.pdf"
  ..$ language     : chr "en"
  ..$ origin       : chr "Acrobat PDFMaker 8.1 for Word"
  ..- attr(*, "class")= chr "TextDocumentMeta"
 - attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"

The text of the PDF file is stored in doc$content, while doc$meta includes various metadata about the PDF file. Each element of doc$content is one line of text extracted from the PDF file. Here are lines 300 through 310 of the extracted text:

doc$content[300:310]

 [1] ""                                                                                                                      
 [2] "and General (E&G) budget entity. The Expenditure Analysis continues to reflect special units separately and the"       
 [3] ""                                                                                                                      
 [4] "traditional program components and related activities have been further defined to support the funding formula. The"   
 [5] ""                                                                                                                      
 [6] "Expenditure Analysis format was revised in 1995-96 to include all activities in the funding formula as well as college"
 [7] ""                                                                                                                      
 [8] "detail by activity for the UF Health Science Center, the USF Health Science Center and the FSU Medical School. A"      
 [9] ""                                                                                                                      
[10] "definition of each follows:"                                                                                           
[11] ""    
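
Since the question asks for pages 55 through 300 rather than a range of lines, here is a minimal sketch of one way to group lines by page. It assumes that the form-feed characters pdftotext normally writes at page breaks are still present at the start of the first line of each new page in doc$content; I have not checked this against the file above, so treat it as an assumption. The toy vector lines and the page range 2:3 are placeholders for illustration only.

```r
# Toy stand-in for doc$content: three "pages" separated by form feeds ("\f").
# ASSUMPTION: the real doc$content keeps pdftotext's form feeds like this.
lines <- c("page one, line 1", "page one, line 2",
           "\fpage two, line 1", "page two, line 2",
           "\fpage three, line 1")

# A line containing a form feed starts a new page, so a running count of
# such lines (plus one) gives the page number of every line.
page_no <- cumsum(grepl("\f", lines, fixed = TRUE)) + 1L

# Strip the form feeds and split into a list with one element per page.
pages <- split(gsub("\f", "", lines, fixed = TRUE), page_no)

# Keep a range of pages (placeholder range; the question wants 55:300).
wanted <- unlist(pages[as.character(2:3)], use.names = FALSE)
print(wanted)
```

With the real document you would replace lines with doc$content and 2:3 with 55:300. Note that a line holding more than one form feed (for example after a blank page) is counted as a single page break here, so the page numbers would drift in that case.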

Hopefully that will help you get started.
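
On the configuration side of the question: the "unused argument" error suggests that the installed version of readPDF does not accept PdftotextOptions at all, which is why the call earlier in this answer passes "-layout" through control instead. If readPDF still fails after that change, a quick check is whether R can see the xpdf executables. The sketch below uses only base R; that readPDF looks the tools up on the PATH rather than in the working directory is my understanding, not something I have verified on Windows 8.

```r
# Ask R where it finds the xpdf command-line tools on its PATH.
tools <- Sys.which(c("pdftotext", "pdfinfo"))
print(tools)

# Sys.which() returns an empty string for anything it cannot find.
not_found <- names(tools)[!nzchar(tools)]
if (length(not_found) > 0) {
  message("Not found on PATH: ", paste(not_found, collapse = ", "))
} else {
  message("Both tools are visible to R.")
}
```

If either tool is reported as not found, adding the folder that contains pdftotext.exe and pdfinfo.exe to the PATH environment variable and restarting R is the usual fix.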
