
R - iterate over pages in PDF

I have a series of PDF files that contain various tables of data. I am only looking for one specific table in each file, and my goal is to find which page it is on in each file.

My planned approach is to iterate over each page, read the text, and determine whether it is the page I'm looking for. If it is, return that page number; otherwise continue to the next page. I've been looking into pdftools, but it doesn't look like there is a way to loop through the pages.

Does anyone know of an R package that will help me achieve this, or is there a better way to do this with pdftools?

Any help will be much appreciated!

I think pdftools has a way to extract the text so that you get one string per page. So the code may look like this:

library(pdftools)
txt <- pdf_text("something.pdf")

Now:

# first page text
txt[1]
# second page text, and so on
txt[2]
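Because each element of txt is one page, the page count and a per-page loop follow directly from the vector. A minimal sketch, assuming a made-up vector of page strings stands in for the output of pdf_text():

```r
# Stand-in for pdf_text("something.pdf"): one string per page (made-up content).
txt <- c("Cover page", "Table 1: Revenue by region", "Appendix")

# The number of pages is simply the length of the vector.
n_pages <- length(txt)

# Loop over the pages by index; i is the page number.
for (i in seq_along(txt)) {
  cat("Page", i, "has", nchar(txt[i]), "characters\n")
}
```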

To extract the words from each string, use strsplit() to build a vector of words for each page, then search page by page and, within each page, word by word. Once a word matches the one you are looking for, take the index of the outer loop as the page number.
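The loop described above could be sketched roughly as follows. The helper name find_page_by_word and the sample page strings are made up for illustration; in real use, pages would be the vector returned by pdf_text().

```r
# Hypothetical helper: returns the first page whose words include `word`,
# or NA if no page matches. `pages` is a character vector, one string per page.
find_page_by_word <- function(pages, word) {
  for (i in seq_along(pages)) {
    # Split the page text on runs of whitespace to get its words.
    words <- strsplit(pages[i], "\\s+")[[1]]
    for (w in words) {
      if (identical(w, word)) {
        return(i)  # the outer loop index is the page number
      }
    }
  }
  NA_integer_
}

# Made-up page strings standing in for pdf_text() output.
pages <- c("Cover page", "Summary of results", "Table 3 Revenue by region")
page_found <- find_page_by_word(pages, "Revenue")
```

Word-by-word matching only works for a single word with no attached punctuation. If the table is identified by a phrase, a vectorised search such as grep("Table 3", pages, fixed = TRUE) may be simpler, since it returns the indices of all matching pages directly. To cover several files, the helper could be applied to pdftools::pdf_text(f) for each file path f, for example with sapply().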

Let me know if this helps your purpose.
