
R - iterate over pages in PDF

I have a series of PDF files that contain various tables of data. I am only looking for a specific table in each and my goal is to find what page it is on for each file.

My planned approach is to iterate over each page, read its text, and determine whether it is the page I'm looking for. If it is, return that page number; otherwise continue to the next page. I've been looking into pdftools, but it doesn't look like there is a way to loop through the pages.

Does anyone know of an R package that will help me achieve this, or is there a better way to do this with pdftools?

Any help will be much appreciated!

pdftools can extract the text page by page: pdf_text() returns a character vector with one string per page. The code might look like this:

library(pdftools)
txt <- pdf_text("something.pdf")

Now:

# first page text
txt[1]
# second page text, and so on
txt[2]

To search at the word level, split each page's string with strsplit() to get a vector of words per page. Then loop over the pages and, within each page, over its words. When a word matches your search term, record the index of the outer loop; that index is the page number.
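Here is a minimal sketch of that search. Note that it swaps the word-by-word strsplit() loop for grepl(), which checks every page string in one call and also matches phrases that span several words. The helper name find_pages, the sample text, and the search phrase "Quarterly revenue" are made up for illustration; in real use txt would come from pdf_text().

```r
# Hypothetical helper: return the page numbers whose text matches `pattern`.
# `pages` is a character vector with one element per page, which is the
# shape that pdf_text() returns. `pattern` is treated as a regular expression.
find_pages <- function(pages, pattern, ignore_case = TRUE) {
  which(grepl(pattern, pages, ignore.case = ignore_case))
}

# Stand-in for pdf_text("something.pdf") so the sketch runs without a file.
txt <- c(
  "Introduction and summary",
  "Table 3: Quarterly revenue\nQ1 100\nQ2 120",
  "Appendix"
)

hits <- find_pages(txt, "Quarterly revenue")

# First matching page, or NA if the phrase is not found on any page.
first_page <- if (length(hits) > 0) hits[1] else NA_integer_
print(first_page)

# For a whole folder of files (the folder name "pdfs" is an assumption):
# files <- list.files("pdfs", pattern = "\\.pdf$", full.names = TRUE)
# sapply(files, function(f) find_pages(pdftools::pdf_text(f), "Quarterly revenue")[1])
```

If the table title contains regex metacharacters such as parentheses or dots, escape them in the pattern so they are matched literally.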

Let me know if this works for your purpose.


 