R - iterate over pages in PDF

Question

I have a series of PDF files that contain various tables of data. I am only looking for a specific table in each and my goal is to find what page it is on for each file.

My planned approach is to iterate over each page, read the text and determine whether it is the page I'm looking for. If it is, return that page number; otherwise continue to the next page. I've been looking into pdftools, but it doesn't look like there is a way to loop through the pages.

Does anyone know of an R package that will help me achieve this, or is there a better way I can do this with pdftools?

Any help will be much appreciated!

Answer 1

pdftools can extract the text page by page: pdf_text() returns a character vector with one string per page. So the code may look like this:

library(pdftools)
txt <- pdf_text("something.pdf")

Now:

# text of the first page
txt[1]
# text of the second page, and so on
txt[2]

To find the page, loop over the elements of txt. For each page, split the string into words with strsplit() and check those words one by one against the word you are looking for. When a word matches, the index of the outer loop is the page number.
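
As a rough sketch of that search, the version below swaps the word-by-word loop for grepl() on each whole page string, which is shorter and also matches multi-word phrases. The function names find_pages and first_matching_page and the sample text are made up for illustration; only pdf_text() comes from pdftools.

```r
# Return the page numbers whose text contains `pattern`.
# `pages` is a character vector with one element per page,
# which is the shape pdf_text() returns.
find_pages <- function(pages, pattern, fixed = TRUE) {
  which(grepl(pattern, pages, fixed = fixed))
}

# Hypothetical wrapper: read one PDF and return the first matching page,
# or NA if no page matches.
first_matching_page <- function(path, pattern) {
  pages <- pdftools::pdf_text(path)
  hits <- find_pages(pages, pattern)
  if (length(hits) == 0) NA_integer_ else hits[1]
}

# Stand-in page text instead of a real file, just to show the idea
pages <- c("Introduction", "Table 3: Quarterly totals\n1 2 3", "Appendix")
find_pages(pages, "Quarterly totals")  # 2
```

For a series of files, something like sapply(files, first_matching_page, pattern = "Quarterly totals") should give one page number per file, assuming files is a character vector of PDF paths and the phrase is text that only appears on the page with your table.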

Let me know if this helps.

solution1
0 ACCEPTED 2017-01-19 15:34:47