Extracting Page-level ASCII Text from a Collection of Multi-page PDFs?

Question

I am trying to get page-level ASCII text out of a series of multi-page PDFs. My current process is to split all of the PDFs in batch with Sejda (an awesome tool) and then extract text from the split PDFs (again as a Sejda batch) into corresponding text files. Is there an easy way to bypass the splitting phase and go straight to the page-level TXT files? I would like to input a collection of multi-page PDFs and output a corresponding TXT file for each page of each PDF. Any input or insight would be appreciated.

My process

File.pdf --> File-001.pdf; File-002.pdf; etc. --> File-001.txt; File-002.txt; etc.

Answer 1

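Sejda version 1.0.0.M8 has the task you are looking for: ExtractTextByPages. It extracts text page by page directly from the source PDF, so the separate splitting step is not needed.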

Example usage from the command line:

bin/sejda-console extracttextbypages -f /tmp/file.pdf -o /tmp -e "UTF-8" --pageNumbers 1 3 5
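
To cover a whole collection of PDFs rather than a single file, the command above can be wrapped in a small script. The following is only a sketch in Python, not part of Sejda. The helper names (build_command, extract_all), the one-output-folder-per-PDF layout, and the directory paths are assumptions made for illustration. It also assumes that leaving out --pageNumbers makes the task process every page, which should be checked against the console's own help output for your version. The names of the generated TXT files are decided by Sejda, not by this script.

```python
from pathlib import Path
import subprocess

# Path to the console launcher, copied from the example above; adjust to your install.
SEJDA = "bin/sejda-console"


def build_command(pdf, out_root, pages=None):
    """Build the sejda-console argument list for one PDF (hypothetical helper)."""
    # One output folder per input PDF, named after the file (a layout chosen here).
    out_dir = Path(out_root) / Path(pdf).stem
    cmd = [SEJDA, "extracttextbypages",
           "-f", str(pdf),
           "-o", str(out_dir),
           "-e", "UTF-8"]
    if pages:
        # Same flag as in the example above; omitted when no pages are given
        # (assumption: the task then handles all pages).
        cmd += ["--pageNumbers", *[str(p) for p in pages]]
    return cmd


def extract_all(pdf_dir, out_root, pages=None):
    """Run the extraction once per PDF found directly inside pdf_dir."""
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        cmd = build_command(pdf, out_root, pages)
        # Create the output folder up front as a precaution.
        Path(cmd[cmd.index("-o") + 1]).mkdir(parents=True, exist_ok=True)
        # check=True raises CalledProcessError if the console exits with an error.
        subprocess.run(cmd, check=True)
```

Calling extract_all("/tmp/pdfs", "/tmp/txt") would then run the task once for each PDF in /tmp/pdfs and place the results under /tmp/txt/<file name>/. Both paths are placeholders.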
