Extracting Page-level ASCII Text from a Collection of Multi-page PDFs?

I am trying to get page-level ASCII text out of a series of multi-page PDFs. My current process is to batch-split all of the PDFs with Sejda (an awesome tool) and then batch-extract text from the resulting single-page PDFs (also in Sejda) to corresponding text files. Is there an easy way to bypass the splitting phase and go straight to the page-level TXT files? I would like to input a collection of multi-page PDFs and output a corresponding TXT file for each page of each PDF. Any input or insight would be appreciated.

My process

File.pdf --> File-001.pdf; File-002.pdf; etc. --> File-001.txt; File-002.txt; etc

Sejda version 1.0.0.M8 has the task that you are looking for: ExtractTextByPages

Example usage from the command line:

bin/sejda-console extracttextbypages -f /tmp/file.pdf -o /tmp -e "UTF-8" --pageNumbers 1 3 5
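
To run this over a whole collection of PDFs, the command above can be wrapped in a small script. The following is a minimal sketch, not a tested recipe: the helper names `build_command` and `extract_all` are hypothetical, it only uses the options shown above (`-f`, `-o`, `-e`, `--pageNumbers`), and it assumes that leaving out `--pageNumbers` extracts every page, which you should verify against your Sejda version. Writing each PDF's output to its own subdirectory is a choice made here to avoid file name collisions, since the output file naming is not confirmed.

```python
import subprocess
from pathlib import Path

# Path to the console launcher as used in the answer; adjust to your install.
SEJDA = "bin/sejda-console"


def build_command(pdf, out_dir, encoding="UTF-8", pages=None):
    """Build the extracttextbypages command line for a single PDF."""
    cmd = [SEJDA, "extracttextbypages",
           "-f", str(pdf), "-o", str(out_dir), "-e", encoding]
    if pages:
        # Same form as the example: --pageNumbers 1 3 5
        cmd += ["--pageNumbers"] + [str(p) for p in pages]
    # Assumption: with no --pageNumbers, all pages are extracted.
    return cmd


def extract_all(pdf_dir, out_root, run=subprocess.run):
    """Run the task once per PDF found in pdf_dir (hypothetical helper)."""
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        # One output directory per PDF, named after the file.
        out_dir = Path(out_root) / pdf.stem
        out_dir.mkdir(parents=True, exist_ok=True)
        run(build_command(pdf, out_dir), check=True)
```

Calling `extract_all("/path/to/pdfs", "/path/to/txt")` would then invoke the console once per PDF, without the intermediate splitting step.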