
Script to search for text from PDF

Original 2012-07-19 22:51:05 0 1 python/ macos/ parsing/ pdf/ tcl

Problem

On Mac OS X, I would like to write a script, in either Python or Tcl, to search for text within a PDF file and extract the relevant parts. I would appreciate any help.

Background

I am writing scripts that look inside a PDF to determine whether it is a bill, which company it is from, and what period it covers. Based on this information, I rename the PDF and move it to an appropriate directory. For example, a file such as Statement_03948293929384.pdf might become 2012-07-15 Water Bill.pdf and be moved to my Utilities folder.
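The rename-and-move step described above can be sketched in Python once the PDF's text is available as a string. This is only an illustration under stated assumptions: the keyword table RULES, the MM/DD/YYYY date pattern, and the function names plan_rename and file_pdf are all hypothetical and would need to be adapted to the actual statements.

```python
import re
import shutil
from datetime import datetime
from pathlib import Path

# Hypothetical rules: keyword found in the text -> (label, destination folder).
RULES = {
    "water": ("Water Bill", "Utilities"),
    "electric": ("Electric Bill", "Utilities"),
}

# Assumption: the statement date appears as MM/DD/YYYY somewhere in the text.
DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def plan_rename(text):
    """Return (new_filename, folder) for a recognized bill, or None."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    month, day, year = (int(group) for group in match.groups())
    try:
        date = datetime(year, month, day)
    except ValueError:
        # The digits matched the pattern but are not a real calendar date.
        return None
    lowered = text.lower()
    for keyword, (label, folder) in RULES.items():
        if keyword in lowered:
            return f"{date:%Y-%m-%d} {label}.pdf", folder
    return None


def file_pdf(pdf_path, text, root):
    """Move pdf_path under root/<folder>/<new name>; return the new path or None."""
    plan = plan_rename(text)
    if plan is None:
        return None
    name, folder = plan
    dest_dir = Path(root) / folder
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / name
    shutil.move(str(pdf_path), str(dest))
    return dest
```

Keeping plan_rename free of file-system side effects makes it easy to try the rules on sample text before any file is actually moved.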

What have I done so far?

I have searched for PDF-to-plain-text tools, but have not found anything yet.
I have looked at the Tcl wiki and found an example, but could not get it to work (I searched for text that is in the PDF, but it was not found).
I am looking into pdf-parser.py by Didier Stevens.
I have heard of a Python package called pyPdf and will look at it next.

Update

I have found a command-line tool called pdftotext, written by Glyph & Cog, LLC and built and packaged by Carsten Bluem. This tool is straightforward and solves my problem. I am still looking for tools that can search a PDF directly, without first converting it to a text file.
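For use from a Python script, pdftotext can be driven through the standard subprocess module. The sketch below assumes pdftotext is on the PATH and that the build accepts the -layout option and "-" as the output file (meaning standard output), as the xpdf and poppler versions document; the function names are illustrative only.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, keep_layout=True):
    """Build the argument list for a pdftotext call that writes to stdout."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to preserve the physical page layout.
        cmd.append("-layout")
    # "-" as the output file sends the extracted text to standard output.
    cmd += [str(pdf_path), "-"]
    return cmd


def extract_text(pdf_path):
    """Run pdftotext on pdf_path and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on the PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    # Assumption: output is UTF-8; undecodable bytes are replaced, not fatal.
    return result.stdout.decode("utf-8", errors="replace")
```

Reading from standard output avoids leaving an intermediate .txt file next to each PDF, although the text is still extracted before it is searched.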

1 answer

I have successfully used PyODConverter to convert to and from PDF (there is also a more powerful Java version). Once the PDF has been converted to text, searching it should be trivial. I also believe iText should be capable of similar things, but I haven't tested it.
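As a rough illustration of the searching step, here is a minimal Python sketch that works on text already extracted from a PDF by whatever converter is used. The function name find_lines and the sample text are made up for the example.

```python
import re


def find_lines(text, pattern, ignore_case=True):
    """Return (line_number, line) pairs for lines matching the regex pattern."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]


# Made-up sample standing in for text extracted from a bill.
sample = "Acme Water Co.\nAccount 123\nAmount due: $42.10\n"
print(find_lines(sample, r"amount due"))
```

Returning the line number along with the line makes it easier to pull out neighboring lines, such as the billing period printed just above or below a match.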

How to read Arabic text from PDF using Python script

Extract text from PDF

Search Multiple words from pdf

Search and replace for text within a pdf, in Python

Search and replace placeholder text in PDF with Python

unable to convert pdf to text using python script

Annotate specific text in pdf using ghost script

Extract text from a PDF with regex

Extract Text from MediaBox - PDF

Extract text from pdf to file


