Script to search for text from PDF

Problem

On the Mac OS X platform, I would like to write a script, in either Python or Tcl, to search for text within a PDF file and extract the relevant parts. I would appreciate any help.

Background

I am writing scripts that look inside a PDF to determine whether it is a bill, which company it is from, and what period it covers. Based on this information, I rename the PDF and move it to an appropriate directory. For example, a file such as Statement_03948293929384.pdf might become 2012-07-15 Water Bill.pdf and be moved to my Utilities folder.
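To make the goal concrete, here is a minimal Python sketch of the renaming step, assuming the PDF text has already been extracted somehow. The `RULES` table, the `plan_rename` name, and the MM/DD/YYYY date format are all assumptions made for illustration; real statements will need their own keywords and date patterns.

```python
import re
from datetime import datetime

# Hypothetical rules: keyword found in the text -> (label, destination folder).
RULES = {
    "water": ("Water Bill", "Utilities"),
    "electric": ("Electric Bill", "Utilities"),
}

def plan_rename(text):
    """Return (new_file_name, folder) for a recognised bill, or None."""
    # Assumption: the statement date appears as MM/DD/YYYY somewhere in the text.
    date_match = re.search(r"\b\d{2}/\d{2}/\d{4}\b", text)
    if date_match is None:
        return None
    try:
        date = datetime.strptime(date_match.group(0), "%m/%d/%Y")
    except ValueError:
        return None  # looked like a date but was not a valid one
    lowered = text.lower()
    for keyword, (label, folder) in RULES.items():
        if keyword in lowered:
            return f"{date:%Y-%m-%d} {label}.pdf", folder
    return None
```

The returned name and folder could then be handed to `shutil.move` to do the actual rename and move.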

What have I done so far?

  • I have searched for PDF-to-plain-text tools, but have not found anything yet.
  • I have looked at the Tcl wiki and found an example, but could not get it to work (I searched for text that is in the PDF, but it was not found).
  • I am looking into pdf-parser.py by Didier Stevens.
  • I have heard of a Python package called pyPdf and will look at it next.

Update

I have found a command-line tool called pdftotext, written by Glyph & Cog, LLC and built and packaged by Carsten Bluem. The tool is straightforward and solves my problem. I am still looking for tools that can search a PDF directly, without first converting it to a text file.
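For anyone who wants to call pdftotext from Python instead of the shell, a minimal sketch follows. It assumes the `pdftotext` executable is on the PATH; passing `-` as the output file sends the text to standard output, and `-layout` asks the tool to preserve the page layout. The function names are mine, not part of any library.

```python
import subprocess

def pdftotext_command(pdf_path, layout=False):
    """Build the argument list for pdftotext; "-" sends the text to stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # keep the physical page layout
    cmd += [pdf_path, "-"]
    return cmd

def pdf_to_text(pdf_path, layout=False):
    """Run pdftotext and return the extracted text (assumes it is on PATH)."""
    result = subprocess.run(
        pdftotext_command(pdf_path, layout),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout
```

With that in place, a check such as `"Water" in pdf_to_text("Statement_03948293929384.pdf")` avoids writing an intermediate text file.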

I have successfully used PyODConverter to convert to and from PDF (there is also a more powerful Java version). Once the PDF has been converted to text, searching it should be trivial. I also believe iText is capable of similar things, but I haven't tested it.
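As a rough illustration of the searching step, this Python sketch scans already-extracted text line by line with a regular expression. The `find_matches` name and the sample text are made up for the example; the text could come from any converter.

```python
import re

def find_matches(text, pattern):
    """Return (line_number, line) pairs for lines matching pattern, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]

# Made-up sample standing in for text extracted from a statement.
sample = "ACME Water Company\nStatement date: 07/15/2012\nAmount due: $42.10\n"
print(find_matches(sample, r"statement date"))
```

Returning line numbers along with the matching lines makes it easier to see where in the document a keyword was found.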
