
Python to search within a PDF file

Here is part of the PDF structure:

5 0 obj
<< /Length 56 >>
stream
BT /F1 12 Tf 100 700 Td 15 TL (JavaScript example) Tj ET
endstream
endobj
6 0 obj
<<
/Type /Font
/Subtype /Type1
/Name /F1
/BaseFont /Helvetica
/Encoding /MacRomanEncoding
>>
endobj
7 0 obj
<<
/Type /Action
/S /JavaScript

I want to check whether "javascript" is present in the file. The problem is that the word can be written with hex escapes for all or part of it, e.g. "javascript", "Jav#61Script", "J#61v#61Script", and so on.

So how can I find out whether "javascript" is present, given all of these possibilities?

Read it in a character at a time and translate any hex you find to characters as you go, also translating to lowercase. Compare the result to "javascript".
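
A minimal sketch of that idea for data that is already in memory, assuming the escapes always have the form '#' followed by two hex digits. The names HEX_ESCAPE and contains_keyword are made up for illustration:

```python
import re

# Matches a PDF name escape such as #61 (a '#' followed by two hex digits).
HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")

def contains_keyword(data: bytes, keyword: str = "javascript") -> bool:
    """Return True if keyword occurs in data once #xx escapes are decoded."""
    # Decode each escape to the single byte it stands for, in one pass.
    decoded = HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), data)
    # Compare case-insensitively by lowercasing both sides.
    return keyword.lower().encode("latin-1") in decoded.lower()

print(contains_keyword(b"/S /J#61v#61Script"))   # True
print(contains_keyword(b"/S /JavaScript"))       # True
print(contains_keyword(b"/S /URI"))              # False
```

This works on raw bytes, so it only sees names that are stored uncompressed in the file; content inside compressed streams would have to be decompressed first.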

Here's an idea:

import os
import re

def pdf_find_str(pdfname, needle):
  # Return the approximate byte offset of the first case-insensitive match of
  # needle in the file after decoding #xx hex escapes, or -1 if not found.
  pattern = re.compile(re.escape(needle.encode("latin-1")), re.I)
  hex_escape = re.compile(rb"#([0-9a-fA-F]{2})")

  # Read the file CHUNK_SIZE bytes at a time, keeping the last KEEP_SIZE bytes.
  CHUNK_SIZE = 2*1024*1024
  KEEP_SIZE = 3 * len(needle)  # each char might be in #ff form

  with open(pdfname, "rb") as f:
    offset = 0  # file offset at which the current chunk starts
    chunk = f.read(CHUNK_SIZE)
    while chunk:
      # Replace every #xx escape with the byte it represents.
      decoded = hex_escape.sub(lambda m: bytes([int(m.group(1), 16)]), chunk)

      m = pattern.search(decoded)
      if m:
        return offset + m.start()

      # Carry the last KEEP_SIZE bytes over to the next round of testing,
      # since our string may span chunks.
      next_chunk = f.read(CHUNK_SIZE - KEEP_SIZE)
      if not next_chunk:
        break
      tail = chunk[-KEEP_SIZE:]
      offset += len(chunk) - len(tail)
      chunk = tail + next_chunk

  return -1

# On one file:
#if pdf_find_str("Consciousness Explained.pdf", "javascript") != -1:
#  print('Contains "javascript"')

# Recursively on a directory:
for root, dirs, files in os.walk("Books"):
  for file in files:
    if file.endswith(".pdf"):
      position = pdf_find_str(os.path.join(root, file), "javascript")
      if position != -1:
        print(file, "(", position, ")")
# Note: position returned by pdf_find_str does not account for characters
# removed from #ff representations earlier in the same chunk (if any).

PDFs can be hard to deal with. I recommend that you look into something like PDFMiner.

Python to search a word/text/character within pdf file and print page number and position of its occurrence

  1. PyMuPDF
  2. PyPDF2

PyMuPDF

import fitz
import re

# load document
doc = fitz.open(r"srcfile_path")

# define keyterms
String = "P4F-21B"

# get text, search for string and print count on page.
for page in doc:
    text = page.get_text()
    # re.escape makes the term match literally rather than as a regex
    count = len(re.findall(re.escape(String), text))
    if count > 0:
        print(f'count on page {page.number + 1} is: {count}')

PyPDF2

# import packages
import PyPDF2
import re

# open the pdf file (PyPDF2 3.x API; releases before 3.0 used
# PdfFileReader, getNumPages(), getPage() and extractText())
reader = PyPDF2.PdfReader(r"srcfile_path")

# get number of pages
NumPages = len(reader.pages)

# define keyterms
String = "P4F-21B"

# extract text and do the search
for i in range(NumPages):
    PageObj = reader.pages[i]
    Text = PageObj.extract_text()
    # re.escape makes the term match literally rather than as a regex
    ResSearch = re.search(re.escape(String), Text)
    if ResSearch is not None:
        print(ResSearch)
        print("Page Number " + str(i + 1))

Steps to search a word/text/character within pdf file using Python

  1. Read the PDF file: open it with a PDF handling library (PyMuPDF or PyPDF2).
  2. Get the total number of pages in the PDF file.
  3. Extract the text of each page and search it with a regular expression.
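
The steps above can be sketched independently of the PDF library. This is only an illustration: it assumes the page texts have already been extracted into a list of strings (for example with one of the libraries above), and the helper name find_occurrences is hypothetical. It reports the page number and the character offset of each match within that page's text:

```python
import re

def find_occurrences(page_texts, term):
    """Return (page_number, offset) pairs for every match of term.

    page_texts is a list of strings, one per page; page numbers start at 1.
    """
    pattern = re.compile(re.escape(term))  # match the term literally
    hits = []
    for page_number, text in enumerate(page_texts, start=1):
        for match in pattern.finditer(text):
            hits.append((page_number, match.start()))
    return hits

# Stand-in page texts; in practice these come from the PDF library.
pages = ["Part P4F-21B is listed here.", "No match on this page.", "P4F-21B and P4F-21B"]
for page_number, offset in find_occurrences(pages, "P4F-21B"):
    print(f"page {page_number}, offset {offset}")
```

The offset is a position in the extracted text, not a coordinate on the page.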

For more details refer to my post in https://stackoverflow.com/a/70093305/17385292

