
Reading PDF files with Python 3.6

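Is there any way to open and read PDF files with Python 3.6? I have tried several libraries and tools, such as PyPDF2 and pdfrw, but none of them could extract the text content of a PDF document. Any help would be greatly appreciated.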

Try: PyMuPDF

Python recipe: PDF text extraction using fitz / MuPDF (PyMuPDF):

#!/usr/bin/env python
"""
Created on Wed Jul 29 07:00:00 2015

@author: Jorj McKie
Copyright (c) 2015 Jorj X. McKie

The license of this program is governed by the GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007. See the "COPYING" file of this repository.

This is an example for using the Python binding PyMuPDF of MuPDF.

This program extracts the text of an input PDF and writes it in a text file.
The input file name is provided as a parameter to this script (sys.argv[1])
The output file name is input-filename appended with ".txt".
Encoding of the text in the PDF is assumed to be UTF-8.
Change the ENCODING variable as required.
"""
import fitz                 # this is PyMuPDF
import sys, json


def SortBlocks(blocks):
    '''
    Sort the blocks of a TextPage in ascending vertical pixel order,
    then in ascending horizontal pixel order.
    This should sequence the text in a more readable form, at least by
    convention of the Western hemisphere: from top-left to bottom-right.
    If you need something else, change the sortkey variable accordingly ...
    '''
    sblocks = []
    for b in blocks:
        x0 = str(int(b["bbox"][0]+0.99999)).rjust(4,"0") # x coord in pixels
        y0 = str(int(b["bbox"][1]+0.99999)).rjust(4,"0") # y coord in pixels
        sortkey = y0 + x0                                # = "yx"
        sblocks.append([sortkey, b])
    # sort on the key only: Python 3 cannot compare the dicts on a key tie
    sblocks.sort(key=lambda sb: sb[0])
    return [b[1] for b in sblocks] # return sorted list of blocks

def SortLines(lines):
    ''' Sort the lines of a block in ascending vertical direction. See comment
    in SortBlocks function.
    '''
    slines = []
    for l in lines:
        y0 = str(int(l["bbox"][1] + 0.99999)).rjust(4,"0")
        slines.append([y0, l])
    # sort on the key only: Python 3 cannot compare the dicts on a key tie
    slines.sort(key=lambda sl: sl[0])
    return [l[1] for l in slines]

def SortSpans(spans):
    ''' Sort the spans of a line in ascending horizontal direction. See comment
    in SortBlocks function.
    '''
    sspans = []
    for s in spans:
        x0 = str(int(s["bbox"][0] + 0.99999)).rjust(4,"0")
        sspans.append([x0, s])
    # sort on the key only: Python 3 cannot compare the dicts on a key tie
    sspans.sort(key=lambda ss: ss[0])
    return [s[1] for s in sspans]

# Main Program
ENCODING = "UTF-8"                               # change as required
ifile = sys.argv[1]
ofile = ifile + ".txt"

# NOTE: pageCount, loadPage and getText are the PyMuPDF names this 2015
# recipe was written against; recent PyMuPDF releases renamed them to
# page_count, load_page and get_text.
doc = fitz.Document(ifile)
pages = doc.pageCount
fout = open(ofile,"wb")                          # binary mode: encoded bytes are written

for i in range(pages):
    pg_text = ""                                 # initialize page text buffer
    pg = doc.loadPage(i)                         # load page number i
    text = pg.getText(output = 'json')           # get its text in JSON format
    pgdict = json.loads(text)                    # create a dict out of it
    blocks = SortBlocks(pgdict["blocks"])        # now re-arrange ... blocks
    for b in blocks:
        lines = SortLines(b["lines"])            # ... lines
        for l in lines:
            spans = SortSpans(l["spans"])        # ... spans
            for s in spans:
                # ensure that spans are separated by at least 1 blank
                # (should make sense in most cases)
                if pg_text.endswith(" ") or s["text"].startswith(" "):
                    pg_text += s["text"]
                else:
                    pg_text += " " + s["text"]
            pg_text += "\n"                      # separate lines by newline

    pg_text = pg_text.encode(ENCODING, "ignore")
    fout.write(pg_text)                          # write this page's text

fout.close()
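
The recipe above rests on two small ideas: ordering text pieces top-to-bottom and then left-to-right by their bounding box, and joining spans so that at least one blank separates them. The sketch below illustrates both on plain dictionaries shaped like the recipe's JSON output, so it runs without PyMuPDF installed. The names `reading_order_key`, `sort_reading_order` and `join_spans`, and the sample data, are hypothetical and not part of PyMuPDF. It swaps the recipe's zero-padded string key for a tuple of rounded-up numbers, on the assumption that a numeric key is preferable because the string key only orders correctly while each coordinate fits in four digits.

```python
import math


def reading_order_key(item):
    """Build a (y, x) sort key from the top-left corner of item["bbox"]."""
    x0, y0 = item["bbox"][0], item["bbox"][1]
    # Round up to whole pixels, similar in spirit to the recipe's int(v + 0.99999).
    return (math.ceil(y0), math.ceil(x0))


def sort_reading_order(items):
    """Return a new list ordered top-to-bottom, then left-to-right."""
    return sorted(items, key=reading_order_key)


def join_spans(spans):
    """Concatenate span texts, adding one blank where neither side has one."""
    text = ""
    for s in spans:
        # Unlike the recipe, no blank is added before the very first span.
        if text and not text.endswith(" ") and not s["text"].startswith(" "):
            text += " "
        text += s["text"]
    return text


# Hypothetical sample data: bbox is [x0, y0, x1, y1] as in the recipe.
blocks = [
    {"bbox": [72.0, 300.5, 200.0, 320.0], "text": "third"},
    {"bbox": [300.0, 100.2, 400.0, 120.0], "text": "second"},
    {"bbox": [72.0, 100.2, 200.0, 120.0], "text": "first"},
]

if __name__ == "__main__":
    print([b["text"] for b in sort_reading_order(blocks)])
    print(join_spans([{"text": "Hello"}, {"text": "world"}, {"text": " again"}]))
```

Because `sorted` is given a key function, items with identical keys are never compared directly, which avoids the dictionary comparison error that a bare list sort would raise in Python 3.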


Try using pdfrw 0.4.

Here is the link: https://pypi.python.org/pypi/pdfrw

