Convert PDF to Excel in Python

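I am using Python to convert a PDF file to Excel. However, some rows in my PDF are taller than others: their variable names are longer and wrap onto the next line.

When I convert the PDF, these longer string variables end up split across multiple rows in my Excel document.

Is there a more efficient and more accurate way to import tables from a PDF?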


import tabula

file = r"mydirectory.pdf"

pdfData = tabula.read_pdf(file, pages="all")[0]

tabula.convert_into(file, r"mydirectory.csv", pages="all")

The following script uses PyMuPDF and solves this problem.


"""
Example script: Using PyMuPDF for table analysis

This script extracts the cell content of a table on a PDF page and outputs
a CSV file representing the table.

The script will work successfully if the following conditions are met:

1. The table is / can be identified by a boundary box (rectangle).
2. The table has a clean (row x column) format.
3. Each table cell is wrapped by line drawings.

The script executes the following steps:

Step 0: Identify a clip rectangle containing the table. This may work by
        identifying text keyword coordinates (the example presented here)
        or by whatever other mechanism.
Step 1: Extract x- and y-coordinates of vector graphic lines. They are
        used as cell borders to determine the right cell for each
        piece of text. Create a Python table with empty text cells.
Step 2: Extract page text pieces ("spans") within the clip and sort them
        by vertical, then horizontal coordinates. Sorting is required to
        ensure the correct sequence of multi-line table cell text content.
        For each text piece, append it to the respective cell text.
Step 3: Output the Python table as a CSV file.
"""
import fitz  # PyMuPDF

doc = fitz.open("test.pdf")
page = doc[0]  # first page

# -------------------------------------------------------------------------
# Step 0: Identify clip rectangle
# Look up top and bottom coordinates for relevant data
# -------------------------------------------------------------------------
top = page.search_for("Basic Project Information")[0].y1
bot = page.search_for("page")[0].y0

# so we extract info from the following rectangle
clip = fitz.Rect(0, top, page.rect.width, bot)

# -------------------------------------------------------------------------
# Step 1: Compute x-, y-coordinates of cell borders
# Find table border line coordinates
# -------------------------------------------------------------------------
paths = page.get_drawings()  # all line art
vert = set()  # vertical (x-) coordinates
hori = set()  # horizontal (y-) coordinates
for p in paths:  # walk through vector graphics to find the lines
    if p["rect"].y0 < top or p["rect"].y1 > bot:  # omit stuff outside clip
        continue
    for item in p["items"]:  # look at lines and "thin" rectangles
        if item[0] == "l":  # a line
            p1, p2 = item[1:]
            if p1.x == p2.x:  # vertical line
                vert.add(p1.x)  # store column border
            elif p1.y == p2.y:  # horizontal line
                hori.add(p1.y)  # store row border
        elif item[0] == "re":  # a rectangle item
            rect = item[1]  # rect coordinates
            if rect.width <= 3 and rect.height > 10:
                vert.add(rect.x0)  # thin vertical rect: treat like col line
            elif rect.height <= 3 and rect.width > 10:
                hori.add(rect.y1)  # treat like row line

vert = sorted(vert)  # sorted, without duplicates
hori = sorted(hori)  # sorted, without duplicates
# Define table cells with these values:
# * has len(hori)-1 rows
# * every row has len(vert)-1 columns
cells = [[""] * (len(vert) - 1) for j in range(len(hori) - 1)]

# -------------------------------------------------------------------------
# Step 2: Extract text spans
# Extract and sort text spans. We use the "dict" output format.
# -------------------------------------------------------------------------
# read text with all details into this list
spans = []

text = page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT, clip=clip)
for block in text["blocks"]:
    for line in block["lines"]:
        for span in line["spans"]:
            spans.append(span)  # collect every text span, whatever cell it is in

spans.sort(key=lambda s: (s["bbox"][3], s["bbox"][0]))

def getcoord(bbox, text):
    """Find row / col index for given text rect."""
    I = -1  # row index
    J = -1  # col index
    for i in range(len(vert) - 1):
        if vert[i] <= bbox.x0 < bbox.x1 <= vert[i + 1]:
            I = i
    for j in range(len(hori) - 1):
        if hori[j] <= bbox.y0 < bbox.y1 <= hori[j + 1]:
            J = j
    if I < 0 or J < 0:  # shouldn't happen: correct cell not found
        raise ValueError(J, I, "PROBLEM", text)
    return J, I  # row, col index

# put the text pieces into the cells
for s in spans:
    j, i = getcoord(fitz.Rect(s["bbox"]), s["text"])
    cells[j][i] += s["text"]  # append to stuff already in that cell

# -------------------------------------------------------------------------
# Step 3: Output the CSV file
# -------------------------------------------------------------------------
with open("table.csv", "w") as out:  # the file is closed automatically
    for line in cells:
        out.write(";".join(line) + "\n")
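
To see why this approach keeps a wrapped variable name in a single cell, here is a minimal sketch of Steps 1 to 3 that runs without a PDF. Everything in it is an assumption made for illustration: the border coordinates, the span tuples and the helper names `find_index` and `build_cells` are hypothetical and not part of PyMuPDF. It also swaps the linear search in `getcoord` for `bisect` and writes the output with the standard `csv` module, which quotes a cell if its text happens to contain the `;` delimiter.

```python
import csv
import io
from bisect import bisect_right

# Hypothetical border coordinates, as Step 1 would collect them.
vert = [50.0, 200.0, 400.0]   # x-coordinates of column borders
hori = [100.0, 140.0, 160.0]  # y-coordinates of row borders

# Hypothetical text spans as (x0, y0, x1, y1, text). The first cell of the
# first row holds a long name that wraps onto a second line.
spans = [
    (210.0, 102.0, 300.0, 112.0, "first"),
    (55.0, 102.0, 190.0, 112.0, "a very long variable name"),
    (55.0, 114.0, 120.0, 124.0, " continued"),
    (55.0, 142.0, 100.0, 152.0, "short"),
    (210.0, 142.0, 260.0, 152.0, "second"),
]


def find_index(borders, lo, hi):
    """Return i with borders[i] <= lo and hi <= borders[i + 1], else -1."""
    i = bisect_right(borders, lo) - 1
    if 0 <= i < len(borders) - 1 and hi <= borders[i + 1]:
        return i
    return -1


def build_cells(vert, hori, spans):
    """Append each span's text to the cell whose borders enclose it."""
    cells = [[""] * (len(vert) - 1) for _ in range(len(hori) - 1)]
    # Sort by bottom edge, then left edge, so wrapped lines stay in order.
    for x0, y0, x1, y1, text in sorted(spans, key=lambda s: (s[3], s[0])):
        col = find_index(vert, x0, x1)
        row = find_index(hori, y0, y1)
        if row < 0 or col < 0:
            raise ValueError(f"no cell found for {text!r}")
        cells[row][col] += text
    return cells


cells = build_cells(vert, hori, spans)

# Write to an in-memory buffer here; a real script would open a file.
buffer = io.StringIO()
csv.writer(buffer, delimiter=";", lineterminator="\n").writerows(cells)
csv_text = buffer.getvalue()
print(csv_text)
```

Both lines of the long name fall between the same pair of horizontal borders, so they are appended to the same cell instead of producing an extra row.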


