
Convert PDF to Excel in Python

I convert PDF files to Excel using Python. However, some rows in my PDF are taller than others: variable names that are too long wrap onto the next line.

When I convert such PDFs, each wrapped name is split across multiple rows in the Excel document.

Is there a way to import tables from a PDF more efficiently and accurately?

My code:

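import tabula

file = r"mydirectory.pdf"

# read_pdf returns a list of DataFrames; keep the first table
pdfData = tabula.read_pdf(file, pages="all")[0]

# write all tables found in the PDF to a CSV file
tabula.convert_into(file, r"mydirectory.csv", pages="all")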

The following script uses PyMuPDF and solves this problem.

Solutions for more general table layouts also exist, but this one handles the case where every cell is enclosed by drawn border lines.
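
Before the full script, the core idea can be sketched without PyMuPDF: each piece of text is assigned to the cell whose border coordinates enclose its bounding box, so the wrapped lines of a long name end up in the same cell. The coordinates, the text pieces and the helper name `find_slot` below are made up for illustration; they are not part of PyMuPDF or of the script.

```python
from bisect import bisect_right


def find_slot(borders, lo, hi):
    """Return i such that borders[i] <= lo and hi <= borders[i + 1], else -1.

    Assumes `borders` is sorted and lo < hi.
    """
    i = bisect_right(borders, lo) - 1
    if 0 <= i < len(borders) - 1 and hi <= borders[i + 1]:
        return i
    return -1


# Hypothetical border coordinates of a 2 x 2 table
vert = [0, 100, 200]  # x-coordinates of column borders
hori = [0, 40, 80]    # y-coordinates of row borders

# Hypothetical text pieces: (x0, y0, x1, y1, text)
spans = [
    (5, 5, 90, 15, "a very long variable "),
    (5, 20, 60, 30, "name"),  # wrapped second line of the same cell
    (105, 5, 150, 15, "1.5"),
    (5, 45, 50, 55, "short"),
    (105, 45, 150, 55, "2.0"),
]

cells = [[""] * (len(vert) - 1) for _ in range(len(hori) - 1)]

# Sort by bottom edge, then left edge, so wrapped lines are appended in order
for x0, y0, x1, y1, text in sorted(spans, key=lambda s: (s[3], s[0])):
    col = find_slot(vert, x0, x1)
    row = find_slot(hori, y0, y1)
    if row >= 0 and col >= 0:
        cells[row][col] += text

print(cells)
```

Using `bisect` instead of a linear scan is only a design choice for this sketch; the script itself loops over the borders, which is just as valid for small tables.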

"""
Example script: Using PyMuPDF for table analysis
-------------------------------------------------

This script extracts cell content for a table on a PDF page and outputs
a CSV file representing the table.

The script will work successfully if the following conditions are met:

1. The table is / can be identified by a boundary box (rectangle).
2. Table has a clean (row x column) format.
3. Each table cell is wrapped by line drawings.

The script executes the following steps:

Step 0: Identify a clip rectangle containing the table. This may work by
        identifying text keyword coordinates (the example presented here)
        or by whatever other mechanism.
Step 1: Extract x- and y-coordinates of vector graphic lines. They serve
        as cell borders to determine the right cell for each piece of
        text. Create a Python table with empty text cells.
Step 2: Extract page text pieces ("spans") within the clip and sort them
        by vertical, then horizontal coordinates. Sorting is required to
        ensure correct sequence of multi-line table cell text content.
        For each text piece, append it to the respective cell text.
Step 3: Output Python table as CSV file.
"""
import fitz

# make minimal wrapping rectangles
fitz.Tools().set_small_glyph_heights(True)

doc = fitz.open("test.pdf")
page = doc[0]  # first page

# -------------------------------------------------------------------------
# Step 0: Identify clip rectangle
# Look up top and bottom coordinates for relevant data
# -------------------------------------------------------------------------
top = page.search_for("Basic Project Information")[0].y1
bot = page.search_for("page")[0].y0

# so we extract info from the following rectangle
clip = fitz.Rect(0, top, page.rect.width, bot)

# -------------------------------------------------------------------------
# Step 1: Compute x-, y-coordinates of cell borders
# Find table border line coordinates
# -------------------------------------------------------------------------
paths = page.get_drawings()  # all line art
vert = set()  # vertical (x-) coordinates
hori = set()  # horizontal (y-) coordinates
for p in paths:  # walk through vector graphics to find the lines
    if p["rect"].y0 < top or p["rect"].y1 > bot:  # omit stuff outside clip
        continue
    for item in p["items"]:  # look at lines and "thin" rectangles
        if item[0] == "l":  # a line
            p1, p2 = item[1:]
            if p1.x == p2.x:  # vertical line
                vert.add(p1.x)  # store column border
            elif p1.y == p2.y:  # horizontal line
                hori.add(p1.y)  # store row border
        elif item[0] == "re":  # a rectangle item
            rect = item[1]  # rect coordinates
            if rect.width <= 3 and rect.height > 10:
                vert.add(rect.x0)  # thin vertical rect: treat like col line
            elif rect.height <= 3 and rect.width > 10:
                hori.add(rect.y1)  # treat like row line

vert = sorted(vert)  # sorted, without duplicates
hori = sorted(hori)  # sorted, without duplicates
# Define table cells with these values:
# * has len(hori)-1 rows
# * every row has len(vert)-1 columns
cells = [[""] * (len(vert) - 1) for _ in range(len(hori) - 1)]

# -------------------------------------------------------------------------
# Step 2: Extract text spans
# Extract and sort text spans. We use the "dict" output format.
# -------------------------------------------------------------------------
# read text with all details into this list
spans = []

text = page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT, clip=clip)
for block in text["blocks"]:
    for line in block["lines"]:
        for span in line["spans"]:
            spans.append(span)  # span dict (text + bbox); its cell is determined later

spans.sort(key=lambda s: (s["bbox"][3], s["bbox"][0]))


def getcoord(bbox, text):
    """Find row / col index for given text rect."""
    I = -1  # column index (slot between vertical borders)
    J = -1  # row index (slot between horizontal borders)
    for i in range(len(vert) - 1):
        if vert[i] <= bbox.x0 < bbox.x1 <= vert[i + 1]:
            I = i
            break
    for j in range(len(hori) - 1):
        if hori[j] <= bbox.y0 < bbox.y1 <= hori[j + 1]:
            J = j
            break
    if I < 0 or J < 0:  # shouldn't happen: correct cell not found
        raise ValueError(J, I, "PROBLEM", text)
    return J, I  # row, col index


# put the text pieces into the cells
for s in spans:
    j, i = getcoord(fitz.Rect(s["bbox"]), s["text"])
    cells[j][i] += s["text"]  # append to stuff already in that cell

# -------------------------------------------------------------------------
# Step 3: Output the CSV file
# -------------------------------------------------------------------------
# the context manager closes the file even if writing fails
with open("table.csv", "w") as out:
    for line in cells:
        out.write(";".join(line) + "\n")
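
One caveat of joining cells with `";"` is that a cell containing a semicolon or a line break would corrupt the output. A possible alternative, sketched here with made-up cell contents and a hypothetical helper `write_table`, is to let the standard-library `csv` module quote such cells:

```python
import csv
import io

# Hypothetical cell contents; note the ";" and the line break inside cells
cells = [
    ["Name", "Value"],
    ["width; height", "10"],
    ["multi\nline", "20"],
]


def write_table(rows, stream, delimiter=";"):
    """Write rows to a text stream, quoting cells where necessary."""
    writer = csv.writer(stream, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)


# Write into an in-memory buffer so the sketch runs without touching disk
buffer = io.StringIO()
write_table(cells, buffer)
csv_text = buffer.getvalue()

# Reading the text back recovers the original cells unchanged
round_trip = list(csv.reader(io.StringIO(csv_text, newline=""), delimiter=";"))
```

To write a real file instead of a buffer, pass a file object opened with `open("table.csv", "w", newline="")` to `write_table`.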
