I convert PDF files to Excel using Python. However, some rows in my PDF are taller than others, i.e. the variable names are longer and wrap onto the next line.
When I convert such PDFs, these longer strings are split across multiple rows in my Excel document.
Is there a way to import tables from a PDF more accurately and efficiently?
My code:
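import tabula

file = r"mydirectory.pdf"
# read_pdf returns a list of DataFrames; keep the first table
pdfData = tabula.read_pdf(file, pages="all")[0]
# write the tables found on all pages to a CSV file
tabula.convert_into(file, r"mydirectory.csv", pages="all")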
The following script uses PyMuPDF and solves this problem.
Solutions for more general table layouts exist as well; this one handles the case where every cell is enclosed by border lines.
"""
Example script: Using PyMuPDF for table analysis
-------------------------------------------------
This script extracts cell content for a table on a PDF page and outputs
a CSV file representing the table.

The script will work successfully if the following conditions are met:

1. The table is / can be identified by a boundary box (rectangle).
2. The table has a clean (row x column) format.
3. Each table cell is wrapped by line drawings.

The script executes the following steps:

Step 0: Identify a clip rectangle containing the table. This may work by
        identifying text keyword coordinates (the example presented here)
        or by whatever other mechanism.
Step 1: Extract x- and y-coordinates of vector graphic lines. They are
        used as cell borders to determine the right cell for each
        piece of text. Create a Python table with empty text cells.
Step 2: Extract page text pieces ("spans") within the clip and sort them
        by vertical, then horizontal coordinates. Sorting is required to
        ensure the correct sequence of multi-line table cell text content.
        For each text piece, append it to the respective cell text.
Step 3: Output the Python table as a CSV file.
"""
import fitz
# make minimal wrapping rectangles
fitz.TOOLS.set_small_glyph_heights(True)
doc = fitz.open("test.pdf")
page = doc[0] # first page
# -------------------------------------------------------------------------
# Step 0: Identify clip rectangle
# Look up top and bottom coordinates for relevant data
# -------------------------------------------------------------------------
top = page.search_for("Basic Project Information")[0].y1
bot = page.search_for("page")[0].y0
# so we extract info from the following rectangle
clip = fitz.Rect(0, top, page.rect.width, bot)
# -------------------------------------------------------------------------
# Step 1: Compute x-, y-coordinates of cell borders
# Find table border line coordinates
# -------------------------------------------------------------------------
paths = page.get_drawings() # all line art
vert = set() # vertical (x-) coordinates
hori = set() # horizontal (y-) coordinates
for p in paths:  # walk through vector graphics to find the lines
    if p["rect"].y0 < top or p["rect"].y1 > bot:  # omit stuff outside clip
        continue
    for item in p["items"]:  # look at lines and "thin" rectangles
        if item[0] == "l":  # a line
            p1, p2 = item[1:]
            if p1.x == p2.x:  # vertical line
                vert.add(p1.x)  # store column border
            elif p1.y == p2.y:  # horizontal line
                hori.add(p1.y)  # store row border
        elif item[0] == "re":  # a rectangle item
            rect = item[1]  # rect coordinates
            if rect.width <= 3 and rect.height > 10:
                vert.add(rect.x0)  # thin vertical rect: treat like col line
            elif rect.height <= 3 and rect.width > 10:
                hori.add(rect.y1)  # thin horizontal rect: treat like row line

vert = sorted(vert)  # sorted, without duplicates
hori = sorted(hori)  # sorted, without duplicates
# Define table cells with these values:
# * has len(hori)-1 rows
# * every row has len(vert)-1 columns
cells = [[""] * (len(vert) - 1) for j in range(len(hori) - 1)]
# -------------------------------------------------------------------------
# Step 2: Extract text spans
# Extract and sort text spans. We use the "dict" output format.
# -------------------------------------------------------------------------
# read text with all details into this list
spans = []
text = page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT, clip=clip)
for block in text["blocks"]:
    for line in block["lines"]:
        for span in line["spans"]:
            spans.append(span)  # a text dict; its cell is determined below
spans.sort(key=lambda s: (s["bbox"][3], s["bbox"][0]))
def getcoord(bbox, text):
    """Find row / col index for given text rect."""
    I = -1  # column index (looked up in vert)
    J = -1  # row index (looked up in hori)
    for i in range(len(vert) - 1):
        if vert[i] <= bbox.x0 < bbox.x1 <= vert[i + 1]:
            I = i
            break
    for j in range(len(hori) - 1):
        if hori[j] <= bbox.y0 < bbox.y1 <= hori[j + 1]:
            J = j
            break
    if I < 0 or J < 0:  # shouldn't happen: correct cell not found
        raise ValueError(J, I, "PROBLEM", text)
    return J, I  # row, col index
# put the text pieces into the cells
for s in spans:
    j, i = getcoord(fitz.Rect(s["bbox"]), s["text"])
    cells[j][i] += s["text"]  # append to stuff already in that cell
# -------------------------------------------------------------------------
# Step 3: Output the CSV file
# -------------------------------------------------------------------------
with open("table.csv", "w") as out:  # file is closed automatically
    for line in cells:
        out.write(";".join(line) + "\n")
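
The cell lookup of Step 2 can also be tried without a PDF at hand. What follows is a minimal sketch and not part of the script above: the border coordinates and span boxes are made up, `find_cell` is a hypothetical helper built on the standard `bisect` module, and each span is assigned by its center point, an assumption that tolerates text boxes slightly overlapping a border. Wrapped lines are joined with a space here (the script above concatenates them without a separator), and the standard `csv` module is used for output so that a `;` inside a cell gets quoted.

```python
import csv
import io
from bisect import bisect_right

# Hypothetical border coordinates, as Step 1 would collect them.
vert = [50.0, 150.0, 300.0]   # x-coordinates of column borders: 2 columns
hori = [100.0, 120.0, 160.0]  # y-coordinates of row borders: 2 rows

# Hypothetical text spans (x0, y0, x1, y1, text), already sorted by
# bottom, then left coordinate as in Step 2. The second table row
# holds a first-column text wrapped onto two lines.
spans = [
    (55.0, 102.0, 90.0, 118.0, "Name"),
    (155.0, 102.0, 200.0, 118.0, "Value"),
    (55.0, 122.0, 140.0, 138.0, "a long variable"),
    (155.0, 122.0, 180.0, 138.0, "1;2"),
    (55.0, 142.0, 120.0, 158.0, "name"),
]


def find_cell(bbox):
    """Return (row, col) of the cell containing the center of bbox."""
    x0, y0, x1, y1 = bbox
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    col = bisect_right(vert, cx) - 1  # index of the border left of cx
    row = bisect_right(hori, cy) - 1  # index of the border above cy
    if not (0 <= col < len(vert) - 1 and 0 <= row < len(hori) - 1):
        raise ValueError(f"span outside table grid: {bbox}")
    return row, col


cells = [[""] * (len(vert) - 1) for _ in range(len(hori) - 1)]
for *bbox, text in spans:
    row, col = find_cell(bbox)
    # join wrapped lines with a space (assumption, see above)
    cells[row][col] = (cells[row][col] + " " + text).strip()

# StringIO stands in for open("table.csv", "w", newline="")
buffer = io.StringIO()
csv.writer(buffer, delimiter=";").writerows(cells)
csv_text = buffer.getvalue()
print(csv_text)
```

With real data, `vert`, `hori` and the span boxes would come from `page.get_drawings()` and `page.get_text("dict", ...)` as in the script; the binary search only pays off for tables with many rows or columns.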