Convert PDF to Excel in Python
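I am using Python to convert a PDF file to Excel. However, some rows in my PDF are taller than others: the variable names in them are longer and wrap onto the next line.
When I convert the PDF, these longer string values are split across multiple rows in my Excel document.
Is there a more efficient and more accurate way to import tables from a PDF?
My code: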
import tabula
file = r"mydirectory.pdf"
pdfData = tabula.read_pdf(file,pages="all") [0]
tabula.convert_into(file, r"mydirectory.csv", pages = "all")
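To make the symptom concrete: a wrapped cell typically shows up in the CSV as an extra row in which only the wrapped column has text. The sketch below is one possible post-processing step, not part of tabula. It assumes the rows were read back as plain lists of strings (for example with `csv.reader`) and that some column is always filled on real rows; the names `merge_wrapped_rows`, `anchor_col` and `sep` are hypothetical.

```python
def merge_wrapped_rows(rows, anchor_col=0, sep=" "):
    """Fold continuation rows into the row above them.

    Assumption: a row whose anchor column is empty is the continuation
    of a wrapped cell in the previous row. All rows have equal length.
    """
    merged = []
    for row in rows:
        cells = ["" if c is None else str(c).strip() for c in row]
        if merged and not cells[anchor_col]:
            prev = merged[-1]
            for i, text in enumerate(cells):
                if text:  # append only the pieces that actually wrapped
                    prev[i] = (prev[i] + sep + text).strip()
        else:
            merged.append(cells)
    return merged


rows = [
    ["1", "short_name", "10"],
    ["2", "a_very_long_variable", "20"],
    ["", "name_that_wraps", ""],  # continuation of row 2
    ["3", "other", "30"],
]
for row in merge_wrapped_rows(rows):
    print(row)
```

This heuristic only works when the table really has such an always-filled column. If the wrapped text was broken in the middle of a word, pass `sep=""` instead of a space.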
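The following script uses PyMuPDF and solves this problem.
More general solutions exist for other table layouts, but this one handles the case where every cell is enclosed by border lines.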
"""
Example script: Using PyMuPDF for table analysis
-------------------------------------------------
This script extracts cell content for a table on a PDF page and outputs
a CSV file representing the table.
The script will work successfully if the following conditions are met:
1. The table is / can be identified by a boundary box (rectangle).
2. Table has a clean (row x column) format.
3. Each table cell is wrapped by line drawings.
The script executes the following steps:
Step 0: Identify a clip rectangle containing the table. This may work by
        identifying text keyword coordinates (the example presented here)
        or by whatever other mechanism.
Step 1: Extract x- and y-coordinates of vector graphic lines. They are
        being used as cell borders to determine the right cell for each
        piece of text. Create a Python table with empty text cells.
Step 2: Extract page text pieces ("spans") within the clip and sort them
        by vertical, then horizontal coordinates. Sorting is required to
        ensure correct sequence of multi-line table cell text content.
        For each text piece, append it to the respective cell text.
Step 3: Output Python table as CSV file.
"""
import fitz
# make minimal wrapping rectangles
fitz.Tools().set_small_glyph_heights(True)
doc = fitz.open("test.pdf")
page = doc[0] # first page
# -------------------------------------------------------------------------
# Step 0: Identify clip rectangle
# Look up top and bottom coordinates for relevant data
# -------------------------------------------------------------------------
top = page.search_for("Basic Project Information")[0].y1
bot = page.search_for("page")[0].y0
# so we extract info from the following rectangle
clip = fitz.Rect(0, top, page.rect.width, bot)
# -------------------------------------------------------------------------
# Step 1: Compute x-, y-coordinates of cell borders
# Find table border line coordinates
# -------------------------------------------------------------------------
paths = page.get_drawings() # all line art
vert = set() # vertical (x-) coordinates
hori = set() # horizontal (y-) coordinates
for p in paths:  # walk through vector graphics to find the lines
    if p["rect"].y0 < top or p["rect"].y1 > bot:  # omit stuff outside clip
        continue
    for item in p["items"]:  # look at lines and "thin" rectangles
        if item[0] == "l":  # a line
            p1, p2 = item[1:]
            if p1.x == p2.x:  # vertical line
                vert.add(p1.x)  # store column border
            elif p1.y == p2.y:  # horizontal line
                hori.add(p1.y)  # store row border
        elif item[0] == "re":  # a rectangle item
            rect = item[1]  # rect coordinates
            if rect.width <= 3 and rect.height > 10:
                vert.add(rect.x0)  # thin vertical rect: treat like col line
            elif rect.height <= 3 and rect.width > 10:
                hori.add(rect.y1)  # treat like row line
vert = sorted(list(vert)) # sorted, without duplicates
hori = sorted(list(hori)) # sorted, without duplicates
# Define table cells with these values:
# * has len(hori)-1 rows
# * every row has len(vert)-1 columns
cells = [[""] * (len(vert) - 1) for j in range(len(hori) - 1)]
# -------------------------------------------------------------------------
# Step 2: Extract text spans
# Extract and sort text spans. We use the "dict" output format.
# -------------------------------------------------------------------------
# read text with all details into this list
spans = []
text = page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT, clip=clip)
for block in text["blocks"]:
    for line in block["lines"]:
        for span in line["spans"]:
            spans.append(span)  # is text dict within whatever cell
spans.sort(key=lambda s: (s["bbox"][3], s["bbox"][0]))
def getcoord(bbox, text):
    """Find row / col index for given text rect."""
    I = -1  # col index
    J = -1  # row index
    for i in range(len(vert) - 1):
        if vert[i] <= bbox.x0 < bbox.x1 <= vert[i + 1]:
            I = i
            break
    for j in range(len(hori) - 1):
        if hori[j] <= bbox.y0 < bbox.y1 <= hori[j + 1]:
            J = j
            break
    if I < 0 or J < 0:  # shouldn't happen: correct cell not found
        raise ValueError(J, I, "PROBLEM", text)
    return J, I  # row, col index
# put the text pieces into the cells
for s in spans:
    j, i = getcoord(fitz.Rect(s["bbox"]), s["text"])
    cells[j][i] += s["text"]  # append to stuff already in that cell
# -------------------------------------------------------------------------
# Step 3: Output the CSV file
# -------------------------------------------------------------------------
with open("table.csv", "w") as out:  # file is closed automatically
    for line in cells:
        out.write(";".join(line) + "\n")
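The cell-assignment logic of Steps 1 to 3 can be tried without a PDF. The following is a minimal sketch under stated assumptions, not the script author's code: it uses plain `(x0, y0, x1, y1)` tuples and made-up coordinates instead of PyMuPDF objects, and the helper names `cell_index` and `fill_cells` are hypothetical. It differs from the script in two ways: it joins the lines of a wrapped cell with a space instead of concatenating them directly, and it writes the output with the standard `csv` module so that a cell containing the `;` delimiter is quoted.

```python
import csv
import io
from bisect import bisect_right


def cell_index(borders, lo, hi):
    """Return i with borders[i] <= lo and hi <= borders[i + 1], else -1."""
    i = bisect_right(borders, lo) - 1  # last border at or before lo
    if 0 <= i < len(borders) - 1 and hi <= borders[i + 1]:
        return i
    return -1


def fill_cells(vert, hori, spans, sep=" "):
    """Place (bbox, text) spans into a rows x cols grid of strings."""
    cells = [[""] * (len(vert) - 1) for _ in range(len(hori) - 1)]
    # Sort by bottom edge, then left edge, so wrapped lines stay in order.
    for (x0, y0, x1, y1), text in sorted(spans, key=lambda s: (s[0][3], s[0][0])):
        col = cell_index(vert, x0, x1)
        row = cell_index(hori, y0, y1)
        if row < 0 or col < 0:
            raise ValueError(f"no cell found for {text!r}")
        cells[row][col] = (cells[row][col] + sep + text).strip()
    return cells


# Hypothetical sample data: a 2 x 2 table whose second row is taller.
vert = [0.0, 100.0, 200.0]  # sorted column borders (x)
hori = [0.0, 20.0, 60.0]    # sorted row borders (y)
spans = [
    ((5, 2, 60, 18), "Name"),
    ((105, 2, 150, 18), "Value"),
    ((5, 42, 80, 58), "name; continued"),  # second line of a wrapped cell
    ((5, 22, 95, 38), "a_long_variable"),
    ((105, 22, 120, 38), "42"),
]
cells = fill_cells(vert, hori, spans)

buf = io.StringIO()
csv.writer(buf, delimiter=";", lineterminator="\n").writerows(cells)
print(buf.getvalue())
```

Both lines of the wrapped cell end up in the same row of the output, which is the behaviour the question asks for. With real PDFs, border coordinates that differ by tiny floating-point amounts may need to be rounded or merged before they are used as `vert` and `hori`.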