How to detect colored blocks in a PDF file using Python (pdfminer, minecart, tabula ...)

Question

I am trying to extract quite a few tables from a PDF file. These tables are conveniently "highlighted" with different colors, which makes them easy to spot by eye (see the example screenshot).

I think it would be good to detect the position/coordinates of those colored blocks and then use the coordinates to extract the tables.

I have figured out the table extraction part (using tabula-py), so it is the first step that is stopping me. From what I have gathered, minecart is the best tool for colors and shapes in PDF files, short of full-scale image processing with OpenCV. But I have had no luck detecting the coordinates of the colored boxes/blocks.

Would appreciate any help!

Example page 1

Answer 1

I think I got a solution:

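import minecart

fn = "example.pdf"  # placeholder: path to your PDF file
page_num = 0        # placeholder: index of the page to inspect

with open(fn, 'rb') as pdffile:
    doc = minecart.Document(pdffile)
    page = doc.get_page(page_num)  # page_num is 0-based

    # (0, 0, 612, 792) covers a full US Letter page, in points
    for shape in page.shapes.iter_in_bbox((0, 0, 612, 792)):
        if shape.fill:  # skip shapes that have no fill color
            shape_bbox = shape.get_bbox()
            shape_color = shape.fill.color.as_rgb()
            print(shape_bbox, shape_color)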

I would then need to filter by color or shape size...

My earlier failure was due to having used the wrong page number :(
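The filtering step mentioned above could look roughly like the sketch below. It is not part of minecart: `is_table_block`, the `min_size` and `white_tol` thresholds, and the sample `candidates` list are all hypothetical, and it assumes the RGB components are floats in the 0 to 1 range, which is worth checking against the values printed by the loop.

```python
def is_table_block(bbox, rgb, min_size=20.0, white_tol=0.02):
    """Return True for a sizable filled shape whose color is not (near) white.

    bbox is (x0, y0, x1, y1) in points; rgb is assumed to be three floats in 0..1.
    """
    x0, y0, x1, y1 = bbox
    width, height = abs(x1 - x0), abs(y1 - y0)
    if width < min_size or height < min_size:
        return False  # too small to be a table background
    # Treat near-white fills as page background rather than highlighting.
    return any(channel < 1.0 - white_tol for channel in rgb)


# Hypothetical sample data in the (bbox, color) form printed by the loop.
candidates = [
    ((72.0, 500.0, 540.0, 700.0), (0.8, 0.9, 1.0)),  # large light-blue block
    ((72.0, 480.0, 80.0, 488.0), (1.0, 0.0, 0.0)),   # tiny red marker
    ((0.0, 0.0, 612.0, 792.0), (1.0, 1.0, 1.0)),     # white page background
]

blocks = [(bbox, rgb) for bbox, rgb in candidates if is_table_block(bbox, rgb)]
print(blocks)
```

In practice the thresholds would need tuning per document; if the tables use a known set of highlight colors, comparing `rgb` against those colors directly may be more robust than excluding white.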

Answer 2

PyMuPDF lets you extract so-called "line art": the vector drawings on a page. This is a list of dictionaries of "paths" (as PDF calls interconnected drawings), from which you can select the ones of interest to you. For example, the following identifies drawings that represent filled rectangles that are not too small:

import fitz  # PyMuPDF

doc = fitz.open("input.pdf")  # placeholder: path to your PDF file
page = doc[0]  # load some page (here page 0)
paths = page.get_drawings()  # extract all vector graphics
filled_rects = []  # filled rectangles without border land here
for path in paths:
    if path["type"] != "f":  # only consider fill-only paths (no stroked border)
        continue
    rect = path["rect"]
    if rect.width < 20 or rect.height < 20:  # only consider sizable rects
        continue
    filled_rects.append(rect)  # hopefully an area coloring a table
# make a visible border around the hits to see success:
for rect in filled_rects:
    page.draw_rect(rect, color=fitz.pdfcolor["red"])
doc.save("debug.pdf")
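To hand the detected rectangles to tabula-py, the coordinates probably need reordering, since its `area` option expects (top, left, bottom, right) measured from the top of the page. A minimal sketch, assuming PyMuPDF rectangles already use a top-left origin while minecart bounding boxes use PDF's bottom-left origin; `rect_to_tabula_area` and the sample numbers are hypothetical, and rotated or cropped pages are not handled:

```python
def rect_to_tabula_area(rect, page_height=None):
    """Convert (x0, y0, x1, y1) into a (top, left, bottom, right) tuple.

    Leave page_height as None when rect already uses a top-left origin
    (assumed for PyMuPDF). Pass the page height in points when rect uses
    a bottom-left origin (assumed for minecart), so the y axis is flipped.
    """
    x0, y0, x1, y1 = rect
    left, right = min(x0, x1), max(x0, x1)
    low_y, high_y = min(y0, y1), max(y0, y1)
    if page_height is None:
        top, bottom = low_y, high_y
    else:
        top, bottom = page_height - high_y, page_height - low_y
    return (top, left, bottom, right)


# Top-left origin input: only the order of the values changes.
print(rect_to_tabula_area((72, 100, 540, 300)))

# Bottom-left origin input on a 792 pt tall page: y values are flipped.
print(rect_to_tabula_area((72, 500, 540, 700), page_height=792))
```

The resulting tuple could then be passed along the lines of `tabula.read_pdf(fn, pages=1, area=area)`; note that tabula-py counts pages from 1, unlike the 0-based page indices used above.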
