
How to detect colored blocks in a PDF file with python (pdfminer, minecart, tabula...)

I am trying to extract quite a few tables from a PDF file. The tables are conveniently "highlighted" with different colors, which makes them easy to spot by eye (see the example screenshot).

I think it would help to detect the position/coordinates of those colored blocks and then use the coordinates to extract the tables.

I have figured out the table extraction part (using tabula-py), so it is the first step that is stopping me. From what I gathered, minecart is the best tool for working with colors and shapes in PDF files, short of full-scale image processing with OpenCV. But I have had no luck detecting the coordinates of the colored boxes/blocks.

Would appreciate any help!

Example page 1

I think I found a solution:

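import minecart

fn = "input.pdf"  # placeholder: path to your PDF
page_num = 0      # placeholder: page_num is 0-based

with open(fn, "rb") as pdffile:
    doc = minecart.Document(pdffile)
    page = doc.get_page(page_num)

    # (0, 0, 612, 792) covers a full US Letter page, in points
    for shape in page.shapes.iter_in_bbox((0, 0, 612, 792)):
        if shape.fill:  # skip shapes that have no fill color
            shape_bbox = shape.get_bbox()
            shape_color = shape.fill.color.as_rgb()
            print(shape_bbox, shape_color)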

I would then need to filter by color or shape size...
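The filtering step mentioned above could look roughly like the sketch below. It is not part of minecart: `is_table_block` and `filter_blocks` are hypothetical helper names, and the thresholds are guesses to tune for your file. It assumes each shape has been reduced to a `(bbox, rgb)` pair, with `bbox` as `(x0, y0, x1, y1)` in points and `rgb` as three floats in the 0 to 1 range, which is what the `print` call above appears to output.

```python
def is_table_block(bbox, rgb, min_width=20.0, min_height=20.0, white_tol=0.02):
    """Return True if a filled shape looks like a table highlight.

    Assumptions (not from the original post): bbox is (x0, y0, x1, y1)
    in points, rgb holds floats in 0..1, and near-white fills are page
    backgrounds rather than highlights.
    """
    x0, y0, x1, y1 = bbox
    # Drop tiny shapes such as bullets, rules, or decorative marks.
    if (x1 - x0) < min_width or (y1 - y0) < min_height:
        return False
    # Drop (near-)white fills, which usually are not highlights.
    if all(c >= 1.0 - white_tol for c in rgb):
        return False
    return True


def filter_blocks(shapes, **kwargs):
    """Keep only the (bbox, rgb) pairs that pass is_table_block."""
    return [(bbox, rgb) for bbox, rgb in shapes if is_table_block(bbox, rgb, **kwargs)]


if __name__ == "__main__":
    sample = [
        ((0, 0, 612, 792), (1.0, 1.0, 1.0)),      # white page background
        ((50, 600, 300, 700), (0.8, 0.9, 1.0)),   # light blue block
        ((10, 10, 15, 15), (1.0, 0.0, 0.0)),      # tiny red mark
    ]
    for bbox, rgb in filter_blocks(sample):
        print(bbox, rgb)
```

If the tables use a known set of highlight colors, comparing `rgb` against that set would be a stricter test than excluding white.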

My earlier failure was due to using the wrong page number :(

PyMuPDF lets you extract so-called "line art": the vector drawings on a page. This is a list of dictionaries describing "paths" (PDF's term for interconnected drawings), from which you can select the ones of interest. E.g. the following identifies drawings that represent filled rectangles that are not too small:

import fitz  # PyMuPDF

doc = fitz.open("input.pdf")  # placeholder file name
page = doc[0]  # load some page (here page 0)
paths = page.get_drawings()  # extract all vector graphics
filled_rects = []  # filled rectangles without border land here
for path in paths:
    if path["type"] != "f":  # keep fill-only paths ("s" = stroke, "fs" = both)
        continue
    rect = path["rect"]
    if rect.width < 20 or rect.height < 20:  # only consider sizable rects
        continue
    filled_rects.append(rect)  # hopefully an area coloring a table
# draw a visible border around the hits to check the result:
for rect in filled_rects:
    page.draw_rect(rect, color=fitz.pdfcolor["red"])
doc.save("debug.pdf")
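To connect either answer back to the tabula-py step from the question, the detected rectangle has to be reordered into the `area` argument, which tabula-py expects as `(top, left, bottom, right)` in points measured from the top of the page. The sketch below is an assumption-laden helper, not code from either answer: `to_tabula_area` is a hypothetical name, and it assumes PyMuPDF rectangles already use a top-left origin while minecart/pdfminer bounding boxes use a bottom-left origin and therefore need a vertical flip. Page rotation and crop boxes are ignored.

```python
def to_tabula_area(bbox, page_height=None):
    """Convert (x0, y0, x1, y1) into tabula-py's [top, left, bottom, right].

    Assumptions (hypothetical helper, verify against your own PDF):
    - page_height=None: bbox has a top-left origin (e.g. a PyMuPDF Rect).
    - page_height given: bbox has a bottom-left origin (e.g. minecart),
      so the y values are flipped against the page height in points.
    """
    x0, y0, x1, y1 = bbox
    if page_height is None:
        return [y0, x0, y1, x1]
    return [page_height - y1, x0, page_height - y0, x1]


if __name__ == "__main__":
    # Top-left origin, as assumed for a PyMuPDF rect:
    print(to_tabula_area((50, 100, 300, 200)))
    # Bottom-left origin on a US Letter page, as assumed for minecart:
    print(to_tabula_area((50, 600, 300, 700), page_height=792))
    # Possible follow-up with tabula-py (tabula page numbers start at 1):
    #   import tabula
    #   tables = tabula.read_pdf("input.pdf", pages=1, area=area)
```

Drawing the converted area back onto the page, as the debug output above does, is a quick way to check that the origin assumption holds for a given file.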


 