
How to detect whether a PDF is one-column or two-column in Python

I work in NLP with hundreds of PDFs. Since there is no single way of writing a PDF, I have to write a script to handle the differences. Almost all of that script targets two-column PDFs with tables and other odd elements, so when I feed it a one-column PDF, the output gets messed up. Is there any way to detect whether a PDF is one- or two-column, and then run the fixing script only on the two-column ones? Please help me out with this.

This is what the PDFs look like:
One-column PDF
Two-column PDF

Disclaimer: I am the author of borb, the library used in this answer.

borb has several classes that process PDF documents. These all implement EventListener. The idea is that they listen to the processing of PDF syntax and handle events (e.g. an image has been rendered, a string has been rendered, a new page has started).

One of these implementations is SimpleParagraphExtraction. It attempts to use geometric information to determine which text should be separated from other text, when something makes up a line of text, and when several lines make up a paragraph.

This is how you'd use it:


        # read document
        l: SimpleParagraphExtraction = SimpleParagraphExtraction(maximum_multiplied_leading=Decimal(1.7))
        doc: typing.Optional[Document] = None
        with open("input.pdf", "rb") as pdf_file_handle:
            doc = PDF.loads(pdf_file_handle, [l])
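
To give some intuition for what a parameter like maximum_multiplied_leading might control, here is a toy sketch of leading-based grouping. It is not borb's implementation: the `Line` record, the function name, and the exact rule (two lines belong together when their baseline distance is at most the multiplier times the font size) are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical line record: (baseline_y, font_size). This is not a borb type.
Line = Tuple[float, float]


def group_lines_into_paragraphs(
    lines: List[Line], max_multiplied_leading: float = 1.7
) -> List[List[Line]]:
    """Group text lines into paragraphs by the distance between baselines."""
    paragraphs: List[List[Line]] = []
    # PDF y-coordinates grow upward, so sort descending to walk top to bottom.
    for line in sorted(lines, key=lambda item: -item[0]):
        if paragraphs:
            prev_y, prev_size = paragraphs[-1][-1]
            # Assumed rule: same paragraph if the gap fits within the leading.
            if prev_y - line[0] <= max_multiplied_leading * prev_size:
                paragraphs[-1].append(line)
                continue
        paragraphs.append([line])
    return paragraphs


# Five 10pt lines: three spaced 12pt apart, a 36pt gap, then two more.
lines = [(700.0, 10.0), (688.0, 10.0), (676.0, 10.0), (640.0, 10.0), (628.0, 10.0)]
paragraphs = group_lines_into_paragraphs(lines)
print([len(p) for p in paragraphs])
```

With this rule, a larger multiplier merges more lines into one paragraph and a smaller one splits them more eagerly, which is why the value may need tuning per document set.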

Once you've processed the PDF, you can do something with the paragraphs you've detected.

        for p in l.get_paragraphs_for_page(0):
            doc.get_page(0).add_annotation(
                SquareAnnotation(p.get_bounding_box(), stroke_color=HexColor("f1cd2e"))
            )

The above code adds a colored rectangle around each paragraph.

You can easily modify this code to determine how many paragraphs appear side by side, which should help you determine whether a page has a single- or multi-column layout.
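
As a minimal sketch of the side-by-side idea, the snippet below counts pairs of paragraph boxes that overlap vertically but not horizontally. It works on plain `(x, y, width, height)` tuples so it runs without borb. In practice you would fill these from each paragraph's get_bounding_box(); get_x() and get_width() are assumed here to exist alongside the get_y() and get_height() used in this answer. The function name and the sample boxes are made up for illustration.

```python
from typing import List, Tuple

# (x, y, width, height), with (x, y) the lower-left corner of the box.
Box = Tuple[float, float, float, float]


def side_by_side_pairs(boxes: List[Box]) -> int:
    """Count pairs of boxes that share a vertical range but sit in different columns."""
    pairs = 0
    for i, (x0, y0, w0, h0) in enumerate(boxes):
        for x1, y1, w1, h1 in boxes[i + 1:]:
            overlaps_vertically = y0 < y1 + h1 and y1 < y0 + h0
            disjoint_horizontally = x0 + w0 <= x1 or x1 + w1 <= x0
            if overlaps_vertically and disjoint_horizontally:
                pairs += 1
    return pairs


# Made-up layouts: full-width paragraphs vs. two columns of paragraphs.
one_column = [(50, 600, 500, 100), (50, 450, 500, 120), (50, 300, 500, 100)]
two_column = [
    (50, 600, 240, 100), (310, 600, 240, 100),
    (50, 450, 240, 120), (310, 480, 240, 90),
]
print(side_by_side_pairs(one_column))
print(side_by_side_pairs(two_column))
```

A page with no side-by-side pairs is most likely single-column. Tables and figure captions can produce a few stray pairs, so a small tolerance may be needed on real documents.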

Edit: this is a quick write-up I did:

from pathlib import Path
import typing
from borb.pdf.pdf import PDF
from borb.toolkit.text.simple_paragraph_extraction import SimpleParagraphExtraction
from borb.pdf.canvas.layout.annotation.square_annotation import SquareAnnotation
from borb.pdf import HexColor
from borb.pdf import Paragraph
from decimal import Decimal
import requests

# download the two example PDFs from the question
Path("example_001.pdf").write_bytes(requests.get("https://github.com/Da-vid21/Outputs/raw/main/BarCvDescLJ11.pdf").content)
Path("example_002.pdf").write_bytes(requests.get("https://github.com/Da-vid21/Outputs/raw/main/Bill-Brown-Reprint.pdf").content)

# open PDF
l: SimpleParagraphExtraction = SimpleParagraphExtraction(maximum_multiplied_leading=Decimal(1.6))
with open("example_002.pdf", "rb") as fh:
  doc = PDF.loads(fh, [l])

# build histogram (number of paragraphs per y-coordinate)
ps: typing.List[Paragraph] = l.get_paragraphs_for_page(0)
h: typing.Dict[int, int] = {}
for p in ps:
  y0: int = int(p.get_bounding_box().get_y())
  y1: int = int(y0 + p.get_bounding_box().get_height())
  for y in range(y0, y1):
    h[y] = h.get(y, 0) + 1

# display the average number of paragraphs per y-coordinate
avg_paras_per_y: float = sum(h.values()) / len(h)
print(avg_paras_per_y)

This outputs:

1.5903010033444815

On average, your two-column document has 1.6 paragraphs per y-coordinate. That would seem to indicate it's a two-column layout.
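
To run the fixing script only on two-column documents, the average could be turned into a yes/no decision. The sketch below repeats the histogram logic on plain `(y0, y1)` spans so it runs without borb. The function names and the 1.3 threshold are assumptions, not part of borb: a single-column page should score close to 1.0 and the two-column example above scored about 1.59, so any cutoff between those is a guess that should be tuned on your own PDFs.

```python
from typing import Dict, List, Tuple

# (y0, y1): vertical extent of one paragraph, as integers.
Span = Tuple[int, int]


def average_paragraphs_per_y(spans: List[Span]) -> float:
    """Average number of paragraphs covering each occupied y-coordinate."""
    histogram: Dict[int, int] = {}
    for y0, y1 in spans:
        for y in range(y0, y1):
            histogram[y] = histogram.get(y, 0) + 1
    if not histogram:
        return 0.0  # no paragraphs detected on this page
    return sum(histogram.values()) / len(histogram)


def looks_two_column(spans: List[Span], threshold: float = 1.3) -> bool:
    """Hypothetical decision rule; the threshold is an assumption to tune."""
    return average_paragraphs_per_y(spans) >= threshold


# Made-up spans: stacked paragraphs vs. paragraphs sharing vertical ranges.
one_column_spans = [(600, 700), (450, 570), (300, 400)]
two_column_spans = [(600, 700), (600, 700), (450, 570), (480, 570)]
print(looks_two_column(one_column_spans))
print(looks_two_column(two_column_spans))
```

Checking several pages rather than only page 0 may make the decision more robust, since title pages and pages dominated by tables or figures can skew the average.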
