
How to detect whether a PDF is one-column or two-column in Python

I work in NLP with hundreds of PDFs. Because there is no single way of laying out a PDF, I had to write a script to clean them up. Almost all of that script deals with two-column PDFs (with tables and other odd content), and when I feed it a one-column PDF the output gets messed up. Is there a way to detect whether a PDF is one-column or two-column, so that I can run the fixing script only on the two-column ones?

This is what the PDFs look like
One column PDF
Two column PDF

Disclaimer: I am the author of borb, the library used in this answer.

borb has several classes that process PDF documents. These all implement EventListener. The idea is that they listen while the PDF syntax is being processed and react to events (e.g. an image has been rendered, a string has been rendered, a new page has started).

One of these implementations is SimpleParagraphExtraction. It uses geometric information to decide which pieces of text belong together: which ones form a line of text, and which lines together form a paragraph.

This is how you'd use it:


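        from decimal import Decimal
        import typing
        from borb.pdf import Document
        from borb.pdf.pdf import PDF
        from borb.toolkit.text.simple_paragraph_extraction import SimpleParagraphExtraction
        # read document; the listener receives events while the PDF is parsed
        l: SimpleParagraphExtraction = SimpleParagraphExtraction(maximum_multiplied_leading=Decimal(1.7))
        doc: typing.Optional[Document] = None
        with open("input.pdf", "rb") as pdf_file_handle:
            doc = PDF.loads(pdf_file_handle, [l])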

Once you've processed the PDF, you can now do something with the paragraphs you've detected.

        for p in l.get_paragraphs_for_page(0):
            doc.get_page(0).add_annotation(
                SquareAnnotation(p.get_bounding_box(), stroke_color=HexColor("f1cd2e"))
            )

The above code adds a colored rectangle around each paragraph.

You can easily modify this code to count how many paragraphs appear side by side, which should help you determine whether a page uses a single- or multi-column layout.
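As a rough illustration of "paragraphs side by side", here is a minimal sketch that works on plain (x, y, width, height) tuples rather than on borb objects. The `Box` alias and the `side_by_side_fraction` helper are hypothetical names introduced for this example, not part of borb. Two boxes count as side by side when their vertical extents overlap and their horizontal extents do not.

```python
from typing import List, Tuple

# Hypothetical stand-in for a paragraph bounding box: (x, y, width, height)
Box = Tuple[float, float, float, float]


def side_by_side_fraction(boxes: List[Box]) -> float:
    """Return the fraction of boxes that have at least one horizontal neighbour."""

    def beside(a: Box, b: Box) -> bool:
        ax, ay, aw, ah = a
        bx, by, bw, bh = b
        # the two boxes share at least part of their vertical range
        overlaps_vertically = ay < by + bh and by < ay + ah
        # one box lies entirely to the left of the other
        disjoint_horizontally = ax + aw <= bx or bx + bw <= ax
        return overlaps_vertically and disjoint_horizontally

    if not boxes:
        return 0.0
    with_neighbour = sum(
        1
        for i, a in enumerate(boxes)
        if any(beside(a, b) for j, b in enumerate(boxes) if i != j)
    )
    return with_neighbour / len(boxes)


# made-up coordinates: two stacked paragraphs vs. two columns of two paragraphs
single = [(50, 600, 500, 100), (50, 450, 500, 120)]
double = [(50, 600, 240, 100), (310, 600, 240, 100),
          (50, 450, 240, 120), (310, 430, 240, 140)]

print(side_by_side_fraction(single))
print(side_by_side_fraction(double))
```

A fraction near 0 suggests a single column, and a fraction near 1 suggests that most paragraphs have a neighbour in another column. The pairwise comparison is quadratic in the number of paragraphs, which should be acceptable for a single page.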

edit: This is a quick write-up I did:

from pathlib import Path
import typing
from borb.pdf.pdf import PDF
from borb.toolkit.text.simple_paragraph_extraction import SimpleParagraphExtraction
from borb.pdf.canvas.layout.annotation.square_annotation import SquareAnnotation
from borb.pdf import HexColor
from borb.pdf import Paragraph
from decimal import Decimal
import requests

# download the two example PDFs from the question (file handles are closed explicitly)
with open("example_001.pdf", "wb") as fh:
  fh.write(requests.get("https://github.com/Da-vid21/Outputs/raw/main/BarCvDescLJ11.pdf").content)
with open("example_002.pdf", "wb") as fh:
  fh.write(requests.get("https://github.com/Da-vid21/Outputs/raw/main/Bill-Brown-Reprint.pdf").content)

# open PDF (the listener collects paragraphs while the document is parsed)
l: SimpleParagraphExtraction = SimpleParagraphExtraction(maximum_multiplied_leading=Decimal(1.6))
with open("example_002.pdf", "rb") as fh:
  doc = PDF.loads(fh, [l])

# build histogram (number of paragraphs per y-coordinate)
ps: typing.List[Paragraph] = l.get_paragraphs_for_page(0)
h: typing.Dict[int, int] = {}
for p in ps:
  y0: int = int(p.get_bounding_box().get_y())
  y1: int = int(y0 + p.get_bounding_box().get_height())
  for y in range(y0, y1):
    h[y] = h.get(y, 0) + 1

# display the average number of paragraphs per occupied y-coordinate
avg_paras_per_y: float = sum(h.values()) / len(h)
print(avg_paras_per_y)

This outputs:

1.5903010033444815

On average, the first page of your two-column document has about 1.6 paragraphs per y-coordinate. A page whose paragraphs never sit next to each other would score exactly 1, so this value suggests a two-column layout.
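To turn that average into a yes/no decision, so that the fixing script runs only on two-column files, one option is to compare it against a cut-off. The sketch below repeats the histogram logic on plain (x, y, width, height) tuples. The `looks_two_column` helper and the 1.3 threshold are assumptions made for illustration, not something borb provides, and the threshold would need tuning on your own documents.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a paragraph bounding box: (x, y, width, height)
Box = Tuple[float, float, float, float]


def average_paragraphs_per_y(boxes: List[Box]) -> float:
    """Average number of paragraphs covering each occupied integer y-coordinate."""
    histogram: Dict[int, int] = {}
    for _, y, _, height in boxes:
        y0 = int(y)
        y1 = int(y0 + height)
        for row in range(y0, y1):
            histogram[row] = histogram.get(row, 0) + 1
    if not histogram:
        return 0.0  # no paragraphs detected on this page
    return sum(histogram.values()) / len(histogram)


def looks_two_column(boxes: List[Box], threshold: float = 1.3) -> bool:
    """Assumed rule of thumb: an average above the threshold means multi-column."""
    return average_paragraphs_per_y(boxes) >= threshold


# made-up coordinates: two stacked paragraphs vs. two columns of two paragraphs
single = [(50, 600, 500, 100), (50, 450, 500, 120)]
double = [(50, 600, 240, 100), (310, 600, 240, 100),
          (50, 450, 240, 120), (310, 430, 240, 140)]

print(looks_two_column(single))
print(looks_two_column(double))
```

With borb, the tuples could be built from each paragraph's bounding box, using get_y() and get_height() as in the write-up above (the x and width values are not used by this check). Full-width elements such as a title or abstract pull the average below 2, which is consistent with the 1.59 measured above, so a threshold somewhere between 1 and 2 is a judgment call.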


 