Extract Text in Natural reading order using pymupdf (fitz)

Original 2022-12-20 02:33:36 8 1 python/ pdf/ text-extraction/ pymupdf

I am trying to extract text with PyMuPDF (fitz), following this tutorial: https://towardsdatascience.com/extracting-headers-and-paragraphs-from-pdf-using-pymupdf-676e8421c467

Instead of blocks = page.getText("dict")["blocks"]

I wrote blocks = page.get_text("dict", sort=True)["blocks"]

as described in https://pymupdf.readthedocs.io/en/latest/recipes-text.html

The text still does not come out in the order I expect: the first paragraph appears in the middle of the output.

This happens when a page has more than one column of text.

1 answer

You made a good first step by using the sort argument. Note, however, that a PDF can position every single character separately, so any basic sorting approach can fail given the "right" counterexample PDF.
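To illustrate how a simple positional sort can go wrong on a two-column page, here is a minimal sketch on synthetic data. The block tuples and coordinates are invented, and the sort key (bottom edge, then left edge) is only an assumption about what a basic vertical-then-horizontal sort looks like. It is not a statement of PyMuPDF's exact implementation.

```python
# Synthetic text blocks as (x0, y0, x1, y1, text); not taken from a real PDF.
blocks = [
    (50, 100, 280, 200, "left column, paragraph 1"),
    (50, 210, 280, 400, "left column, paragraph 2"),
    (310, 100, 540, 150, "right column, paragraph 1"),
    (310, 160, 540, 400, "right column, paragraph 2"),
]


def naive_sort(blocks):
    """Order blocks by bottom edge (y1), then by left edge (x0)."""
    return sorted(blocks, key=lambda b: (b[3], b[0]))


naive_order = [b[4] for b in naive_sort(blocks)]
for text in naive_order:
    print(text)
```

With these made-up coordinates, the short first paragraph of the right column ends highest on the page, so it is emitted before the first paragraph of the left column. The columns get interleaved.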

If a page contains n text characters, there are n! different orders in which they can be stored in the file. All of them look identical when rendered, but only one yields the "natural" reading sequence when extracted as-is.
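To get a feeling for how quickly n! grows, a few lines of arithmetic are enough. The function name below is purely illustrative.

```python
import math


def possible_storage_orders(n_chars):
    """Number of orderings of n_chars separately positioned characters (n!)."""
    return math.factorial(n_chars)


for n in (3, 5, 10, 20):
    print(n, possible_storage_orders(n))
```

Even a 20-character line already allows more than 10^18 orderings, so a real page cannot be handled by trying orderings. The positions have to be used instead.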

If your page contains tables, or if the text is organized in multiple columns (as is customary in newspapers), then you need additional logic to cope with that.
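One possible form of such additional logic, for the multi-column case only, is to assign each block to a column first and sort within columns afterwards. The sketch below assumes equal-width columns and no full-width elements such as headlines or tables. The helper names sort_blocks_by_column and extract_column_text are hypothetical, not part of PyMuPDF.

```python
def sort_blocks_by_column(blocks, page_width, n_columns=2):
    """Sort (x0, y0, x1, y1, text, ...) tuples column by column.

    Assumption: the page is split into n_columns equally wide columns.
    """
    col_width = page_width / n_columns

    def key(block):
        x0, y0, x1, _y1 = block[:4]
        center = (x0 + x1) / 2
        # Clamp so that blocks touching the page edges still get a valid column.
        col = max(0, min(int(center // col_width), n_columns - 1))
        return (col, y0, x0)

    return sorted(blocks, key=key)


def extract_column_text(pdf_path, page_number=0, n_columns=2):
    """Usage sketch with PyMuPDF; imported lazily so the helper above stays standalone."""
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        page = doc[page_number]
        # get_text("blocks") yields (x0, y0, x1, y1, text, block_no, block_type).
        blocks = [b for b in page.get_text("blocks") if b[6] == 0]  # text blocks only
        ordered = sort_blocks_by_column(blocks, page.rect.width, n_columns)
        return "\n".join(b[4] for b in ordered)
```

Pages that mix full-width headings with columns, or whose columns have unequal widths, would need a smarter column detection than this fixed split.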

If you run PyMuPDF as a module from the command line, you can extract text in a layout-preserving manner: python -m fitz gettext -mode layout...
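If the command-line route is acceptable, it can also be driven from a script. The following is a small sketch that only builds and runs the command. The file name input.pdf and both function names are placeholders, and it is worth checking python -m fitz gettext -h for the options available in your installed version.

```python
import subprocess
import sys


def build_gettext_command(pdf_path, mode="layout"):
    """Return the argument list for PyMuPDF's command-line text extraction."""
    return [sys.executable, "-m", "fitz", "gettext", "-mode", mode, pdf_path]


def run_gettext(pdf_path, mode="layout"):
    """Run the extraction; needs PyMuPDF installed, raises CalledProcessError on failure."""
    subprocess.run(build_gettext_command(pdf_path, mode), check=True)


# Hypothetical file name, for illustration only.
print(" ".join(build_gettext_command("input.pdf")[1:]))
```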

If you need to achieve a similar effect within your script, you may need character-level text extraction, page.get_text("rawdict"), and then use the returned character positions to put the characters in the right sequence.
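A rough sketch of that idea follows: collect every character with its origin from the "rawdict" structure, group characters whose baselines are close together into lines, and sort each line from left to right. The helper names and the y_tolerance value are assumptions chosen for illustration. The sketch does not handle columns or rotated text, so it would have to be combined with column logic for pages like the one in the question.

```python
def iter_chars(page_dict):
    """Yield (x, y, char) for every character in a "rawdict" page dictionary."""
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # skip image blocks
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                for ch in span["chars"]:
                    x, y = ch["origin"]  # left end of the character's baseline
                    yield x, y, ch["c"]


def chars_to_lines(chars, y_tolerance=3.0):
    """Group characters into text lines by baseline, then sort each line by x."""
    lines = []  # each entry: [baseline_y_of_first_char, [(x, char), ...]]
    for x, y, c in sorted(chars, key=lambda t: (t[1], t[0])):
        if lines and abs(y - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append((x, c))
        else:
            lines.append([y, [(x, c)]])
    return [
        "".join(c for _x, c in sorted(row, key=lambda t: t[0]))
        for _y, row in lines
    ]


# With PyMuPDF this would be used roughly as:
#   page_dict = page.get_text("rawdict")
#   text_lines = chars_to_lines(iter_chars(page_dict))
```

The tolerance decides when two characters count as being on the same line. A value that is too large merges neighbouring lines, and one that is too small splits lines containing superscripts or mixed font sizes.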

Python PyMuPDF Fitz insertImage

how to extract text from a selection of pages in a larger pdf using pymupdf?

Using python PyMuPDF (fitz) to iterate through lines and check length of it and add a period if it meets the criteria

fitz.open() not working when in a for loop (FITZ,PYTHON,PYMUPDF)

Python PyMuPDF / Fitz rotates image from extractImage

Saving a pymupdf fitz object to s3 as a pdf

adding text to a pdf using PyMuPDF

how to delete a text layer using fitz?

Can a text be searched Blockwise in a PDF using PyMuPDF?

Delete text from pdf using PyMUPDF


The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.


solution1 1 2022-12-26 02:10:19
