
Extract Text in Natural reading order using pymupdf (fitz)

I am trying to extract text using PyMuPDF (fitz), following this tutorial: https://towardsdatascience.com/extracting-headers-and-paragraphs-from-pdf-using-pymupdf-676e8421c467

Instead of blocks = page.getText("dict")["blocks"]

I wrote blocks = page.get_text("dict", sort=True)["blocks"]

as described in https://pymupdf.readthedocs.io/en/latest/recipes-text.html

But the text still does not come out in the order I expect: the first paragraph appears in the middle.

This happens when a page has more than one column of text.

Using the sort argument is a good first step. But please note that a PDF can position every single character separately, so any basic sorting approach can be defeated by a suitably constructed PDF.

If a page contains n text characters, then there exist n! different ways to encode the page - all of them looking identical, but only one of them extracting the "natural" reading sequence right away.

If your page contains tables, or if the text is organized in multiple columns (as is customary in newspapers), then you must invest additional logic to cope with that.
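As a rough illustration of what such additional logic might look like, here is a minimal sketch that reorders text blocks column by column. It assumes the page has equally wide columns with no full-width headings or tables, which real documents often violate. The helper names sort_blocks_by_column and extract_column_text are made up for this example; only page.get_text("blocks") and its tuple layout (x0, y0, x1, y1, text, block_no, block_type) come from PyMuPDF.

```python
def sort_blocks_by_column(blocks, page_width, n_columns=2):
    """Order block tuples column by column, top to bottom within a column.

    Assumption: the page is split into n_columns equally wide columns.
    Each block is a tuple starting with (x0, y0, x1, y1, ...).
    """
    col_width = page_width / n_columns

    def key(block):
        x0, y0, x1, _y1 = block[:4]
        center_x = (x0 + x1) / 2
        # Column index from the block's horizontal center, clamped to range.
        col = min(max(int(center_x // col_width), 0), n_columns - 1)
        return (col, y0, x0)

    return sorted(blocks, key=key)


def extract_column_text(pdf_path, page_number=0, n_columns=2):
    """Hypothetical wrapper: read one page and join its blocks in column order."""
    import fitz  # PyMuPDF, imported here so the sorter above stays dependency-free

    doc = fitz.open(pdf_path)
    page = doc[page_number]
    # Keep text blocks only (block_type 0); image blocks have block_type 1.
    blocks = [b for b in page.get_text("blocks") if b[6] == 0]
    ordered = sort_blocks_by_column(blocks, page.rect.width, n_columns)
    return "\n".join(b[4] for b in ordered)
```

If a page mixes full-width headings with columns, a block's center no longer identifies its column reliably, and such blocks would need separate handling.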

If you use the PyMuPDF module from the command line, you can extract text in a layout-preserving manner: python -m fitz gettext -mode layout ...

If you need to achieve a similar effect within your script, you may be forced to extract text down to the level of each single character with page.get_text("rawdict"), and then use the returned character positions to bring the characters into the right sequence.
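A minimal sketch of that idea follows. It groups characters into lines by their baseline and sorts each line from left to right. The function names and the y_tolerance value are assumptions made for this example; the nesting blocks / lines / spans / chars and the per-character keys "c" and "origin" are what "rawdict" returns. It does not handle multiple columns, rotated text, or right-to-left scripts.

```python
def chars_from_rawdict(raw):
    """Yield (x, y, char) for every character in a "rawdict" page dictionary."""
    for block in raw.get("blocks", []):
        if block.get("type") != 0:  # skip image blocks
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                for ch in span["chars"]:
                    x, y = ch["origin"]  # baseline start point of the character
                    yield x, y, ch["c"]


def chars_to_lines(chars, y_tolerance=3.0):
    """Rebuild text lines from positioned characters.

    Assumption: characters whose baselines lie within y_tolerance points of
    the first character of a line belong to that line.
    """
    lines = []  # each entry: [baseline_y, [(x, char), ...]]
    for x, y, c in sorted(chars, key=lambda t: (t[1], t[0])):
        if lines and abs(y - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append((x, c))
        else:
            lines.append([y, [(x, c)]])
    # Within each line, order characters by their horizontal position.
    return ["".join(c for _, c in sorted(row, key=lambda t: t[0]))
            for _, row in lines]


# Usage with PyMuPDF (the file name is a placeholder):
#   import fitz
#   page = fitz.open("input.pdf")[0]
#   text_lines = chars_to_lines(chars_from_rawdict(page.get_text("rawdict")))
```

For a multi-column page, the characters would first have to be split by column (for example by x position) before running the line grouping on each column separately.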
