
How to extract images from PDF or Word, together with the text around images?

I found that there are some libraries for extracting images from PDF or Word files, such as docx2txt and pdfimages. But how can I get the content around each image (for example, a title below the image)? Or the page number of each image?

Some other tools, such as PyPDF2 and minecart, can extract images page by page. However, I could not get that code to run successfully.

Is there a good way to get information about the images, either starting from the images produced by docx2txt or pdfimages, or through another tool that extracts each image together with its information?

I looked at the code of docx2txt, and it simply parses the XML inside the docx file. So this is actually a fairly easy task.

Ref: docx2txt

docx2python pulls the images into a folder and leaves -----image1.png---- markers in the extracted text. This might get you close to where you'd like to go.

A few months ago, I reworked docx2python to reproduce a structured (leveled) XML file from a docx file, which works quite well on many files.

As far as I know, a paragraph contains several runs, and each run contains a single piece of text and sometimes images. See this document for details: https://docs.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.paragraph?view=openxml-2.8.1
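To illustrate the paragraph/run structure, here is a minimal sketch that uses only the Python standard library to walk the paragraphs of word/document.xml and report, for each embedded image, the text of the paragraph before it, the paragraph it sits in, and the paragraph after it. The function names (read_document_xml, paragraphs_with_images, image_context) are made up for this example, and the sketch assumes images are referenced through a:blip elements with an r:embed attribute; paragraphs nested inside tables or text boxes are not treated specially.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used inside word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def read_document_xml(docx_path):
    """A .docx file is a zip archive; the body lives in word/document.xml."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml")


def paragraphs_with_images(document_xml):
    """Return one (text, [relationship ids]) tuple per w:p, in document order."""
    root = ET.fromstring(document_xml)
    result = []
    for para in root.iter(f"{{{W}}}p"):
        # Text is stored in w:t elements inside the runs (w:r) of a paragraph.
        text = "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))
        # Embedded pictures are referenced by a:blip elements via r:embed.
        rids = [
            blip.get(f"{{{R}}}embed")
            for blip in para.iter(f"{{{A}}}blip")
            if blip.get(f"{{{R}}}embed") is not None
        ]
        result.append((text, rids))
    return result


def image_context(document_xml):
    """For each image reference, collect the neighbouring paragraph text."""
    paras = paragraphs_with_images(document_xml)
    out = []
    for i, (text, rids) in enumerate(paras):
        for rid in rids:
            out.append({
                "rid": rid,
                "before": paras[i - 1][0] if i > 0 else "",
                "inline": text,
                "after": paras[i + 1][0] if i + 1 < len(paras) else "",
            })
    return out
```

Calling image_context(read_document_xml("report.docx")) on a real file (the file name is only a placeholder) gives relationship ids such as rId5; these are resolved to actual file names under word/media/ through word/_rels/document.xml.rels inside the same archive.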

docx2python supports extracting images together with the text around them. When you read paragraphs with docx2python, an image placeholder such as ----media/imagen---- appears in the text at the position of each image. You can then get the image itself if you set extract_image=True . You end up with the image names inside the paragraph text and a list of image files. Match them as you like.
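The matching step could look roughly like the sketch below. It does not call docx2python itself; it only assumes you already have the extracted text split into a list of paragraph strings, and that placeholders look like ----media/image1.png---- or -----image1.png---- (both forms are mentioned in the answers here). The names PLACEHOLDER, nearest_text and images_with_neighbors are made up for this example.

```python
import re

# Four or more dashes, an optional "media/" prefix, an image file name,
# then four or more dashes again.
PLACEHOLDER = re.compile(r"-{4,}(?:media/)?(image\d+\.\w+)-{4,}")


def nearest_text(paragraphs, start, step):
    """Walk from `start` in direction `step` (-1 or +1) and return the first
    paragraph that still has text once placeholders are removed."""
    i = start + step
    while 0 <= i < len(paragraphs):
        cleaned = PLACEHOLDER.sub("", paragraphs[i]).strip()
        if cleaned:
            return cleaned
        i += step
    return ""


def images_with_neighbors(paragraphs):
    """Pair each image placeholder with the text around it."""
    found = []
    for i, para in enumerate(paragraphs):
        for match in PLACEHOLDER.finditer(para):
            found.append({
                "image": match.group(1),
                # Text sharing the paragraph with the image, if any.
                "same_paragraph": PLACEHOLDER.sub("", para).strip(),
                "before": nearest_text(paragraphs, i, -1),
                "after": nearest_text(paragraphs, i, +1),
            })
    return found


# Assumed way to obtain the paragraphs; check the docx2python version you
# have installed, since its arguments have changed between releases:
#   from docx2python import docx2python
#   paragraphs = docx2python("report.docx", "images/").text.split("\n\n")
```

The "after" field is usually where a caption below the image ends up, and the "image" field is the file name to look up in the image folder.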
