How to extract images from PDF or Word, together with the text around images?

Original 2019-04-09 09:15:48 8 3 python/ shell/ pdf/ ms-word/ image-extraction

I found some libraries for extracting images from PDF or Word files, such as docx2txt and pdfimages. But how can I get the content around each image (for example, there may be a title below the image)? Or get the page number of each image?

Some other tools, like PyPDF2 and minecart, can extract images page by page. However, I could not get that code to run successfully.

Is there a good way to get information about the images, either from the images produced by docx2txt or pdfimages, or through another way of extracting images along with their info?

3 answers

I looked at the code of docx2txt, and it simply parses the XML inside the docx file. So it's actually a very easy task.

Ref: docx2txt
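
To illustrate the "parse the XML" idea, here is a minimal sketch using only the Python standard library. It is not docx2txt's actual code. It assumes the usual docx layout (`word/document.xml` plus `word/_rels/document.xml.rels`), and the function name `images_with_context` is made up for this example. For each embedded image it returns the nearest non-empty paragraph before and after it, which is often where a title or caption sits.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used inside a .docx package
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"


def images_with_context(docx_file):
    """Return (image_target, text_before, text_after) for each embedded image.

    docx_file may be a path or a binary file-like object.
    (Hypothetical helper, not part of any library.)
    """
    with zipfile.ZipFile(docx_file) as z:
        rels = ET.fromstring(z.read("word/_rels/document.xml.rels"))
        doc = ET.fromstring(z.read("word/document.xml"))

    # Map relationship ids (e.g. "rId4") to targets (e.g. "media/image1.png").
    targets = {rel.get("Id"): rel.get("Target")
               for rel in rels.iter(f"{{{PKG_REL}}}Relationship")}

    # For every paragraph, collect its text and the images it embeds.
    paragraphs = []
    for p in doc.iter(f"{{{W}}}p"):
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        imgs = [targets.get(b.get(f"{{{R}}}embed"))
                for b in p.iter(f"{{{A}}}blip")]
        paragraphs.append((text, imgs))

    texts = [text for text, _ in paragraphs]
    results = []
    for i, (_, imgs) in enumerate(paragraphs):
        # Nearest non-empty paragraph on each side of the image's paragraph.
        before = next((t for t in reversed(texts[:i]) if t.strip()), "")
        after = next((t for t in texts[i + 1:] if t.strip()), "")
        for img in imgs:
            if img:  # skip linked (non-embedded) images with no target
                results.append((img, before, after))
    return results
```

This does not give page numbers, and paragraphs nested in text boxes or tables are treated like any other paragraph, so real documents may need extra handling.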

docx2python pulls the images into a folder and leaves -----image1.png---- markers in the extracted text. This might get you close to where you'd like to go.

A few months ago, I reworked docx2python to reproduce a structured (leveled) XML file from a docx file, and it works pretty well on many files.

As far as I know, a paragraph contains several Runs, and each Run contains a single piece of text and sometimes images. You can read this document for details: https://docs.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.paragraph?view=openxml-2.8.1 .

docx2python supports extracting images together with the text around them. When you read paragraphs with docx2python, ----media/imagen---- appears in your text as an image placeholder. You can then get at the image itself if you set extract_image=True . So you get the image's name in the paragraph text, plus a list of image files. Match them as you like.
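
The matching step could look roughly like the sketch below. It uses only the standard library and does not call docx2python itself. It assumes you already have the extracted paragraphs as a plain list of strings and that placeholders look like a file name wrapped in runs of dashes (as in ----media/image1.png----). The exact marker format may differ between docx2python versions, so adjust the pattern to what you actually see. The name `captions_for_images` is made up for this example.

```python
import re

# Assumed placeholder shape: 3+ dashes, an optional "media/" prefix,
# a file name with an extension, then 3+ dashes.
MARKER = re.compile(r"-{3,}((?:media/)?\w+\.\w+)-{3,}")


def captions_for_images(paragraphs):
    """Pair each image placeholder with the text around it.

    paragraphs: list of strings, in document order.
    (Hypothetical helper, not part of docx2python.)
    """
    found = []
    for i, par in enumerate(paragraphs):
        for match in MARKER.finditer(par):
            # Text sharing the paragraph with the placeholder, markers removed.
            inline = MARKER.sub("", par).strip()
            # First non-empty paragraph after the placeholder's paragraph.
            following = next(
                (p.strip() for p in paragraphs[i + 1:] if p.strip()), "")
            found.append({
                "image": match.group(1),
                "same_paragraph": inline,
                "next_paragraph": following,
            })
    return found
```

The `image` values can then be looked up among the extracted image files by name.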

solution1
0 2019-04-12 13:46:31

solution2
0 2019-07-10 21:10:56

solution3
0 2022-01-06 02:44:49