Extracting images from a Word document using Python

Question

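How can I use Python to extract images/logos from a Word document and store them in a folder? The code below converts the docx to HTML, but it does not extract the images from the HTML. Any pointers or suggestions would be a great help.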

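    import mammoth

    profile_path = "<file path>"  # path to the .docx file
    profile_html = "<html path>"  # output path; left undefined in the original snippet
    # mammoth expects a binary file object, so open the docx before converting
    with open(profile_path, "rb") as f, open(profile_html, "wb") as b:
        document = mammoth.convert_to_html(f)
        b.write(document.value.encode("utf8"))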

Answer 1

You can use the docx2txt library. It reads your .docx document and exports the images to a directory you specify (the directory must already exist).

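# In a Jupyter notebook; from a shell, run: pip install docx2txt
!pip install docx2txt
import docx2txt
# Returns the document text and writes the images into the given directory
text = docx2txt.process("/path/your_word_doc.docx", '/home/example/img/')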

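After running this, the images will be in /home/example/img/ and the variable text will hold the document text. The images are named image1.png ... imageN.png, in order of appearance.

Note: the Word document must be in .docx format.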
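
Since the target directory has to exist before docx2txt.process is called, a small wrapper can create it first. This is only a sketch: prepare_image_dir and extract_images are hypothetical helper names, and docx2txt is imported inside the function so that the snippet still loads where the package is not installed.

```python
import os


def prepare_image_dir(img_dir):
    """Create img_dir (and any missing parents) and return it unchanged."""
    os.makedirs(img_dir, exist_ok=True)
    return img_dir


def extract_images(docx_path, img_dir):
    """Return the document text and the file names written to img_dir."""
    import docx2txt  # third-party package, imported only when actually needed

    text = docx2txt.process(docx_path, prepare_image_dir(img_dir))
    return text, sorted(os.listdir(img_dir))
```

Calling extract_images("/path/your_word_doc.docx", "/home/example/img/") would then behave like the snippet above, without the directory having to be created by hand.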

Answer 2

Extract all the images in a docx file using python

1. Using docx2txt

import docx2txt
# extract the text only
text = docx2txt.process(r"filepath_of_docx")
# extract the text and write the images into a temporary image directory
text = docx2txt.process(r"filepath_of_docx", r"Temporary_Image_Directory")

2. Using Aspose

import aspose.words as aw
# load the Word document
doc = aw.Document(r"filepath")
# retrieve all shapes
shapes = doc.get_child_nodes(aw.NodeType.SHAPE, True)
imageIndex = 0
# loop through shapes
for shape in shapes:
    shape = shape.as_shape()
    if shape.has_image:
        # set image file's name
        imageFileName = f"Image.ExportImages.{imageIndex}_{aw.FileFormatUtil.image_type_to_extension(shape.image_data.image_type)}"
        # save image
        shape.image_data.save(imageFileName)
        imageIndex += 1

Answer 3

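Natively, without any library

Extract the source images from the docx (which is a variant of a zip file) without distortion or conversion.

Shell out to the OS and run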

tar -m -xf DocxWithImages.docx word/media

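You will find the JPEG, PNG, WMF or other source images in the word/media folder, extracted into a folder of that name. These are the pure source embeddings, with no scaling or cropping applied.

You may be surprised to find that the visible area is larger than any cropped version used in the docx itself. Be aware that Word does not always crop images the way you expect, which is a source of embarrassing redaction failures.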
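
The tar command only works with a tar build that can read zip archives (bsdtar, as shipped with recent Windows and macOS, can; GNU tar cannot). Where that is not available, roughly the same result can be had from Python's standard library. This is a sketch rather than an exact equivalent: unpack_docx is a hypothetical helper name, and it unpacks the whole archive instead of only word/media.

```python
import os
import shutil


def unpack_docx(docx_path, out_dir):
    """Unpack the docx as a zip archive and return the files under word/media."""
    # format="zip" is needed because .docx is not a registered archive extension
    shutil.unpack_archive(docx_path, out_dir, format="zip")
    media_dir = os.path.join(out_dir, "word", "media")
    if not os.path.isdir(media_dir):
        return []  # the document has no embedded media
    return sorted(os.path.join(media_dir, name) for name in os.listdir(media_dir))
```

As with the tar command, the files are written byte for byte, so they are the uncropped source images.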

Answer 4

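See Alderven's answer to "Extract all the images in a docx file using python".

zipfile works with more image formats than docx2txt. For example, EMF images cannot be extracted with docx2txt, but they can be extracted with zipfile.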
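
A minimal sketch of the zipfile approach, assuming the images sit under word/media/ inside the archive as they do in a normal .docx. The name save_media is hypothetical and this is not Alderven's exact code. Nothing is filtered by extension, so EMF and WMF files come out alongside PNG and JPEG.

```python
import os
import zipfile


def save_media(docx_path, out_dir):
    """Copy every file under word/media/ into out_dir, whatever its format."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as archive:
        for info in archive.infolist():
            # skip directory entries and everything outside word/media/
            if info.is_dir() or not info.filename.startswith("word/media/"):
                continue
            target = os.path.join(out_dir, os.path.basename(info.filename))
            with open(target, "wb") as out_file:
                out_file.write(archive.read(info))
            saved.append(target)
    return saved
```

The function returns the paths it wrote, so the caller can check which formats were actually present in the document.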
