Extracting images from Word documents using Python docx2txt

Question

I am trying to use docx2txt to extract a bunch of images from the same number of Word documents (i.e. each Word document has one image saved in it, and nothing else; don't ask me how I ended up here). The problem I'm encountering is that the process function in docx2txt saves the first image from any given Word file as "image1", the second as "image2", and so on. Since I'm iterating through a list of Word documents, each time it finds an image in the next document, it overwrites the previously saved "image1". My question: is there any way to avoid this issue using the docx2txt package? I've read through the documentation, which is pretty sparse and does not seem to indicate a way to change the names of the image files you save (i.e. instead of defaulting to "image1", saving it as "image_n" for n in my list range). Below is my code. Any suggestions or links to further reading would be sincerely appreciated.

import docx2txt
import os

path ="whatever the path is"
savepath = "wherever one would want to save this"

files = []
for file in os.listdir(path):
    if file.endswith('.docx'):
        files.append(file) 

for i in range(len(files)):
    image = docx2txt.process(path+ "/" +files[i], savepath) ## this is the line that overwrites each new image

I understand why it doesn't work, but there doesn't seem to be another way to handle saving images with this package. Once again, any suggestions would be appreciated.

(PS: I've already looked at other questions regarding this issue on SO, but they seem to be focused on extracting multiple images from one document, not a single image from each of multiple documents.)

Answer 1

https://github.com/ankushshah89/python-docx2txt/blob/c94663234d2882aa75932f9c9973eb5a804df13b/docx2txt/docx2txt.py#L72

The second argument to process specifies the directory the images are saved to, so instead of

for i in range(len(files)):
    image = docx2txt.process(path+ "/" +files[i], savepath) ## this is the line that overwrites each new image

you could specify a separate save path for each document:

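for i in range(len(files)):
    doc_savepath = savepath + str(i)  ## a fresh name each time, so the suffixes do not pile up
    os.makedirs(doc_savepath, exist_ok=True)  ## create the folder first, in case process() does not
    image = docx2txt.process(path + "/" + files[i], doc_savepath)  ## each document's image1 now lands in its own folder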
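If a single output folder with distinct file names is preferred, one alternative is to skip docx2txt for the images and read the archive directly. This is only a sketch under a few assumptions: a .docx file is a zip archive with embedded pictures under word/media/, the extension set below is an illustrative choice, and extract_images and extract_all are hypothetical helper names, not part of docx2txt.

```python
import zipfile
from pathlib import Path

# Assumed set of image extensions; extend it if your documents use others.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}


def extract_images(docx_path, out_dir):
    """Copy every image in one .docx into out_dir, prefixed with the document name."""
    docx_path = Path(docx_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as archive:
        for member in archive.namelist():
            name = Path(member)
            # Embedded pictures live under word/media/ inside the archive.
            if member.startswith("word/media/") and name.suffix.lower() in IMAGE_EXTENSIONS:
                # Prefixing with the document stem keeps "image1" from colliding.
                target = out_dir / f"{docx_path.stem}_{name.name}"
                target.write_bytes(archive.read(member))
                saved.append(target)
    return saved


def extract_all(doc_dir, out_dir):
    """Run extract_images over every .docx in doc_dir, in sorted order."""
    saved = []
    for docx_path in sorted(Path(doc_dir).glob("*.docx")):
        saved.extend(extract_images(docx_path, out_dir))
    return saved
```

Calling extract_all(path, savepath) would then write files such as report_image1.png for a document named report.docx, so nothing is overwritten as long as the document names are unique.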

Answer 2

My eventual solution: I ended up saving each image in a folder named after the text contained in the same Word document, using the following:

import docx2txt
import os

path ="path"
savepath = "savepath"

## Collects name information from the word files

files = []
correctedfiles = []
for file in os.listdir(path):
    if file.endswith('.docx'):
        files.append(file)

## Checking above    

for x in range(len(files)):
    print(files[x])

## Makes one folder per document, named after the text in that document

for i in range(len(files)):
    textresult = docx2txt.process(os.path.join(path, files[i]))
    print(textresult)
    correctresult = textresult.replace('June 2019 ', '')
    ## os.makedirs copes with spaces in the name, unlike os.system('mkdir ...')
    os.makedirs(os.path.join(savepath, correctresult), exist_ok=True)

## Saves each document's images in the folder named after its text

for i in range(len(files)):
    textresult = docx2txt.process(os.path.join(path, files[i]))
    correctresult = textresult.replace('June 2019 ', '')
    image = docx2txt.process(os.path.join(path, files[i]), os.path.join(savepath, correctresult))

There are some extra bits that strip excess words from the title of the Word docs (the 'June 2019 ' stuff).
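One caveat with naming folders after extracted text is that the text may contain newlines or characters that are not allowed in folder names. A minimal sketch of a clean-up step, assuming a hypothetical helper called safe_folder_name (not part of docx2txt) and a replacement character chosen purely for illustration:

```python
import re


def safe_folder_name(text, fallback="untitled"):
    """Turn extracted document text into a single, filesystem-friendly folder name."""
    # Collapse all whitespace (including newlines) into single spaces.
    name = " ".join(text.split())
    # Replace characters that Windows or POSIX reject in file names.
    name = re.sub(r'[<>:"/\\|?*]', "_", name)
    # Trailing dots and spaces are awkward on Windows, so trim them.
    name = name.strip(" .")
    # Fall back to a placeholder if nothing usable is left.
    return name or fallback
```

It could be applied to correctresult before the folder is created, for example safe_folder_name(textresult.replace('June 2019 ', '')).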

2 answers

Answer 1: score 1, accepted, 2020-12-16 02:37:09

Answer 2: score 0, 2020-12-16 21:58:26
