Extracting Images from Word Documents Using Python docx2txt
I am trying to use docx2txt to extract a bunch of images from the same number of Word documents (i.e. each Word document has one image saved in it, and nothing else; don't ask me how I ended up here). The problem I'm encountering is that the function "process" in docx2txt saves the first image from any given Word file as "image1", the second as "image2", etc. Since I'm iterating through a list of Word documents, every time it finds an image in the next document, it overwrites the previously saved "image1".
My question: is there any way to avoid this issue using the docx2txt package? I've read through the documentation, which is pretty sparse and does not seem to indicate a way to change the names of the image files you save (i.e. instead of defaulting to "image1", I might be able to save it as "image_n" for n in my list range). Below is my code. Any suggestions or links to further reading would be sincerely appreciated.
import docx2txt
import os
path = "whatever the path is"
savepath = "wherever one would want to save this"
files = []
for file in os.listdir(path):
    if file.endswith('.docx'):
        files.append(file)
for i in range(len(files)):
    image = docx2txt.process(path + "/" + files[i], savepath)  ## this is the line that overwrites each new image
I understand why it doesn't work, but there doesn't seem to be another way to handle saving images with this package. Once again, any suggestions would be appreciated.
(PS: I've already looked at other questions regarding this issue on SO, but they seem to be focused on extracting multiple images from one document, not a single image from multiple documents.)
https://github.com/ankushshah89/python-docx2txt/blob/c94663234d2882aa75932f9c9973eb5a804df13b/docx2txt/docx2txt.py#L72
process takes the directory to save images into as its second argument, so instead of
for i in range(len(files)):
    image = docx2txt.process(path + "/" + files[i], savepath)  ## this is the line that overwrites each new image
you could specify a separate save path for each document:
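for i in range(len(files)):
    doc_savepath = savepath + str(i)  ## build from the base path each time, so the suffixes do not pile up
    os.makedirs(doc_savepath, exist_ok=True)  ## make sure the folder exists before images are written into it
    image = docx2txt.process(path + "/" + files[i], doc_savepath)  ## each document now saves into its own folder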
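If a single output folder with distinct file names is preferred over one folder per document, one possible approach is to let docx2txt write into a throwaway folder and then move each image out under a name that includes the document's file name. The sketch below is only an illustration under that assumption: the helper name extract_images_with_doc_name and the "<document>_<image>" naming scheme are made up here and are not part of docx2txt; the only docx2txt call used is process(docx, img_dir) as in the question.

```python
import os
import shutil
import tempfile


def extract_images_with_doc_name(docx_path, savepath):
    """Extract images from one .docx and prefix each with the document's name.

    Hypothetical helper, not part of docx2txt.
    """
    import docx2txt  # third-party package; imported where it is used

    stem = os.path.splitext(os.path.basename(docx_path))[0]
    os.makedirs(savepath, exist_ok=True)
    saved = []
    # docx2txt writes image1.png, image2.png, ... into the temporary folder;
    # each file is then moved out under a name that is unique per document.
    with tempfile.TemporaryDirectory() as tmpdir:
        docx2txt.process(docx_path, tmpdir)
        for name in sorted(os.listdir(tmpdir)):
            target = os.path.join(savepath, f"{stem}_{name}")
            shutil.move(os.path.join(tmpdir, name), target)
            saved.append(target)
    return saved
```

In the loop from the question this would be called as extract_images_with_doc_name(path + "/" + files[i], savepath), so "report.docx" would yield "report_image1.png" rather than "image1.png" (the exact extension depends on the embedded image).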
My eventual solution: I ended up saving each image in a folder named after the string contained in the corresponding Word document, using the following:
import docx2txt
import os
path = "path"
savepath = "savepath"
## Collects name information from the word files
files = []
for file in os.listdir(path):
    if file.endswith('.docx'):
        files.append(file)
## Checking above
for x in range(len(files)):
    print(files[x])
## Makes one folder per document and names it by the ID found in the document's text
for i in range(len(files)):
    textresult = docx2txt.process(path + "/" + files[i])
    print(textresult)
    correctresult = textresult.replace('June 2019 ', '')
    ## os.makedirs copes with spaces in the name, unlike passing it to a shell 'mkdir'
    os.makedirs(savepath + '/' + correctresult, exist_ok=True)
## Saves images based on name in folders
for i in range(len(files)):
    textresult = docx2txt.process(path + "/" + files[i])
    correctresult = textresult.replace('June 2019 ', '')
    image = docx2txt.process(path + "/" + files[i], savepath + '/' + correctresult)
There are some extra bits dealing with excess words in the title of the Word docs (the 'June 2019 ' stuff).
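Because the folder name here comes straight from the document's text, it may contain spaces, line breaks, or characters that are not allowed in folder names on some systems. A minimal sketch of cleaning the text first is shown below; the helper name folder_name_from_text, the underscore replacement, and the "untitled" fallback are assumptions for illustration, and the 'June 2019 ' prefix is simply the one from the code above.

```python
import re


def folder_name_from_text(text, prefix="June 2019 "):
    """Turn text extracted from a document into a folder-friendly name.

    Hypothetical helper; the cleaning rules are illustrative choices.
    """
    name = text.replace(prefix, "").strip()
    # Collapse runs of whitespace (including newlines) into single underscores.
    name = re.sub(r"\s+", "_", name)
    # Drop characters that are invalid or awkward in folder names on common systems.
    name = re.sub(r'[<>:"/\\|?*]', "", name)
    # Fall back to a placeholder if nothing usable is left.
    return name or "untitled"
```

With this, correctresult could be computed as folder_name_from_text(textresult) before the folder is created and before the images are saved into it.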