Extracting images from Word documents with Python docx2txt

Question

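I am trying to use docx2txt to extract a bunch of images from the same number of Word documents (that is, each Word document contains one image and nothing else; don't ask how I got here). The problem is that the "process" function in docx2txt saves the first image in a given Word file as "image1", the second as "image2", and so on. Because I am looping over a list of Word documents, each time it extracts the image from the next document it overwrites the previously saved image, which is also named "image1".

My question: is there a way to avoid this using the docx2txt package? I have read its documentation, which is quite sparse, and it does not seem to describe any way to change the file name of the saved image (for example, instead of defaulting to "image1", saving it as "image_n" for n in the range of my list). My code is below. Any suggestions or links to further reading would be much appreciated.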

import docx2txt
import os

path ="whatever the path is"
savepath = "wherever one would want to save this"

files = []
for file in os.listdir(path):
    if file.endswith('.docx'):
        files.append(file) 

for i in range(len(files)):
    image = docx2txt.process(path+ "/" +files[i], savepath) ## this is the line that overwrites each new image

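I understand why it doesn't work, but there does not seem to be another way to handle saving images with this package. Again, any suggestions would be much appreciated.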

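(PS: I have looked at other questions on SO about this, but they seem to focus on extracting multiple images from one document rather than a single image from each of several documents.)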

Answer 1

https://github.com/ankushshah89/python-docx2txt/blob/c94663234d2882aa75932f9c9973eb5a804df13b/docx2txt/docx2txt.py#L72

The second argument of "process" is the directory the images are written to, so instead of

for i in range(len(files)):
    image = docx2txt.process(path+ "/" +files[i], savepath) ## this is the line that overwrites each new image

you can specify a separate save path for each document:

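for i in range(len(files)):
    target = savepath + str(i)  ## build a fresh folder name each time; reassigning savepath would accumulate the digits
    os.makedirs(target, exist_ok=True)  ## make sure the folder exists before images are written into it
    image = docx2txt.process(path+ "/" +files[i], target) ## each document's image1 now lands in its own folder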
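
If one folder per document is more than is needed, another option is to skip the image-saving part of docx2txt altogether. A .docx file is a zip archive, and its embedded images are stored under word/media/, so they can be copied out with the standard library and named after the source document. The following is a minimal sketch under that assumption; the function name extract_images and the "<document name>_<image name>" naming scheme are made up for illustration and are not part of docx2txt.

```python
import os
import zipfile


def extract_images(docx_path, out_dir):
    """Copy every file under word/media/ out of docx_path into out_dir.

    Each image is saved as "<document name>_<original image name>", so
    image1.png from a.docx and image1.png from b.docx do not collide.
    (Hypothetical helper, not part of docx2txt.)
    """
    os.makedirs(out_dir, exist_ok=True)
    # "report.docx" -> "report", used as the prefix of every saved image
    stem = os.path.splitext(os.path.basename(docx_path))[0]
    saved = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Skip everything that is not a media file inside the archive
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            target = os.path.join(out_dir, stem + "_" + os.path.basename(name))
            with open(target, "wb") as out_file:
                out_file.write(archive.read(name))
            saved.append(target)
    return saved
```

Inside the loop from the question this would be called as extract_images(path + "/" + files[i], savepath), leaving every image in a single folder under a distinct name.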

Answer 2

My final solution: I ended up saving each image in a folder named after a string that also appears in the Word document, using the following:

import docx2txt
import os

path ="path"
savepath = "savepath"

## Collects name information from the word files

files = []
correctedfiles = []
for file in os.listdir(path):
    if file.endswith('.docx'):
        files.append(file)

## Checking above    

for x in range(len(files)):
    print(files[x])

## Creates one folder per document, named after the ID found in its text

for i in range(len(files)):
    textresult = docx2txt.process(path + "/" + files[i])
    print(textresult)
    correctresult = textresult.replace('June 2019 ', '')
    ## os.makedirs copes with names containing spaces, unlike a shell mkdir
    os.makedirs(savepath + '/' + correctresult, exist_ok=True)

## Saves each image into the folder named after its document

for i in range(len(files)):
    textresult = docx2txt.process(path + "/" + files[i])
    correctresult = textresult.replace('June 2019 ', '')
    image = docx2txt.process(path+ "/" +files[i], savepath + '/' + correctresult)

There are some extra bits in there (the 'June 2019' stuff) that deal with superfluous words in the title of the Word docs.
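
Because this approach uses text taken from the document as a folder name, a stray newline or a character such as "/" in that text could produce an unexpected or invalid path. The sketch below shows one way the name could be cleaned first. The helper safe_folder_name, its default prefix, and the "untitled" fallback are assumptions for illustration, not something from the original answer or from docx2txt.

```python
import re


def safe_folder_name(text, prefix="June 2019 "):
    """Turn a document's extracted text into a name usable as a folder.

    Hypothetical helper: drops the unwanted prefix, collapses runs of
    whitespace (including newlines) to single spaces, and replaces
    characters that are not allowed in file names on common systems
    with "_".
    """
    name = text.replace(prefix, "")
    name = re.sub(r"\s+", " ", name).strip()
    name = re.sub(r'[<>:"/\\|?*]', "_", name)
    # Fall back to a fixed name if nothing usable is left
    return name or "untitled"
```

In the loops above, correctresult = safe_folder_name(textresult) would take the place of the replace call.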

Answer 1: score 1, accepted, 2020-12-16 02:37:09

Answer 2: score 0, 2020-12-16 21:58:26
