How to extract data from doc/docx files using Python

Question

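I know there are similar questions out there, but I couldn't find one that answers mine. What I need is a way to access certain data in MS-Word files and save it in an XML file. Reading up on python-docx did not help, as it only seems to allow writing to Word documents, not reading them. To describe my task exactly (or rather, how I chose to approach it): I would like to search the document for a keyword or phrase (the document contains tables) and extract the text data from the table in which the keyword/phrase is found. Does anybody have any ideas?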

Answer 1

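A docx file is a zip archive that contains the document's XML. You can open the archive, read the document part, and parse the data with ElementTree.

The advantage of this technique is that you don't need to install any extra Python libraries.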

import zipfile
import xml.etree.ElementTree

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'
TABLE = WORD_NAMESPACE + 'tbl'
ROW = WORD_NAMESPACE + 'tr'
CELL = WORD_NAMESPACE + 'tc'

with zipfile.ZipFile('<path to docx file>') as docx:
    tree = xml.etree.ElementTree.XML(docx.read('word/document.xml'))

for table in tree.iter(TABLE):
    for row in table.iter(ROW):
        for cell in row.iter(CELL):
            # A cell's text is split across several <w:t> nodes; join them.
            print(''.join(node.text or '' for node in cell.iter(TEXT)))

See my Stack Overflow answer to "How to read contents of a table in an MS-Word file using Python?" for more details and references.
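To connect this back to the question (find the table that contains a keyword, then save it as XML), here is a minimal sketch built on the same zipfile and ElementTree approach. The helper names (`read_tables`, `tables_containing`, `tables_to_xml`, `make_sample_docx`) and the `<tables>/<table>/<row>/<cell>` output layout are assumptions made for illustration, not part of the original answer or of any standard. The sample document is generated in memory only so the sketch runs without a real file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
W = '{%s}' % NS


def read_tables(docx_file):
    """Return each table as a list of rows; each row is a list of cell strings."""
    with zipfile.ZipFile(docx_file) as docx:
        tree = ET.fromstring(docx.read('word/document.xml'))
    tables = []
    for table in tree.iter(W + 'tbl'):
        rows = []
        for row in table.iter(W + 'tr'):
            cells = [''.join(t.text or '' for t in cell.iter(W + 't'))
                     for cell in row.iter(W + 'tc')]
            rows.append(cells)
        tables.append(rows)
    return tables


def tables_containing(docx_file, keyword):
    """Keep only the tables in which some cell contains the keyword."""
    return [table for table in read_tables(docx_file)
            if any(keyword in cell for row in table for cell in row)]


def tables_to_xml(tables):
    """Serialize tables to a small, made-up XML layout (not a standard schema)."""
    root = ET.Element('tables')
    for table in tables:
        table_el = ET.SubElement(root, 'table')
        for row in table:
            row_el = ET.SubElement(table_el, 'row')
            for cell in row:
                ET.SubElement(row_el, 'cell').text = cell
    return ET.tostring(root, encoding='unicode')


def make_sample_docx():
    """Build a tiny two-table .docx in memory (hypothetical test data)."""
    def tbl(rows):
        return '<w:tbl>%s</w:tbl>' % ''.join(
            '<w:tr>%s</w:tr>' % ''.join(
                '<w:tc><w:p><w:r><w:t>%s</w:t></w:r></w:p></w:tc>' % text
                for text in row)
            for row in rows)
    body = tbl([['Name', 'Alice']]) + tbl([['Invoice', '42']])
    xml = '<w:document xmlns:w="%s"><w:body>%s</w:body></w:document>' % (NS, body)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as z:
        z.writestr('word/document.xml', xml)
    buf.seek(0)
    return buf


matches = tables_containing(make_sample_docx(), 'Invoice')
print(tables_to_xml(matches))
```

With a real file, pass its path instead of `make_sample_docx()`. Note that `iter()` walks all descendants, so a table nested inside a cell contributes its rows to the outer table and is also reported as a table of its own.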

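In response to the comment below: how to extract images is less obvious. I created an empty docx and inserted one image into it. I then opened the docx file as a zip archive (using 7zip) and looked at document.xml. All of the image information is stored in XML attributes rather than as element text, which is how the document text is stored. So you need to find the tag you are interested in and pull out the information you are looking for.

For example, adding to the script above: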

IMAGE = '{http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing}' + 'docPr'

for image in tree.iter(IMAGE):
    print(image.attrib)

Output:

{'id': '1', 'name': 'Picture 1'}

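I'm not an expert on the openxml format, but I hope this helps.

I did notice that the zip file contains a directory called media, which holds a file called image1.jpeg: a renamed copy of my embedded image. You can look around inside the docx zip file to see what is available.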
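Looking around inside the archive can also be done from Python. The following is a minimal sketch that assumes embedded images live under `word/media/`, as observed above. The helper names `list_media`, `read_media`, and `make_sample_docx`, and the fake image bytes, are made up for illustration.

```python
import io
import zipfile

MEDIA_PREFIX = 'word/media/'


def list_media(docx_file):
    """Return the names of archive members stored under word/media/."""
    with zipfile.ZipFile(docx_file) as docx:
        return [name for name in docx.namelist()
                if name.startswith(MEDIA_PREFIX) and not name.endswith('/')]


def read_media(docx_file):
    """Return {member name: raw bytes} for every embedded media file."""
    with zipfile.ZipFile(docx_file) as docx:
        return {name: docx.read(name) for name in docx.namelist()
                if name.startswith(MEDIA_PREFIX) and not name.endswith('/')}


def make_sample_docx():
    """Build an in-memory archive that mimics a docx with one embedded image."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as z:
        z.writestr('word/document.xml', '<document/>')
        z.writestr('word/media/image1.jpeg', b'not a real jpeg')
    buf.seek(0)
    return buf


media_names = list_media(make_sample_docx())
media_files = read_media(make_sample_docx())
print(media_names)
```

To save an image, write the bytes from `read_media` to a file of your choice. With a real document, pass its path instead of `make_sample_docx()`.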

Answer 2

To search a document with python-docx:

# Import the module
from docx import *

# Open the .docx file
document = opendocx('A document.docx')

# Search returns true if found    
search(document,'your search string')

There is also a function that returns the text of the document:

https://github.com/mikemaccana/python-docx/blob/master/docx.py#L910

# Import the module
from docx import *

# Open the .docx file
document = opendocx('A document.docx')
fullText=getdocumenttext(document)

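This uses https://github.com/mikemaccana/python-docx. The functions above (opendocx, search, getdocumenttext) belong to that legacy version; current python-docx releases expose a Document class instead.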

Answer 3

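pywin32 seems to do the trick. You can iterate over all the tables in a document and over all the cells in a table. Getting the data is a bit tricky (the last 2 characters of every entry have to be dropped), but other than that it's ten minutes of code. If anyone needs more details, please say so in the comments.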
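The answer gives no code, so the following is only a sketch of what such a loop might look like, not the author's script. It assumes Windows with MS Word and pywin32 installed, and it assumes the two trailing characters are Word's end-of-cell marker (`'\r\x07'`). The helper names `clean_cell_text` and `read_word_tables` are made up for illustration.

```python
def clean_cell_text(raw):
    """Drop the two-character end-of-cell marker that Word appends to cell text."""
    return raw[:-2] if raw.endswith('\r\x07') else raw


def read_word_tables(path):
    """Return one list of cell strings per table (needs Windows, Word, pywin32)."""
    # Imported lazily so clean_cell_text stays usable on machines without pywin32.
    from win32com import client
    word = client.Dispatch('Word.Application')
    try:
        doc = word.Documents.Open(path)  # Word expects an absolute path
        try:
            return [[clean_cell_text(cell.Range.Text) for cell in table.Range.Cells]
                    for table in doc.Tables]
        finally:
            doc.Close(False)  # close without saving changes
    finally:
        word.Quit()


print(clean_cell_text('Total\r\x07'))
```

Because this drives Word itself, it also works for legacy .doc files, which the zip-based approach cannot read.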

Answer 4

A simpler library that can also extract images:

pip install docx2txt

Then use the following code to read a docx file:

import docx2txt
text = docx2txt.process("file.docx")

Answer 5

Extracting text from doc/docx files using Python:

import os
import docx2txt
from win32com import client as wc

def extract_text_from_docx(path):
    temp = docx2txt.process(path)
    text = [line.replace('\t', ' ') for line in temp.split('\n') if line]
    final_text = ' '.join(text)
    return final_text

def extract_text_from_doc(doc_path):
    # Convert .doc to .docx with Word (needs Windows, MS Word and pywin32),
    # then reuse the docx extractor. Word expects absolute paths.
    joined_path = os.path.join(root_path, save_file_name)
    w = wc.Dispatch('Word.Application')
    doc = w.Documents.Open(doc_path)
    doc.SaveAs(joined_path, 16)  # 16 = wdFormatDocumentDefault (.docx)
    doc.Close()
    w.Quit()
    return extract_text_from_docx(joined_path)

def extract_text(file_path, extension):
    text = ''
    if extension == '.docx':
        text = extract_text_from_docx(file_path)
    elif extension == '.doc':
        text = extract_text_from_doc(file_path)
    return text

file_path = r'<path to doc/docx file>'
root_path = r'<directory where the doc was downloaded>'
save_file_name = "Final2_text_docx.docx"
extension = os.path.splitext(file_path)[1].lower()
final_text = extract_text(file_path, extension)
print(final_text)

5 answers

Answer 1
13 2016-05-09 01:42:37

Answer 2
2 2014-03-31 08:36:07

Answer 3
0 accepted 2014-04-08 06:54:21

Answer 4
0 2019-09-09 07:40:04

Answer 5
0 2022-12-22 07:26:45
