Reading .doc (and .docx) files in Python on Windows using antiword

Question

I tried reading a .doc file like this:

with open('file.doc', errors='ignore') as f:
    text = f.read()

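It does read the file, but the result contains a lot of garbage, and I cannot strip the garbage out because I do not know where it starts and ends.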
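The garbage appears because a .doc file is a binary container rather than plain text, so opening it in text mode just decodes raw bytes. As a minimal sketch, the container type can be guessed from the first bytes of the file. This assumes the commonly documented signatures (OLE2 compound file for legacy .doc, ZIP for .docx); `sniff_word_format` is a hypothetical helper name, not part of any library.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .doc (OLE2 compound file)
ZIP_MAGIC = b"PK\x03\x04"                        # .docx is a ZIP archive


def sniff_word_format(path):
    """Guess the container format from the first bytes of the file (hypothetical helper)."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(OLE2_MAGIC):
        return "doc"
    if head.startswith(ZIP_MAGIC):
        return "docx"
    return "unknown"
```

A check like this only identifies the container; other Office formats share the same signatures, so the file extension is still needed to tell them apart.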

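I also tried installing the textract module, which claims to read any file format, but installing it on Windows runs into many dependency problems.

So, as an alternative, I did this with the antiword command-line utility; my answer is below.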

Answer 1

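You can use the antiword command-line utility for this. I know most of you will have tried it already, but I still want to share it.

Download antiword from here.

Extract the antiword folder to C:\ and add the path C:\antiword to the PATH environment variable.

Here is an example of how to use it, handling both .docx and .doc files: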

import os
import docx2txt

def get_doc_text(filepath, file):
    full_path = os.path.join(filepath, file)
    if file.endswith('.docx'):
        # docx2txt needs the full path, not just the file name
        return docx2txt.process(full_path)
    elif file.endswith('.doc'):
        # antiword prints plain text; redirect it to a temporary file
        # named like the .doc with an extra 'x' (it is not a real .docx)
        tmp_file = full_path + 'x'
        if not os.path.exists(tmp_file):
            # quote both paths so names containing spaces work
            os.system('antiword "' + full_path + '" > "' + tmp_file + '"')
            with open(tmp_file) as f:
                text = f.read()
            os.remove(tmp_file)  # the temporary file was only needed for reading
        else:
            # a real .docx with the same base name already exists;
            # skip the .doc so that file is not overwritten
            print('Info: a .docx file with the same name as the .doc exists, so the .doc is skipped')
            text = ''
        return text
    return ''  # neither .doc nor .docx
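The temporary file can be avoided by capturing antiword's standard output directly. The following is a sketch of that variation, not the answer's own method. `run_converter` and `read_doc_with_antiword` are hypothetical names, and decoding the output as UTF-8 is an assumption, since the encoding antiword emits depends on how it is configured.

```python
import subprocess


def run_converter(command, path, encoding="utf-8"):
    """Run `command` with `path` appended and return its standard output as text."""
    result = subprocess.run(
        list(command) + [path],   # argument list, so no shell quoting is needed
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,               # raise CalledProcessError on a non-zero exit code
    )
    # Assumed encoding; undecodable bytes are replaced rather than raising
    return result.stdout.decode(encoding, errors="replace")


def read_doc_with_antiword(path):
    """Extract text from a .doc file; assumes `antiword` is on PATH."""
    return run_converter(["antiword"], path)
```

Passing the arguments as a list also sidesteps the same-name .docx collision, because nothing is written to disk.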

Now call the function:

filepath = "D:\\input\\"
files = os.listdir(filepath)
for file in files:
    text = get_doc_text(filepath, file)
    print(text)

This may be a good way to read .doc files with Python on Windows.

Hope this helps, thanks.

Answer 2

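Mithilesh's example is good, but once antiword is installed it is simpler to use textract directly. Download antiword and extract the antiword folder to C:\. Then add the antiword folder to your PATH environment variable (instructions for adding to PATH here). Open a new terminal or command console so that the updated PATH is loaded. Then run pip install textract.

You can then use textract like this (it uses antiword for .doc files):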

import textract
raw = textract.process('filename.doc')  # returns a bytestring
text = raw.decode('utf-8')  # converts from bytestring to string

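If you get an error, try running the antiword command from a terminal/console to make sure it works. Also make sure the path to the .doc file is correct (for example with os.path.exists('filename.doc')).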
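Both checks can be folded into a small pre-flight function that runs before any extraction. This is only a sketch: `preflight` is a hypothetical helper, and it confirms that the tool can be found on PATH, not that it actually runs correctly.

```python
import os
import shutil


def preflight(doc_path, tool="antiword"):
    """Return a list of human-readable problems; the list is empty if none were found."""
    problems = []
    if shutil.which(tool) is None:
        problems.append(f"'{tool}' was not found on PATH")
    if not os.path.exists(doc_path):
        problems.append(f"file does not exist: {doc_path}")
    return problems


# Example usage: report problems before attempting to extract text
for problem in preflight("filename.doc"):
    print("Problem:", problem)
```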
