Reading .doc files with Python

Question

I was given a test for a job application, and my task is to read some .doc files. Does anyone know a library that can do this? I started with plain Python code:

f = open('test.doc', 'r')
f.read()

But this does not return a readable string. I need to convert it to UTF-8.

Edit: I only want to get the text out of this file.

Answer 1

You can use the textract library. It handles both "doc" and "docx".

import textract
text = textract.process("path/to/file.extension")

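You can also use antiword directly (sudo apt-get install antiword). It writes the document's plain text to standard output, so redirect it to a text file rather than to a .docx:

antiword filename.doc > filename.txt

In the end, textract itself uses antiword on the back end for .doc files.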

Answer 2

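You can use the python-docx2txt library to read text from Microsoft Word documents. It improves on the python-docx library because it can also extract text from links, headers, and footers. It can even extract images.

You can install it by running: pip install docx2txt

Let's download and read the first Microsoft document here: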

import docx2txt
my_text = docx2txt.process("test.docx")
print(my_text)

Edit:

This does not work for .doc files. The only reason I am keeping this answer is that some people seem to find it useful for .docx files.

Answer 3

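I was trying to do the same thing. I found plenty of information on reading .docx but very little on .doc. Anyway, I managed to read the text with the following (it needs Windows with Microsoft Word installed):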

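import os
import win32com.client

word = win32com.client.Dispatch("Word.Application")
word.Visible = False
# Word may not resolve relative paths against the script's directory,
# so pass an absolute path
doc = word.Documents.Open(os.path.abspath("myfile.doc"))
print(doc.Range().Text)
doc.Close(False)  # close without saving changes
word.Quit()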

Answer 4

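Shivam Kotwalia's answer works very well. However, the result comes back as a bytes object. Sometimes you need it as a string, for example to run a regex on it.

I recommend the following code (the two lines from Shivam Kotwalia's answer plus one more):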

import textract

text = textract.process("path/to/file.extension")
text = text.decode("utf-8")

The last line converts the bytes object text to a string.
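To illustrate why the decode step matters for regex work, here is a minimal sketch. The bytes value raw is a made-up stand-in for what textract.process() returns, so the example runs without textract installed; the sample text is an assumption.

```python
import re

# Stand-in for the bytes returned by textract.process(); the content is invented.
raw = "Invoice no. 4711, Größe: 42\n".encode("utf-8")

# errors="replace" substitutes U+FFFD for undecodable bytes instead of raising
text = raw.decode("utf-8", errors="replace")

# A str pattern can only be applied to str, not to bytes
numbers = re.findall(r"\d+", text)
print(numbers)
```

Applying the same str pattern to raw directly would raise a TypeError, which is the situation the decode call avoids.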

Answer 5

I agree with Shivam's answer, except that textract is not available on Windows. Also, for some reason, antiword could not read my ".doc" file and gave this error:

'filename.doc' is not a word document. # This happens when the file wasn't generated via MS Office. Eg: Web-pages may be stored in .doc format offline.

So I use the following workaround to extract the text:

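from bs4 import BeautifulSoup as bs

filename = 'filename.doc'  # a web page that was saved with a .doc extension
with open(filename, encoding='utf-8', errors='ignore') as f:
    soup = bs(f.read(), 'html.parser')
# Drop <style> and <script> elements so only the visible text remains
for s in soup(['style', 'script']):
    s.extract()
tmpText = soup.get_text()
# Remove tabs and newlines, then trim surrounding whitespace
text = tmpText.replace('\t', '').replace('\n', '').strip()
print(text)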

This script works for most files of this kind. Have fun!
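Since a ".doc" extension says little about what is inside the file, it can help to look at the first bytes before picking a tool. The sketch below checks a few well-known signatures: the OLE2 compound-file header used by legacy Word documents, the ZIP header used by .docx, an RTF prefix, and an HTML prefix. The function names sniff_format and sniff_file are hypothetical, and the HTML check is only a rough heuristic.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy binary Word container
ZIP_MAGIC = b"PK\x03\x04"                        # .docx files are ZIP archives
UTF8_BOM = b"\xef\xbb\xbf"


def sniff_format(head: bytes) -> str:
    """Guess the container type from the first bytes of a file."""
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    if head.startswith(b"{\\rtf"):
        return "rtf"
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    # Heuristic only: HTML pages usually start with a doctype or <html> tag
    if head.lstrip().lower().startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 64 bytes of path and classify them."""
    with open(path, "rb") as f:
        return sniff_format(f.read(64))
```

A result of "ole2" suggests antiword or textract, "zip" suggests a .docx reader, and "html" suggests the BeautifulSoup workaround above.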

Answer 6

Prerequisites:

Install antiword: sudo apt-get install antiword

Install docx: pip install docx

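from subprocess import Popen, PIPE


def document_to_text(filename, file_path):
    # filename is kept for the original call signature; only file_path is used
    cmd = ['antiword', file_path]
    p = Popen(cmd, stdout=PIPE)
    stdout, stderr = p.communicate()
    # 'ignore' silently drops any non-ASCII characters from the output
    return stdout.decode('ascii', 'ignore')


print(document_to_text('your_file_name', 'your_file_path'))

Note: antiword alone is enough for .doc files. The docx package is only needed if you extend this to .docx files with its opendocx and getdocumenttext functions. Newer versions of python-docx removed those functions, so be sure to pip install docx rather than the new python-docx.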

Answer 7

!pip install python-docx

import docx

#Creating a word file object
doc = open("file.docx","rb")

#creating word reader object
document = docx.Document(doc)
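The snippet above stops once the Document object exists; with python-docx the text is then available through document.paragraphs, where each paragraph has a .text attribute. Note that this only applies to .docx files. As a dependency-free sketch of the same idea, a .docx is a ZIP archive whose main body lives in word/document.xml, so the standard library can read it. The function name docx_paragraphs is hypothetical, and the sketch ignores headers, footers, and footnotes.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of WordprocessingML elements such as <w:p> and <w:t>
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the text of each paragraph in the main body of a .docx file."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> elements
        paragraphs.append("".join(t.text or "" for t in p.iter(W_NS + "t")))
    return paragraphs
```

Calling "\n".join(docx_paragraphs("file.docx")) gives one string for the whole body.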

Answer 8

I had to do the same thing to search a large number of *.doc files for a specific number, and came up with this:

special_chars = {
    "b'\\t'": '\t',
    "b'\\r'": '\n',
    "b'\\x07'": '|',
    "b'\\xc4'": 'Ä',
    "b'\\xe4'": 'ä',
    "b'\\xdc'": 'Ü',
    "b'\\xfc'": 'ü',
    "b'\\xd6'": 'Ö',
    "b'\\xf6'": 'ö',
    "b'\\xdf'": 'ß',
    "b'\\xa7'": '§',
    "b'\\xb0'": '°',
    "b'\\x82'": '‚',
    "b'\\x84'": '„',
    "b'\\x91'": '‘',
    "b'\\x93'": '“',
    "b'\\x96'": '-',
    "b'\\xb4'": '´'
}


def get_string(path):
    string = ''
    with open(path, 'rb') as stream:
        stream.seek(2560) # Offset - text starts after byte 2560
        current_stream = stream.read(1)
        while not (str(current_stream) == "b'\\xfa'"):
            if str(current_stream) in special_chars.keys():
                string += special_chars[str(current_stream)]
            else:
                try:
                    char = current_stream.decode('UTF-8')
                    if char.isalnum():
                        string += char
                except UnicodeDecodeError:
                    string += ''
            current_stream = stream.read(1)
    return string

I am not sure how "clean" this solution is, but it works well with regular expressions.
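The search over many files that this answer describes could be sketched as follows. The helper name find_in_docs and its extract parameter are hypothetical; extract is any function that takes a path and returns the document text, for example get_string from above.

```python
import re
from pathlib import Path


def find_in_docs(folder, pattern, extract):
    """Yield (file name, match) for every regex match in every .doc under folder."""
    regex = re.compile(pattern)
    # "*.doc" does not match names ending in ".docx"
    for path in sorted(Path(folder).rglob("*.doc")):
        text = extract(str(path))
        # With no capture groups in the pattern, findall returns the matched strings
        for match in regex.findall(text):
            yield path.name, match


# Example call, assuming a folder named "reports" and six-digit numbers:
# for name, hit in find_in_docs("reports", r"\b\d{6}\b", get_string):
#     print(name, hit)
```

Passing the extractor as an argument keeps the search loop independent of how the text is obtained, so any of the approaches in the other answers can be plugged in.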

Answer 9

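I looked for a solution for a long time. There is not much material on .doc files, so in the end I solved the problem by converting the .doc file to .docx: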

import os
from win32com import client as wc

w = wc.Dispatch('Word.Application')
# Or use the following method to start a separate process:
# w = wc.DispatchEx('Word.Application')
doc = w.Documents.Open(os.path.abspath('test.doc'))
# 16 is wdFormatDocumentDefault, which saves as .docx
doc.SaveAs(os.path.abspath("test_docx.docx"), 16)
doc.Close()
w.Quit()

Answer 10

If you are looking for a way to read a doc file in Python, this code will do it. Install all the relevant packages first, then check the result.

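If the file is a .doc:

    import os
    from io import BytesIO

    import pythoncom
    import requests
    import win32com.client as win32

    # `request` is not defined in this answer; it is presumably the web
    # framework's incoming request object carrying the uploaded file's URL
    _file = requests.get(request.values['MediaUrl0'])

    doc_file_link = BytesIO(_file.content)

    # Save the download to a temporary .doc file in the working directory
    file_path = os.path.join(os.getcwd(), 'data.doc')
    with open(file_path, 'wb') as E:
        E.write(doc_file_link.getbuffer())

    # Initialize COM for this thread before starting Word
    pythoncom.CoInitialize()
    word = win32.gencache.EnsureDispatch('Word.Application')
    doc = word.Documents.Open(file_path)
    doc.Activate()
    doc_data = doc.Range().Text
    print(doc_data)
    doc.Close(False)

    if os.path.exists(file_path):
        os.remove(file_path)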

10 solutions

Solution 1 40 2017-03-31 08:18:42

Solution 2 27 2016-03-15 07:04:30

Solution 3 23 2018-06-11 10:54:59

Solution 4 10 2019-11-08 18:02:43

Solution 5 6 2019-06-14 05:53:56

Solution 6 4 2017-12-26 06:32:53

Solution 7 1 2022-02-27 09:19:52

Solution 8 0 2021-07-07 09:25:50

Solution 9 0 2021-07-28 08:36:05

Solution 10 0 2022-12-27 15:26:07
