Extracting header and footer content from a .doc file using Python textract or an alternative library

Question

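I am trying to extract the page content together with the header and footer content. I tried the textract library. It works fine for .docx, but not for .doc.

I also looked at other libraries, but none of them worked for me.

Below is the code snippet I use for .docx: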

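import textract


def convert_to_txt(filename):
    """Extract the text of `filename` with textract and return it."""
    try:
        my_text = textract.process(filename, encoding='ascii')
    except Exception as e:
        msg = "Unable to open the file: {}".format(filename)
        # Chain the original exception so the real cause is not lost.
        raise RuntimeError(msg) from e
    return my_text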

Answer 1

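There is a better way to solve this problem:

Extraction method

Using the MS Word XML document

A .docx file is a zip archive. Opening it with the zipfile module gives access to the XML parts of the Word document, and the text can then be extracted by walking a few simple XML nodes.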
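Before hard-coding any part names, it can help to look at what the archive actually contains. The following is a minimal sketch, assuming the header and footer parts live under word/ and are named like header1.xml or footer2.xml; the function name list_header_footer_parts and the pattern PART_PATTERN are made up for illustration.

```python
import re
import zipfile

# Assumed naming scheme: word/header<N>.xml and word/footer<N>.xml.
PART_PATTERN = re.compile(r"^word/(header|footer)\d*\.xml$")


def list_header_footer_parts(path_or_file):
    """Return the sorted names of header/footer XML parts in a .docx archive."""
    with zipfile.ZipFile(path_or_file) as archive:
        return sorted(
            name for name in archive.namelist() if PART_PATTERN.match(name)
        )
```

Calling list_header_footer_parts("some_file.docx") on a real document shows which header and footer files are present, so the extraction code can be pointed at names that exist.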

Below is working code that extracts the header, footer, and text data from a docx file.

from xml.etree.ElementTree import XML
import zipfile

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'


def get_docx_text(path):
    """
    Take the path of a docx file as argument, return the text as a str.
    """
    # These part names are hard-coded; other documents may name their
    # header/footer parts differently (for example header1.xml), in which
    # case document.read() raises KeyError.
    contentToRead = ["header2.xml", "document.xml", "footer2.xml"]
    paragraphs = []

    with zipfile.ZipFile(path) as document:
        for xmlfile in contentToRead:
            xml_content = document.read('word/{}'.format(xmlfile))
            tree = XML(xml_content)
            # Element.getiterator() was removed in Python 3.9; use iter().
            for paragraph in tree.iter(PARA):
                texts = [node.text
                         for node in paragraph.iter(TEXT)
                         if node.text]
                if texts:
                    textData = ''.join(texts)
                    if xmlfile == "footer2.xml":
                        extractedTxt = "Footer : " + textData
                    elif xmlfile == "header2.xml":
                        extractedTxt = "Header : " + textData
                    else:
                        extractedTxt = textData

                    paragraphs.append(extractedTxt)
    return '\n\n'.join(paragraphs)


print(get_docx_text("E:\\path_to.docx"))
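The function above only reads header2.xml and footer2.xml, which may not exist in every document. As a hedged variation on the same idea, the sketch below collects text from every header and footer part it finds in the archive. The names extract_sections and _paragraphs are hypothetical, and the word/header<N>.xml naming is an assumption about how the parts are stored.

```python
import re
import zipfile
from xml.etree.ElementTree import XML

W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _paragraphs(xml_bytes):
    """Return the non-empty paragraph texts found in one XML part."""
    tree = XML(xml_bytes)
    result = []
    for para in tree.iter(W_NS + "p"):
        text = "".join(node.text for node in para.iter(W_NS + "t") if node.text)
        if text:
            result.append(text)
    return result


def extract_sections(path_or_file):
    """Group paragraph text into header, body and footer lists."""
    sections = {"header": [], "body": [], "footer": []}
    with zipfile.ZipFile(path_or_file) as archive:
        for name in sorted(archive.namelist()):
            # Assumed naming scheme for header/footer parts.
            match = re.match(r"^word/(header|footer)\d*\.xml$", name)
            if match:
                kind = match.group(1)
            elif name == "word/document.xml":
                kind = "body"
            else:
                continue  # styles, settings, relationships, etc.
            sections[kind].extend(_paragraphs(archive.read(name)))
    return sections
```

The result is a dictionary, so extract_sections(path)["header"] gives the header paragraphs without relying on a "Header : " prefix in the text. Like the original answer, this only handles .docx; a legacy .doc file is not a zip archive and would need to be converted first.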

Solution 1
1 2018-04-01 11:22:54
