Extract header and footer content from a .doc file using Python textract or an alternative library

Question

I'm trying to extract page content along with header and footer content. I tried the textract library: it works well for .docx, but not for .doc.

I tried other libraries, but none of them worked for me.

Below is the snippet I use for .docx:

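import textract


def convert_to_txt(filename):
    try:
        # textract.process returns the extracted text as bytes
        my_text = textract.process(filename, encoding='ascii')
    except Exception as e:
        msg = "Couldn't open the file: {}".format(filename)
        # Chain the original exception so the real cause is not lost
        raise RuntimeError(msg) from e
    return my_text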

Answer 1

There is a better solution to this problem:

Extraction method

Using the Word XML document format (.docx)

A .docx file is a zip archive. Open it with the zipfile module to get access to the XML parts of the Word document, then use simple XML node extraction to pull out the text.
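
The archive members can be listed to see which XML parts a given file actually contains. Header and footer parts are usually stored as word/headerN.xml and word/footerN.xml, and N varies between documents. The following is a minimal sketch under that naming assumption; the helper name list_header_footer_parts is hypothetical and not part of any library.

```python
import re
import zipfile

# Assumed naming convention for header/footer parts inside a .docx archive.
PART_PATTERN = re.compile(r"word/(header|footer)\d*\.xml")


def list_header_footer_parts(docx):
    """Return the sorted header/footer part names found in a .docx archive.

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as archive:
        return sorted(name for name in archive.namelist()
                      if PART_PATTERN.fullmatch(name))
```

The returned names show which header and footer files are worth reading for a particular document.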

The following working code extracts header, footer, and body text from a docx file.

import zipfile
from xml.etree.ElementTree import XML

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'

# Prefix added to each paragraph, keyed by the part it came from.
LABELS = {"header2.xml": "Header : ", "footer2.xml": "Footer : "}


def get_docx_text(path):
    """
    Take the path of a docx file as argument, return its header, body and
    footer text as a single string.
    """
    content_to_read = ["header2.xml", "document.xml", "footer2.xml"]
    paragraphs = []

    with zipfile.ZipFile(path) as document:
        available = set(document.namelist())
        for xmlfile in content_to_read:
            part = 'word/{}'.format(xmlfile)
            # Header/footer part names vary between documents; skip missing ones.
            if part not in available:
                continue
            tree = XML(document.read(part))
            # Element.getiterator() was removed in Python 3.9; iter() replaces it.
            for paragraph in tree.iter(PARA):
                texts = [node.text for node in paragraph.iter(TEXT) if node.text]
                if texts:
                    paragraphs.append(LABELS.get(xmlfile, "") + ''.join(texts))
    return '\n\n'.join(paragraphs)


print(get_docx_text("E:\\path_to.docx"))
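
The code above only reads .docx archives. A legacy .doc file is a binary format and cannot be opened with zipfile. One possible workaround is to convert the .doc to .docx first and then run get_docx_text on the result. The sketch below assumes LibreOffice is installed and its soffice executable is on PATH; the helper names build_convert_command and convert_doc_to_docx are hypothetical.

```python
import os
import subprocess


def build_convert_command(doc_path, out_dir):
    """Build the LibreOffice command line that converts doc_path to .docx."""
    return ["soffice", "--headless", "--convert-to", "docx",
            "--outdir", out_dir, doc_path]


def convert_doc_to_docx(doc_path, out_dir):
    """Convert a .doc file and return the expected path of the .docx output.

    Assumes LibreOffice is installed and `soffice` is on PATH.
    """
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
    base = os.path.splitext(os.path.basename(doc_path))[0]
    return os.path.join(out_dir, base + ".docx")
```

The path returned by convert_doc_to_docx can then be passed to get_docx_text.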

Answer 1
1 2018-04-01 11:22:54
