Extract Header and Footer content from a .doc file using python textract or alternative lib
I'm trying to extract page content along with the header and footer content. I tried the textract library. It works well for .docx, but not for .doc.
I also tried other libraries, but none of them worked for me.
Below is the snippet I use for .docx:
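import textract

def convert_to_txt(filename):
    try:
        # textract.process returns the extracted text as bytes
        my_text = textract.process(filename, encoding='ascii')
    except Exception as e:
        msg = "Could not open the file: {}".format(filename)
        # Chain the original exception so the root cause stays visible
        raise RuntimeError(msg) from e
    return my_text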
There is a better solution to this problem:
Extraction method
Using the Word document's underlying XML
A .docx file is a ZIP archive (unlike the older binary .doc format), so opening it with the zipfile module gives you access to the document's XML parts. You can then extract the text with simple XML node traversal.
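Before reading any particular part, it can help to check which header and footer parts the archive actually contains. The sketch below is only an illustration: `list_header_footer_parts` is a hypothetical helper name, and it assumes the parts follow the conventional `word/header<N>.xml` / `word/footer<N>.xml` naming that Word typically produces (the authoritative mapping lives in the document's relationships file, which this sketch does not read).

```python
import re
import zipfile

# Assumption: header/footer parts use the conventional names
# word/header<N>.xml and word/footer<N>.xml.
_PART_RE = re.compile(r"word/(header|footer)\d*\.xml$")


def list_header_footer_parts(path):
    """Return the sorted names of header/footer XML parts in a .docx archive.

    `path` may be a filesystem path or a file-like object.
    """
    with zipfile.ZipFile(path) as archive:
        return sorted(name for name in archive.namelist()
                      if _PART_RE.match(name))
```

Calling `list_header_footer_parts("some_file.docx")` on your own document shows which numbered parts exist, which matters because the numbering differs from file to file.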
The following working code extracts the header, footer, and body text from a docx file.
# xml.etree.cElementTree was removed in Python 3.9; ElementTree is used directly
from xml.etree.ElementTree import XML
import zipfile

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'

def get_docx_text(path):
    """
    Take the path of a docx file as argument, return the text in unicode.
    """
    # Part names vary between documents (header1.xml, footer3.xml, ...);
    # adjust this list to the parts present in your file.
    contentToRead = ["header2.xml", "document.xml", "footer2.xml"]
    paragraphs = []
    with zipfile.ZipFile(path) as document:
        for xmlfile in contentToRead:
            xml_content = document.read('word/{}'.format(xmlfile))
            tree = XML(xml_content)
            # iter() replaces getiterator(), which was removed in Python 3.9
            for paragraph in tree.iter(PARA):
                texts = [node.text
                         for node in paragraph.iter(TEXT)
                         if node.text]
                if texts:
                    textData = ''.join(texts)
                    if xmlfile == "footer2.xml":
                        extractedTxt = "Footer : " + textData
                    elif xmlfile == "header2.xml":
                        extractedTxt = "Header : " + textData
                    else:
                        extractedTxt = textData
                    paragraphs.append(extractedTxt)
    return '\n\n'.join(paragraphs)

print(get_docx_text("E:\\path_to.docx"))
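The code above hard-codes header2.xml and footer2.xml, so it raises a KeyError on documents whose parts are numbered differently. A possible variation, offered as a sketch rather than a definitive implementation, discovers the parts from the archive listing instead. The names `extract_docx_sections` and `part_paragraphs` are hypothetical, and the sketch again assumes the conventional `word/header<N>.xml` / `word/footer<N>.xml` naming.

```python
import re
import zipfile
from xml.etree.ElementTree import XML

W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
# Assumption: header/footer parts follow the conventional naming scheme.
PART_RE = re.compile(r"word/(header|footer)\d*\.xml$")


def part_paragraphs(archive, name):
    """Return the non-empty paragraph strings of one XML part."""
    tree = XML(archive.read(name))
    result = []
    for para in tree.iter(W_NS + "p"):
        # A paragraph's text is spread over several <w:t> runs; join them.
        text = "".join(node.text for node in para.iter(W_NS + "t") if node.text)
        if text:
            result.append(text)
    return result


def extract_docx_sections(path):
    """Return {'header': [...], 'body': [...], 'footer': [...]} for a .docx.

    `path` may be a filesystem path or a file-like object.
    """
    sections = {"header": [], "body": [], "footer": []}
    with zipfile.ZipFile(path) as archive:
        for name in sorted(archive.namelist()):
            match = PART_RE.match(name)
            if match:
                # match.group(1) is either "header" or "footer"
                sections[match.group(1)].extend(part_paragraphs(archive, name))
        sections["body"] = part_paragraphs(archive, "word/document.xml")
    return sections
```

Returning a dictionary instead of prefixed strings keeps the header, body, and footer text separate, so the caller can decide how to label or join them. Note that a document may carry several header parts (first page, even pages, default), and this sketch simply collects all of them.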