Extract header and footer content from a .doc file using Python textract or an alternative lib
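I am trying to extract the page content along with the header and footer content. I tried the textract library. It works fine for .docx, but not for .doc.
I also looked at other libraries, but none of them worked for me.
Below is the code snippet I have for .docx: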
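import textract

def convert_to_txt(filename):
    try:
        # textract.process returns the extracted text as bytes
        my_text = textract.process(filename, encoding='ascii')
    except Exception as e:
        msg = "Unable to open the file: {}".format(filename)
        # Chain the original exception so the root cause is not lost
        raise RuntimeError(msg) from e
    return my_text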
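There is a better way to solve this problem:
Extraction method
Use the MS Word XML document
A .docx file is already a zip archive, so opening it with the zipfile module gives access to the document's XML parts, and the text can then be extracted from simple XML nodes.
Below is working code that extracts the header, footer, and body text from a docx file.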
from xml.etree.ElementTree import XML
import zipfile

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'

def get_docx_text(path):
    """
    Take the path of a docx file as argument, return the text in unicode.
    """
    # Header and footer part names vary between documents
    # (header1.xml, footer3.xml, ...); adjust this list to match your file.
    contentToRead = ["header2.xml", "document.xml", "footer2.xml"]
    paragraphs = []
    with zipfile.ZipFile(path) as document:
        for xmlfile in contentToRead:
            xml_content = document.read('word/{}'.format(xmlfile))
            tree = XML(xml_content)
            # Element.getiterator() was removed in Python 3.9; use iter()
            for paragraph in tree.iter(PARA):
                texts = [node.text
                         for node in paragraph.iter(TEXT)
                         if node.text]
                if texts:
                    textData = ''.join(texts)
                    if xmlfile == "footer2.xml":
                        extractedTxt = "Footer : " + textData
                    elif xmlfile == "header2.xml":
                        extractedTxt = "Header : " + textData
                    else:
                        extractedTxt = textData
                    paragraphs.append(extractedTxt)
    return '\n\n'.join(paragraphs)

print(get_docx_text("E:\\path_to.docx"))
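The code above hard-codes header2.xml and footer2.xml, but a given docx may store its headers and footers under other numbers (header1.xml, footer3.xml, and so on). As a minimal sketch, assuming the parts live under word/ and follow the headerN.xml / footerN.xml naming, the archive listing can be scanned instead. The names PART_PATTERN, paragraphs_in, and get_docx_parts are hypothetical helpers introduced here for illustration, not part of the original answer.

```python
import re
import zipfile
from xml.etree.ElementTree import XML

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'

# Assumption: header/footer parts are named word/headerN.xml or word/footerN.xml.
PART_PATTERN = re.compile(r'^word/(header|footer)\d*\.xml$')


def paragraphs_in(xml_bytes):
    """Return the non-empty paragraph texts found in one XML part."""
    result = []
    for paragraph in XML(xml_bytes).iter(PARA):
        text = ''.join(node.text for node in paragraph.iter(TEXT) if node.text)
        if text:
            result.append(text)
    return result


def get_docx_parts(path):
    """Hypothetical helper: collect header, body, and footer paragraphs."""
    parts = {'header': [], 'body': [], 'footer': []}
    with zipfile.ZipFile(path) as document:
        # Sort the names so the parts are visited in a predictable order.
        for name in sorted(document.namelist()):
            match = PART_PATTERN.match(name)
            if match:
                parts[match.group(1)].extend(paragraphs_in(document.read(name)))
            elif name == 'word/document.xml':
                parts['body'].extend(paragraphs_in(document.read(name)))
    return parts
```

Calling get_docx_parts("E:\\path_to.docx") would return a dict with 'header', 'body', and 'footer' lists. This still reads docx only; a legacy .doc file is not a zip archive, so it would need to be converted to docx first.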