
Parsing word doc XML using lxml in Python 3

I have some XML:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document mc:Ignorable="w14 w15 w16se w16cid wp14" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:w16se="http://schemas.microsoft.com/office/word/2015/wordml/symex" xmlns:w16cid="http://schemas.microsoft.com/office/word/2016/wordml/cid" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:am3d="http://schemas.microsoft.com/office/drawing/2017/model3d" xmlns:aink="http://schemas.microsoft.com/office/drawing/2016/ink" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:cx8="http://schemas.microsoft.com/office/drawing/2016/5/14/chartex" xmlns:cx7="http://schemas.microsoft.com/office/drawing/2016/5/13/chartex" xmlns:cx6="http://schemas.microsoft.com/office/drawing/2016/5/12/chartex" xmlns:cx5="http://schemas.microsoft.com/office/drawing/2016/5/11/chartex" xmlns:cx4="http://schemas.microsoft.com/office/drawing/2016/5/10/chartex" xmlns:cx3="http://schemas.microsoft.com/office/drawing/2016/5/9/chartex" xmlns:cx2="http://schemas.microsoft.com/office/drawing/2015/10/21/chartex" xmlns:cx1="http://schemas.microsoft.com/office/drawing/2015/9/8/chartex" 
xmlns:cx="http://schemas.microsoft.com/office/drawing/2014/chartex" xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas">
   <w:body>
     <w:sectPr w:rsidRPr="0019552A" w:rsidR="0019552A">
       <w:pgSz w:w="12240" w:h="15840"/>
       <w:pgMar w:gutter="0" w:footer="720" w:header="720" 
        w:left="1440" w:bottom="1440" w:right="1440" w:top="1440"/>
       <w:cols w:space="720"/> 
       <w:docGrid w:linePitch="360"/>
     </w:sectPr>
   </w:body>
</w:document>

I want to grab the page margin and page size data (w:w='12240', w:gutter='0', etc.). Using lxml, I managed to grab the element containing the pgMar data:

from lxml import etree
root = etree.fromstring(xml)
ns = {'w': root.nsmap['w']}
print(root.find('w:body/w:sectPr/w:pgMar', ns))

but I can't figure out how to grab the attributes.

print(root.find('w:body/w:sectPr/w:pgMar', ns).attrib['w:footer'])

doesn't appear to work.

Working solution:

from lxml import etree

xml_str = b'''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document mc:Ignorable="w14 w15 w16se w16cid wp14" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:w16se="http://schemas.microsoft.com/office/word/2015/wordml/symex" xmlns:w16cid="http://schemas.microsoft.com/office/word/2016/wordml/cid" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:am3d="http://schemas.microsoft.com/office/drawing/2017/model3d" xmlns:aink="http://schemas.microsoft.com/office/drawing/2016/ink" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:cx8="http://schemas.microsoft.com/office/drawing/2016/5/14/chartex" xmlns:cx7="http://schemas.microsoft.com/office/drawing/2016/5/13/chartex" xmlns:cx6="http://schemas.microsoft.com/office/drawing/2016/5/12/chartex" xmlns:cx5="http://schemas.microsoft.com/office/drawing/2016/5/11/chartex" xmlns:cx4="http://schemas.microsoft.com/office/drawing/2016/5/10/chartex" xmlns:cx3="http://schemas.microsoft.com/office/drawing/2016/5/9/chartex" xmlns:cx2="http://schemas.microsoft.com/office/drawing/2015/10/21/chartex" xmlns:cx1="http://schemas.microsoft.com/office/drawing/2015/9/8/chartex" 
xmlns:cx="http://schemas.microsoft.com/office/drawing/2014/chartex" xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas">
   <w:body>
     <w:sectPr w:rsidRPr="0019552A" w:rsidR="0019552A">
       <w:pgSz w:w="12240" w:h="15840"/>
       <w:pgMar w:gutter="0" w:footer="720" w:header="720"
        w:left="1440" w:bottom="1440" w:right="1440" w:top="1440"/>
       <w:cols w:space="720"/>
       <w:docGrid w:linePitch="360"/>
     </w:sectPr>
   </w:body>
</w:document>'''

root = etree.fromstring(xml_str)
ns = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
ns_pfx = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
pgSz_el = root.find('.//w:pgSz', ns)
pgMar_el = root.find('.//w:pgMar', ns)

print(pgSz_el.get(ns_pfx + 'w'))           # 12240
print(pgMar_el.get(ns_pfx + 'footer'))     # 720
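The reason attrib['w:footer'] fails is that lxml stores namespaced attribute names in Clark notation ({namespace-uri}localname), not with the document's prefix. As a rough sketch of two alternatives, the snippet below strips the namespace with etree.QName to get a plain dict, and reads a single value through XPath, which does accept the w: prefix on attributes. The trimmed-down XML and the helper name local_attrs are made up for illustration and are not part of the original post.

```python
from lxml import etree

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

# Trimmed-down stand-in for the document above (only the w namespace declared).
xml_str = (
    b'<w:document xmlns:w="' + W_NS.encode() + b'">'
    b'<w:body><w:sectPr>'
    b'<w:pgSz w:w="12240" w:h="15840"/>'
    b'<w:pgMar w:gutter="0" w:footer="720" w:header="720" '
    b'w:left="1440" w:bottom="1440" w:right="1440" w:top="1440"/>'
    b'</w:sectPr></w:body></w:document>'
)

root = etree.fromstring(xml_str)
ns = {'w': W_NS}
pgMar_el = root.find('w:body/w:sectPr/w:pgMar', ns)

# Attribute keys are Clark-notation strings such as '{...main}footer'.
print(sorted(pgMar_el.attrib))


def local_attrs(el):
    """Hypothetical helper: map local attribute names to their values."""
    return {etree.QName(name).localname: value
            for name, value in el.attrib.items()}


margins = local_attrs(pgMar_el)
page_size = local_attrs(root.find('.//w:pgSz', ns))
print(margins['footer'], page_size['w'], page_size['h'])

# XPath resolves the prefix through the namespaces mapping.
footer = root.xpath('string(w:body/w:sectPr/w:pgMar/@w:footer)', namespaces=ns)
print(footer)
```

Dropping the namespace is only safe when an element has no two attributes sharing a local name across different namespaces, which holds for pgSz and pgMar here.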

