Python - Remove header and footer from docx file

I need to remove headers and footers from many docx files. I am currently trying the python-docx library, but it doesn't support headers and footers in docx documents at this time (work in progress).

Is there any way to achieve that in Python?

As I understand it, docx is an XML-based format, but I don't know how to work with it.

P.S. I have an idea to use lxml or BeautifulSoup to parse the XML and replace some parts, but it looks dirty.

UPD. Thanks to Shawn for a good starting point. I made some changes to the script. This is my final version (it's useful for me because I need to edit many .docx files). I'm using BeautifulSoup because the standard XML parser can't get a valid XML tree. Also, my docx documents don't have a header and footer in the XML; the header and footer images are simply placed at the top of the page. For more speed you can use lxml instead of Soup.

import zipfile
import shutil as su
import os
import tempfile
from bs4 import BeautifulSoup


def get_xml_from_docx(docx_filename):
    """
        Return content of document.xml file inside docx document
    """
    with zipfile.ZipFile(docx_filename) as zf:
        xml_info = zf.read('word/document.xml')
    return xml_info


def write_and_close_docx(docx_filename, edited_xml, output_filename):
    """ Create a temp directory, expand the original docx zip.
        Write the modified xml to word/document.xml
        Zip it up as the new docx
    """
    tmp_dir = tempfile.mkdtemp()

    with zipfile.ZipFile(docx_filename) as zf:
        # Get a list of all the files in the original docx zipfile
        filenames = zf.namelist()
        zf.extractall(tmp_dir)

    # docx XML parts are UTF-8, so write the edited XML with that encoding
    with open(os.path.join(tmp_dir, 'word', 'document.xml'), 'w', encoding='utf-8') as f:
        f.write(str(edited_xml))

    # Now, create the new zip file and add all the files into the archive;
    # the with-block closes it so the zip central directory gets written
    with zipfile.ZipFile(output_filename, 'w', zipfile.ZIP_DEFLATED) as docx:
        for filename in filenames:
            docx.write(os.path.join(tmp_dir, filename), filename)

    # Clean up the temp dir
    su.rmtree(tmp_dir)


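if __name__ == '__main__':
    directory = 'your_directory/'
    # The output directory has to exist before the new files are written
    os.makedirs('edited', exist_ok=True)
    for file in os.listdir(directory):
        if file.endswith('.docx'):
            word_doc = os.path.join(directory, file)
            # splitext drops only the extension; rstrip('.docx') would also
            # strip trailing 'd', 'o', 'c', 'x' and '.' characters of the name
            base_name = os.path.splitext(file)[0]
            new_word_doc = os.path.join('edited', base_name + '-edited.docx')
            tree = get_xml_from_docx(word_doc)
            soup = BeautifulSoup(tree, 'xml')
            shapes = soup.find_all('shape')
            for shape in shapes:
                # A shape without a style attribute yields '' instead of None
                if 'margin-left:0pt' in shape.get('style', ''):
                    shape.parent.decompose()
            write_and_close_docx(word_doc, soup, new_word_doc)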

So, that's it :) I know the code isn't clean, sorry for that.

Well, I've never thought about it, but I just created a test.docx with a header and a footer. Once you have that docx, you can unzip it to get the constituent XML files. For my simple test case this yielded:

word/
_rels           footer1.xml     styles.xml
document.xml        footnotes.xml       stylesWithEffects.xml
endnotes.xml        header1.xml     theme
fontTable.xml       settings.xml        webSettings.xml
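
The same listing can be obtained from Python without unzipping to disk. This is a minimal sketch using only the standard zipfile module; the helper names `list_docx_parts` and `header_footer_parts` are made up for illustration, and the in-memory archive is only a stand-in for a real docx file.

```python
import io
import zipfile


def list_docx_parts(docx_file):
    """Return the sorted names of all parts inside a docx archive."""
    with zipfile.ZipFile(docx_file) as zf:
        return sorted(zf.namelist())


def header_footer_parts(docx_file):
    """Return only the parts that look like header or footer XML files."""
    return [
        name for name in list_docx_parts(docx_file)
        if name.startswith(("word/header", "word/footer")) and name.endswith(".xml")
    ]


# Demo with a tiny in-memory archive standing in for a real docx file.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    for name in ("word/document.xml", "word/header1.xml",
                 "word/footer1.xml", "word/footnotes.xml"):
        zf.writestr(name, "<x/>")

print(header_footer_parts(demo))
```

For a real file, pass its path instead of the BytesIO object, for example `header_footer_parts('test.docx')`.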

Opening up word/document.xml gives you the main problem area. You can see that there are elements in there that involve the header and footer. In my simple case I got:

<w:headerReference w:type="default" r:id="rId7"/>
<w:footerReference w:type="default" r:id="rId8"/>

and

<w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="720" w:footer="720" w:gutter="0"/>

The whole document is actually small:

<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mo="http://schemas.microsoft.com/office/mac/office/2008/main" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:mv="urn:schemas-microsoft-com:mac:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 wp14">
<w:body>
  <w:p w:rsidR="009E6E8F" w:rsidRDefault="009E6E8F"/>
  <w:p w:rsidR="00B53FFA" w:rsidRDefault="00B53FFA"/>
  <w:p w:rsidR="00B53FFA" w:rsidRDefault="00B53FFA"/><w:p w:rsidR="00B53FFA" w:rsidRDefault="00B53FFA">
  <w:r>
  <w:t>MY BODY</w:t>
  </w:r>
  <w:bookmarkStart w:id="0" w:name="_GoBack"/>
  <w:bookmarkEnd w:id="0"/>
  </w:p>
  <w:sectPr w:rsidR="00B53FFA" w:rsidSect="009E6E8F">
  <w:headerReference w:type="default" r:id="rId7"/> 
  <w:footerReference w:type="default" r:id="rId8"/>
  <w:pgSz w:w="12240" w:h="15840"/>
  <w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="720" w:footer="720" w:gutter="0"/>

So XML manipulation is not going to be a problem, either in function or in performance, for something this size. Here is some code that should get your doc into Python, parsed as an XML tree, and saved back out as a docx. I have to go out now, so this isn't your complete solution, but I think it should get you well down the path. If you are still having trouble I will return later and see where you are with it.

import io
import os
import shutil as su
import tempfile
import zipfile
import xml.etree.ElementTree as ET


def get_word_xml(docx_filename):
    """Return the raw bytes of word/document.xml from the docx archive."""
    with zipfile.ZipFile(docx_filename) as zf:
        return zf.read('word/document.xml')


def get_xml_tree(xml_bytes):
    """Parse the XML bytes into an ElementTree."""
    # Register the declared prefixes so they are not renamed to ns0, ns1, ...
    # Note: ElementTree still drops declarations of prefixes it never uses.
    for _, (prefix, uri) in ET.iterparse(io.BytesIO(xml_bytes), events=['start-ns']):
        ET.register_namespace(prefix, uri)
    return ET.ElementTree(ET.fromstring(xml_bytes))


def write_and_close_docx(docx_filename, tree, output_filename):
    """ Create a temp directory, expand the original docx zip.
        Write the modified xml to word/document.xml
        Zip it up as the new docx
    """
    tmp_dir = tempfile.mkdtemp()
    try:
        with zipfile.ZipFile(docx_filename) as zf:
            # Get a list of all the files in the original docx zipfile
            filenames = zf.namelist()
            zf.extractall(tmp_dir)

        tree.write(os.path.join(tmp_dir, 'word', 'document.xml'),
                   encoding='UTF-8', xml_declaration=True)

        # Now, create the new zip file and add all the files into the archive
        with zipfile.ZipFile(output_filename, 'w', zipfile.ZIP_DEFLATED) as docx:
            for filename in filenames:
                docx.write(os.path.join(tmp_dir, filename), filename)
    finally:
        # Clean up the temp dir
        su.rmtree(tmp_dir)


if __name__ == '__main__':
    word_doc = 'TEXT.docx'
    new_word_doc = 'SLIM.docx'
    doc = get_word_xml(word_doc)
    tree = get_xml_tree(doc)
    write_and_close_docx(word_doc, tree, new_word_doc)
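
The answer stops before the removal step itself. One possible way to finish it, sketched here under assumptions rather than taken from the original answer, is to delete the w:headerReference and w:footerReference elements from every w:sectPr in document.xml. The function name `strip_header_footer_refs` and the constant `W_NS` are made up for illustration, and the sample XML is a trimmed version of the test document above. This only drops the references; the header1.xml and footer1.xml parts stay in the archive, and the result has not been checked against Word.

```python
import io
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, as declared on w:document above.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def strip_header_footer_refs(xml_bytes):
    """Return document.xml bytes with header/footer references removed."""
    # Register the declared prefixes so the output keeps w:, r:, ... names.
    for _, (prefix, uri) in ET.iterparse(io.BytesIO(xml_bytes), events=["start-ns"]):
        ET.register_namespace(prefix, uri)
    root = ET.fromstring(xml_bytes)
    targets = {f"{{{W_NS}}}headerReference", f"{{{W_NS}}}footerReference"}
    # A document can have several sections, each with its own w:sectPr.
    for sect_pr in root.iter(f"{{{W_NS}}}sectPr"):
        for child in list(sect_pr):
            if child.tag in targets:
                sect_pr.remove(child)
    return ET.tostring(root, encoding="UTF-8", xml_declaration=True)


sample = (
    b'<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" '
    b'xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">'
    b'<w:body><w:p/><w:sectPr>'
    b'<w:headerReference w:type="default" r:id="rId7"/>'
    b'<w:footerReference w:type="default" r:id="rId8"/>'
    b'<w:pgSz w:w="12240" w:h="15840"/>'
    b'</w:sectPr></w:body></w:document>'
)
cleaned = strip_header_footer_refs(sample)
print(cleaned.decode("utf-8"))
```

The returned bytes can be written to word/document.xml (opened in 'wb' mode) in the temp directory before the archive is zipped again. One caveat to check: ElementTree only writes namespace declarations for prefixes that are actually used, so an attribute such as mc:Ignorable="w14 wp14" can end up naming prefixes that are no longer declared. lxml keeps the original declarations and may be the safer choice for real documents.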
