
Extract a PDF OLE object from MS Word using win32com in Python

This is my very first question here.

I have a lot of MS Word files, each with one or more PDFs inserted as objects. I need to process all the Word files and extract the PDFs, saving them as PDF files while leaving each Word file exactly as I found it. So far I have this code to test it on one file:

import win32com.client as win32

# Start Word without showing its window
word = win32.Dispatch('Word.Application')
word.Application.Visible = False
doc1 = word.Documents.Open('C:\\word_merge\\docx_con_pdfs.docx')
for s in doc1.InlineShapes:
    # Only act on embedded Adobe Acrobat documents
    if s.OLEFormat.ClassType == 'AcroExch.Document.DC':
        # Runs the object's default verb, which opens the PDF in Adobe Reader
        s.OLEFormat.DoVerb()
_ = input("Hit Enter to Quit")
doc1.Close()
word.Application.Quit()

I know this works because s.OLEFormat.DoVerb() actually opens the files in Adobe Reader and keeps them open until the "Hit Enter" moment, when they are closed along with the Word file.

This is the point where I need to replace DoVerb() with some code that saves the OLE object to a PDF file.

At this point s contains the file I need, but I can't find a way to save it as a file instead of only opening it.

Please help me; I have been reading articles for many hours and haven't found the answer.

I found a workaround on the python-win32 mailing list, thanks to Chris Else. As someone said in a comment, the .bin file can't be transformed into a PDF. The code that Chris sent me was:

import olefile
from zipfile import ZipFile
from glob import glob

# How many PDF documents have we saved
pdf_count = 0

# Loop through all the .docx files in the current folder
for filename in glob("*.docx"):
  try:
    # Try to open the document as ZIP file
    with ZipFile(filename, "r") as archive:

      # Find files in the word/embeddings folder of the ZIP file
      for entry in archive.infolist():
        if not entry.filename.startswith("word/embeddings/"):
          continue

        # Try to open the embedded OLE file
        with archive.open(entry.filename) as f:
          if not olefile.isOleFile(f):
            continue

          ole = olefile.OleFileIO(f)

          # CLSID for Adobe Acrobat Document
          if ole.root.clsid != "B801CA65-A1FC-11D0-85AD-444553540000":
            continue

          if not ole.exists("CONTENTS"):
            continue

          # Extract the PDF from the OLE file
          pdf_data = ole.openstream('CONTENTS').read()

          # Does the embedded file have a %PDF- header?
          if pdf_data[0:5] == b'%PDF-':
            pdf_count += 1

            pdf_filename = "Document %d.pdf" % pdf_count

            # Save the PDF
            with open(pdf_filename, "wb") as output_file:
              output_file.write(pdf_data)

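  except Exception as exc:
    # A bare "except:" would also swallow Ctrl+C; report the actual reason instead
    print("Unable to open '%s': %s" % (filename, exc))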

print("Extracted %d PDF documents" % pdf_count)
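
If it is unclear why a given .bin file is skipped or accepted, it may help to look at what sits under word/embeddings/ before extracting anything. The following is only a minimal sketch using the standard library: the function name describe_embeddings and the labels "ole", "pdf" and "other" are made up for illustration and are not part of Chris's code. It assumes that an embedded object starts with the OLE compound file signature (bytes D0 CF 11 E0 A1 B1 1A E1), which is why renaming the .bin to .pdf does not work: the PDF bytes live inside a stream of that container.

```python
import zipfile

# Signature at the start of every OLE compound file
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Header at the start of a PDF file
PDF_MAGIC = b"%PDF-"


def describe_embeddings(docx_path):
    """Return (entry name, kind) pairs for each file under word/embeddings/.

    kind is a hypothetical label: "ole", "pdf" or "other".
    """
    results = []
    # Mode "r" only reads the archive, so the .docx is never modified
    with zipfile.ZipFile(docx_path, "r") as archive:
        for entry in archive.infolist():
            if entry.is_dir():
                continue
            if not entry.filename.startswith("word/embeddings/"):
                continue
            # Only the first 8 bytes are needed to recognise the format
            with archive.open(entry) as f:
                header = f.read(8)
            if header.startswith(OLE_MAGIC):
                kind = "ole"
            elif header.startswith(PDF_MAGIC):
                kind = "pdf"
            else:
                kind = "other"
            results.append((entry.filename, kind))
    return results
```

Calling describe_embeddings("docx_con_pdfs.docx") on one of the documents should show each embedded object as "ole"; those are the entries that the olefile-based script then opens to read the CONTENTS stream.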

