
Python: Access embedded OLE from Office/Excel document without clipboard

I want to add and extract files from an Office/Excel document using Python. So far adding things is easy, but for extracting I haven't found a clean solution.

To make clear what I have and what I don't, I've written the small example test.py below and explain it further.

test.py

import win32com.client as win32
import os 
from tkinter import messagebox
import win32clipboard

# (0) Setup
dir_path = os.path.dirname(os.path.realpath(__file__))
print(dir_path)
excel = win32.gencache.EnsureDispatch('Excel.Application')
wb = excel.Workbooks.Open(dir_path + "\\" + "test_excel.xlsx")
ws = wb.Worksheets.Item(1)
objs = ws.OLEObjects()

# (1) Embed file
f = dir_path + "\\" + "test_txt.txt"
name = "test_txt_ole.txt"
objs.Add( Filename=f, IconLabel=name )

# (2) Access embedded file
obj = objs.Item(1) # Get single OLE from OLE list
obj.Copy()
win32clipboard.OpenClipboard()
data = win32clipboard.GetClipboardData(0xC004) # Binary access
win32clipboard.EmptyClipboard()
win32clipboard.CloseClipboard()
messagebox.showinfo(title="test_txt_ole.txt", message=str(data))

# (3) Press "Don't Save" here to keep the workbook unchanged
# wb.Close() # Will close excel document and leave excel opened.
excel.Application.Quit() # Will close excel with all opened documents

For preparation (step 0) it opens a given Excel document with one worksheet, which was created beforehand using the new-document button in Excel.

In step (1) it uses the API to embed a given text file into the Excel document. The text file was created beforehand with the content "TEST123" using a text editor.

Afterwards, in step (2), it tries to read the content back from the embedded OLE object via the clipboard and opens a message box that shows that content.

Finally, in step (3), the program closes the opened document. To keep the setup unchanged, press "No" here.

The big disadvantage of this solution is its use of the clipboard, which destroys any user content already in the clipboard; that is bad style in a production environment. Furthermore, it uses an undocumented clipboard format.

A better solution would be to save the OLE object or the embedded file to a Python data container or to a file of my choice. In my example I've used a TXT file so that the file data is easy to identify. Eventually I'll use ZIP for an all-in-one solution, but a TXT file solution would be sufficient for base64 data.

Source of 0xC004 = 49156: https://danny.fyi/embedding-and-accessing-a-file-in-excel-with-vba-and-ole-objects-4d4e7863cfff

This VBA example looks interesting, but I have no clue about VBA: Saving embedded OLE Object (Excel workbook) to file in Excel 2010

Well, I find Parfait's solution a bit hackish (in the bad sense) because

  • it assumes that Excel will save the embedding as a temporary file,
  • it assumes that the path of this temporary file is always the user's default temp path,
  • it assumes that you will have privileges to open files there,
  • it assumes that you use a naming convention to identify your objects (e.g. 'test_txt' is always found in the name, so you can't insert an object 'account_data'),
  • it assumes that this convention is not disturbed by the operating system (e.g. it will not change the name to '~test_tx(1)' to save character length),
  • it assumes that this convention is known and accepted by all other programs on the computer (no one else will use names that contain 'test_txt').

So, I wrote an alternative solution. The essence of it is the following:

  1. unzip the .xlsx file (or any other Office file in the new XML-based format that is not password protected) to a temporary path.

  2. iterate through all .bin files inside '/xxx/embeddings' (where 'xxx' is 'xl', 'word' or 'ppt'), and create a dictionary that has the .bin files' temporary paths as keys and the dictionaries returned from step 3 as values.

  3. extract information from the .bin file according to the (not very well documented) Ole Packager format, and return the information as a dictionary. (This retrieves the raw binary data as 'contents', not only for .txt but for any file type, e.g. .png.)

I'm still learning Python, so this is not perfect (no error checking, no performance optimization), but you can get the idea from it. I tested it on a few examples. Here is my code:

import tempfile
import os
import shutil
import zipfile
import glob
import pythoncom
import win32com.storagecon


def read_zipped_xml_bin_embeddings( path_zipped_xml ):
    temp_dir = tempfile.mkdtemp()

    # Office Open XML files are ZIP archives; unpack everything to a temporary folder
    with zipfile.ZipFile( path_zipped_xml ) as zip_file:
        zip_file.extractall( temp_dir )

    # Folder inside the archive that holds the document parts, chosen by file extension
    subdir = {
            '.xlsx': 'xl',
            '.xlsm': 'xl',
            '.xltx': 'xl',
            '.xltm': 'xl',
            '.docx': 'word',
            '.dotx': 'word',
            '.docm': 'word',
            '.dotm': 'word',
            '.pptx': 'ppt',
            '.pptm': 'ppt',
            '.potx': 'ppt',
            '.potm': 'ppt',
        }[ os.path.splitext( path_zipped_xml )[ 1 ].lower() ]
    embeddings_dir = os.path.join( temp_dir, subdir, 'embeddings', '*.bin' )

    result = {}
    for bin_file in glob.glob( embeddings_dir ):
        result[ bin_file ] = bin_embedding_to_dictionary( bin_file )

    shutil.rmtree( temp_dir )

    return result


def read_ansi_string( stream ):
    # stream.Read returns bytes in Python 3, so compare against bytes and decode at the end
    chars = b''
    while True:
        ch = stream.Read( 1 )
        if ch in ( b'', b'\0' ): # stop at the null terminator or at the end of the stream
            break
        chars += ch
    return chars.decode( 'mbcs' ) # 'mbcs' is the current Windows ANSI code page


def bin_embedding_to_dictionary( bin_file ):
    storage = pythoncom.StgOpenStorage( bin_file, None, win32com.storagecon.STGM_READ | win32com.storagecon.STGM_SHARE_EXCLUSIVE )
    for stastg in storage.EnumElements():
        if stastg[ 0 ] == '\1Ole10Native':
            stream = storage.OpenStream( stastg[ 0 ], None, win32com.storagecon.STGM_READ | win32com.storagecon.STGM_SHARE_EXCLUSIVE )

            result = {}
            stream.Seek( 6, 0 ) # original filename in ANSI starts at byte 7 and is null terminated
            result[ 'original_filename' ] = read_ansi_string( stream )
            result[ 'original_filepath' ] = read_ansi_string( stream ) # original filepath in ANSI is next and is null terminated

            stream.Seek( 4, 1 ) # next 4 bytes are unused

            # size of the temporary file path in ANSI, 4 bytes in little endian
            temporary_filepath_size = int.from_bytes( stream.Read( 4 ), 'little' )
            # temporary file path in ANSI, without any trailing null terminator
            result[ 'temporary_filepath' ] = stream.Read( temporary_filepath_size ).rstrip( b'\0' ).decode( 'mbcs' )

            result[ 'size' ] = int.from_bytes( stream.Read( 4 ), 'little' ) # size of the contents, 4 bytes in little endian
            result[ 'contents' ] = stream.Read( result[ 'size' ] ) # contents as raw bytes

            return result

You can use it like this:

objects = read_zipped_xml_bin_embeddings( dir_path + '\\test_excel.xlsx' )
obj = list( objects.values() )[ 0 ] # Get first element, or iterate somehow, the keys are the temporary paths
print( 'Original filename:', obj[ 'original_filename' ] )
print( 'Original filepath:', obj[ 'original_filepath' ] )
print( 'Temporary filepath:', obj[ 'temporary_filepath' ] )
print( 'Contents:', obj[ 'contents' ] )
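If the raw bytes of the Ole10Native stream are already in memory, the layout described in step 3 can also be walked with plain slicing and struct, without COM. The following is only a sketch under that layout (6 header bytes, two null-terminated names, 4 unused bytes, a length-prefixed temporary path, then length-prefixed contents). The function name parse_ole10native, the sample payload, and the latin-1 decoding are my own assumptions for illustration; real files use the ANSI code page of the machine that created them.

```python
import struct


def parse_ole10native(data: bytes) -> dict:
    """Parse the raw bytes of an Ole10Native stream (layout as described in step 3)."""

    def read_cstring(pos):
        # Return the null-terminated byte string at pos and the offset just past its terminator.
        end = data.index(b"\0", pos)
        return data[pos:end], end + 1

    pos = 6                                    # bytes 0-5 are skipped, as in the code above
    filename, pos = read_cstring(pos)          # original filename
    filepath, pos = read_cstring(pos)          # original file path
    pos += 4                                   # 4 unused bytes
    (temp_len,) = struct.unpack_from("<I", data, pos)   # little-endian 32-bit length
    pos += 4
    temp_path = data[pos:pos + temp_len].rstrip(b"\0")
    pos += temp_len
    (size,) = struct.unpack_from("<I", data, pos)
    pos += 4
    return {
        # latin-1 is an assumption: it never fails, but may not be the right code page
        "original_filename": filename.decode("latin-1"),
        "original_filepath": filepath.decode("latin-1"),
        "temporary_filepath": temp_path.decode("latin-1"),
        "size": size,
        "contents": data[pos:pos + size],
    }


# Hand-built sample payload that follows the same layout (not taken from a real file).
temp = b"C:\\Temp\\test_txt.txt\0"
sample = (
    b"\x00" * 6
    + b"test_txt.txt\0"
    + b"C:\\dir\\test_txt.txt\0"
    + b"\x00" * 4
    + struct.pack("<I", len(temp)) + temp
    + struct.pack("<I", 7) + b"TEST123"
)
info = parse_ole10native(sample)
print(info["original_filename"], info["contents"])
```

Working on a bytes object keeps the parsing logic testable on any platform; only obtaining the stream from the .bin compound file still needs a storage reader such as the pythoncom calls above.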

Consider using the Windows temp directory, which temporarily stores the OLE object's source file when it is embedded in a workbook. This solution uses no clipboard, only physical files.

With this approach, you will need to retrieve the current user's name and iterate through all files of the temp directory: C:\Documents and Settings\{username}\Local Settings\Temp (standard Excel dump folder for Windows Vista/7/8/10). Also, a conditional like-name search with in is used that looks for the original file's basename, because multiple versions with number suffixes (1), (2), (3), ... may exist depending on how many times the script runs. You could even try a regex search here.

Finally, the routine below uses a try...except...finally block to cleanly exit the Excel objects regardless of errors, while still printing any exception message. Do note this is a Windows-only solution using a text file.

import win32com.client as win32
import os, shutil
from tkinter import messagebox

# (0) Setup
dir_path = cd = os.path.dirname(os.path.abspath(__file__))
print(dir_path)

try:
    excel = win32.gencache.EnsureDispatch('Excel.Application')    
    wb = excel.Workbooks.Open(os.path.join(dir_path, "test_excel.xlsx"))
    ws = wb.Worksheets(1)
    objs = ws.OLEObjects()

    # (1) Embed file
    f = os.path.join(dir_path, "test_txt.txt")    
    name = "test_txt_ole.txt"
    objs.Add(Filename=f, IconLabel=name).Name = 'Test'

    # (2) Open file from temporary folder
    ole = ws.OLEObjects(1)        
    ole.Activate()

    # (3) Grab the recent like-named file
    user = os.environ.get('USERNAME')
    outfile = os.path.join(dir_path, "test_txt_out.txt")

    tempfolder = r"C:\Documents and Settings\{}\Local Settings\Temp".format(user)

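    tempfile = None
    for subdir, dirs, files in os.walk(tempfolder):
        for file in sorted(files, reverse=True):
            if 'test_txt' in file:
                # join with the folder the file was actually found in
                tempfile = os.path.join(subdir, file)
                break
        if tempfile:
            break                     # stop walking once a match is found

    shutil.copyfile(tempfile, outfile)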

    # (4) Read text content
    with open(outfile, 'r') as f:        
        content = f.readlines()

    # (5) Output message with content
    messagebox.showinfo(title="test_txt_ole.txt", message="".join(content))

except Exception as e:
    print(e)

finally:
    wb.Close(True)      # CLOSES AND SAVES WORKBOOK
    excel.Quit()        # QUITS EXCEL APP

    # RELEASES COM RESOURCES
    ws = None; wb = None; objs = None; ole = None; excel = None

Tkinter Messagebox

[Screenshot: message output]
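The regex search suggested above could look roughly like the sketch below. It matches only the exact original name, optionally followed by a "(n)" suffix before the extension, and returns the newest match first. The helper name find_temp_copies and the exact suffix form (with or without a space before the parenthesis) are assumptions on my part, so check them against the names Excel really writes on your system.

```python
import os
import re
import tempfile


def find_temp_copies(folder, original_name):
    """Return paths in folder named like original_name, optionally with a '(n)' suffix, newest first."""
    stem, ext = os.path.splitext(original_name)
    pattern = re.compile(
        r"^" + re.escape(stem) + r"(?: ?\(\d+\))?" + re.escape(ext) + r"$",
        re.IGNORECASE,
    )
    matches = [
        entry.path
        for entry in os.scandir(folder)
        if entry.is_file() and pattern.match(entry.name)
    ]
    # Sort by modification time so the most recently written copy comes first.
    return sorted(matches, key=os.path.getmtime, reverse=True)


# Small demonstration in a throwaway folder with fixed modification times.
with tempfile.TemporaryDirectory() as demo_dir:
    names = ["test_txt.txt", "test_txt (2).txt", "my_test_txt_backup.txt"]
    for i, name in enumerate(names):
        path = os.path.join(demo_dir, name)
        with open(path, "w") as fh:
            fh.write("TEST123")
        os.utime(path, (1_600_000_000 + i, 1_600_000_000 + i))
    found = [os.path.basename(p) for p in find_temp_copies(demo_dir, "test_txt.txt")]
print(found)
```

Unlike the plain in test, this does not pick up unrelated files that merely contain 'test_txt' somewhere in their name, although it still relies on the temp-file behaviour described above.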

I built a Python module to do exactly this; check it out here: https://pypi.org/project/AttachmentsExtractor/. The module can also be run on any OS.

After installing the library, use the following code snippet:

from AttachmentsExtractor import extractor

abs_path_to_file='Please provide absolute path here '
path_to_destination_directory = 'Please provide path of the directory where the extracted attachments should be stored'
extractor.extract(abs_path_to_file,path_to_destination_directory) # returns true if one or more attachments are found else returns false.

I've recently attempted to answer a similar question: can I extract embedded Word documents from an Excel file and save them to disk?

Adapting the answers on this page (and making use of the knowledge that an Excel file is a zipped collection of files, mainly XML), this can be done easily:

  1. Create a temporary folder.
  2. Extract all contents of the Excel file into the temporary folder.
  3. Find all embedded files.
  4. Move the embedded files to a permanent folder of your choice.

Here is a snippet which does the above:

import zipfile
import tempfile
import os
import glob
import shutil
import sys

def extract_embedded_files(file_path,
                           save_path,
                           sub_dir='xl'):
    """
    Extracts embedded files from Excel documents, it takes advantage of
    excel being a zipped collection of files. It creates a temporary folder,
    extracts all the contents of the excel folder there and then moves the
    embedded files to the requested save_path.

    Parameters:
    ----------
    file_path : str, 
        The path to the excel file to extract embedded files from.
    
    save_path : str,
        Path to save the extracted files to.

    sub_dir : str,
        one of 'xl' (for excel), 'word' , or 'ppt'. 
    """

    # make a temporary directory that is deleted again when the with-block ends
    with tempfile.TemporaryDirectory() as temp_dir:

        # extract contents of the excel file to the temporary dir
        with zipfile.ZipFile(file_path) as zip_file:
            zip_file.extractall(temp_dir)

        # find all embedded files and copy them to save_path
        embeddings_dir = os.path.join(temp_dir, sub_dir, 'embeddings')
        for file in glob.glob(os.path.join(embeddings_dir, '*')):
            shutil.copy(file, save_path)
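Since only the files under the embeddings folder are of interest, the temporary folder can also be skipped by reading those archive members straight into memory. This is a minimal sketch of that variation using only the standard library; the function name read_embeddings and the in-memory demo archive are made up for illustration, and the returned bytes are still the raw embedded parts (for packaged files, the .bin OLE containers), not the unwrapped originals.

```python
import io
import posixpath
import zipfile


def read_embeddings(office_file, sub_dir="xl"):
    """Return {file name: raw bytes} for every member under <sub_dir>/embeddings/."""
    prefix = f"{sub_dir}/embeddings/"          # ZIP member names always use forward slashes
    with zipfile.ZipFile(office_file) as archive:
        return {
            posixpath.basename(name): archive.read(name)
            for name in archive.namelist()
            if name.startswith(prefix) and not name.endswith("/")
        }


# Demonstration with a tiny in-memory archive that mimics the folder layout.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("xl/workbook.xml", "<workbook/>")
    archive.writestr("xl/embeddings/oleObject1.bin", b"\x00fake-ole-data")
buffer.seek(0)
embedded = read_embeddings(buffer)
print(sorted(embedded))
```

Pass sub_dir='word' or sub_dir='ppt' for Word or PowerPoint files, as in the function above.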
