Python: Access embedded OLE from Office/Excel document without clipboard
I want to add and extract files from an Office/Excel document using Python. So far, adding things is easy, but for extracting I haven't found a clean solution.
To make clear what I have and what I don't, I've written the small example test.py below and explain it further.
test.py
import win32com.client as win32
import os
from tkinter import messagebox
import win32clipboard
# (0) Setup
dir_path = os.path.dirname(os.path.realpath(__file__))
print(dir_path)
excel = win32.gencache.EnsureDispatch('Excel.Application')
wb = excel.Workbooks.Open(dir_path + "\\" + "test_excel.xlsx")
ws = wb.Worksheets.Item(1)
objs = ws.OLEObjects()
# (1) Embed file
f = dir_path + "\\" + "test_txt.txt"
name = "test_txt_ole.txt"
objs.Add( Filename=f, IconLabel=name )
# (2) Access embedded file
obj = objs.Item(1) # Get single OLE from OLE list
obj.Copy()
win32clipboard.OpenClipboard()
data = win32clipboard.GetClipboardData(0xC004) # Binary access
win32clipboard.EmptyClipboard()
win32clipboard.CloseClipboard()
messagebox.showinfo(title="test_txt_ole.txt", message=str(data))
# (3) Press don't save here to keep
# wb.Close() # Will close excel document and leave excel opened.
excel.Application.Quit() # Will close excel with all opened documents
For preparation (step 0) it opens a given Excel document with one worksheet, which was created beforehand using the new-document button in Excel.
In step (1) it uses the API to embed a given text file in the Excel document. The text file was created beforehand with the content "TEST123" using a text editor.
Afterwards, in step (2), it tries to read the content back from the embedded OLE object via the clipboard and opens a message box that shows the OLE content taken from the clipboard.
Finally, in step (3), the program closes the opened document. To keep the setup unchanged, press "No" here.
The big disadvantage of this solution is its use of the clipboard, which overwrites any user content stored there; that is bad style in a production environment. Further, it uses an undocumented clipboard format.
A better solution would be to save the OLE object or the OLE-embedded file to a Python data container or to a file of my choice. In my example I've used a TXT file to easily identify the file data. In the end I'll use ZIP for an all-in-one solution, but a TXT file solution would be sufficient for base64 data.
Source of 0xC004 = 49156: https://danny.fyi/embedding-and-accessing-a-file-in-excel-with-vba-and-ole-objects-4d4e7863cfff
This VBA example looks interesting, but I have no clue about VBA: Saving embedded OLE Object (Excel workbook) to file in Excel 2010
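To illustrate the remark about ZIP and base64, here is a minimal sketch of how several files could be packed into one in-memory ZIP and turned into plain text that fits in an embedded TXT file. The helper names pack_to_base64_text and unpack_from_base64_text are made up for this illustration and are not part of any library.

```python
import base64
import io
import zipfile


def pack_to_base64_text(files):
    """Zip a {name: bytes} mapping in memory and return it as ASCII text."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, payload in files.items():
            archive.writestr(name, payload)
    return base64.b64encode(buffer.getvalue()).decode("ascii")


def unpack_from_base64_text(text):
    """Reverse of pack_to_base64_text: return a {name: bytes} mapping."""
    raw = base64.b64decode(text)
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}


# Round trip with the same content as the test file from the question
text = pack_to_base64_text({"test_txt.txt": b"TEST123"})
print(unpack_from_base64_text(text))
```

The text returned by pack_to_base64_text could be written to a TXT file and embedded as in step (1); getting it back out of the workbook is still the open part of the question.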
Well, I find Parfait's solution a bit hackish (in the bad sense).
So, I wrote an alternative solution. The essence of it is the following:
1. Unzip the .xlsx file (or any other Office file in the new XML-based format that is not password protected) to a temporary path.
2. Iterate through all .bin files inside '/xxx/embeddings' ('xxx' = 'xl' or 'word' or 'ppt'), and create a dictionary that contains the .bin files' temporary paths as keys and the dictionaries returned from step 3 as values.
3. Extract information from the .bin file according to the (not very well documented) Ole Packager format, and return the information as a dictionary. (This retrieves the raw binary data as 'contents', not only from .txt but from any file type, e.g. .png.)
I'm still learning Python, so this is not perfect (no error checking, no performance optimization), but you can get the idea from it. I tested it on a few examples. Here is my code:
import tempfile
import os
import shutil
import zipfile
import glob
import pythoncom
import win32com.storagecon

def read_zipped_xml_bin_embeddings( path_zipped_xml ):
    # Map the file extension to the top-level folder inside the archive
    subdir = {
        '.xlsx': 'xl',
        '.xlsm': 'xl',
        '.xltx': 'xl',
        '.xltm': 'xl',
        '.docx': 'word',
        '.dotx': 'word',
        '.docm': 'word',
        '.dotm': 'word',
        '.pptx': 'ppt',
        '.pptm': 'ppt',
        '.potx': 'ppt',
        '.potm': 'ppt',
    }[ os.path.splitext( path_zipped_xml )[ 1 ].lower() ]
    temp_dir = tempfile.mkdtemp()
    try:
        with zipfile.ZipFile( path_zipped_xml ) as zip_file:
            zip_file.extractall( temp_dir )
        embeddings_pattern = os.path.join( temp_dir, subdir, 'embeddings', '*.bin' )
        result = {}
        for bin_file in glob.glob( embeddings_pattern ):
            result[ bin_file ] = bin_embedding_to_dictionary( bin_file )
        return result
    finally:
        shutil.rmtree( temp_dir ) # always remove the temporary copy

def read_ansi_string( stream ):
    # Read a null-terminated ANSI string; Read() returns bytes in Python 3
    raw = b''
    while True:
        ch = stream.Read( 1 )
        if ch in ( b'', b'\0' ): # stop at the terminator or at the end of the stream
            break
        raw += ch
    return raw.decode( 'mbcs' ) # 'mbcs' is the Windows ANSI code page

def read_uint32( stream ):
    # Read a 4-byte unsigned integer in little endian
    return int.from_bytes( stream.Read( 4 ), 'little' )

def bin_embedding_to_dictionary( bin_file ):
    mode = win32com.storagecon.STGM_READ | win32com.storagecon.STGM_SHARE_EXCLUSIVE
    storage = pythoncom.StgOpenStorage( bin_file, None, mode )
    for stastg in storage.EnumElements():
        if stastg[ 0 ] == '\1Ole10Native':
            stream = storage.OpenStream( stastg[ 0 ], None, mode )
            result = {}
            stream.Seek( 6, 0 ) # original filename in ANSI starts at byte 7 and is null terminated
            result[ 'original_filename' ] = read_ansi_string( stream )
            result[ 'original_filepath' ] = read_ansi_string( stream ) # original filepath in ANSI is next and is null terminated
            stream.Seek( 4, 1 ) # next 4 bytes are unused
            temporary_filepath_size = read_uint32( stream ) # size of the temporary file path in ANSI
            result[ 'temporary_filepath' ] = stream.Read( temporary_filepath_size ).decode( 'mbcs' ).rstrip( '\0' )
            result[ 'size' ] = read_uint32( stream ) # size of the contents
            result[ 'contents' ] = stream.Read( result[ 'size' ] ) # contents as raw bytes
            return result
    return None # no Ole10Native stream found in this .bin file
You can use it like this:
objects = read_zipped_xml_bin_embeddings( dir_path + '\\test_excel.xlsx' )
obj = list( objects.values() )[ 0 ] # Get first element, or iterate somehow, the keys are the temporary paths
print( 'Original filename: ' + obj[ 'original_filename' ] )
print( 'Original filepath: ' + obj[ 'original_filepath' ] )
print( 'Temporary filepath: ' + obj[ 'temporary_filepath' ] )
print( 'Contents:', obj[ 'contents' ] ) # contents are raw bytes
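The stream layout used above (filename from byte 7, then the file path, 4 unused bytes, a length-prefixed temporary path, and the length-prefixed contents) can also be parsed from a plain bytes buffer, which makes the logic easy to try without COM. The following is only a sketch under that assumed layout; the function name parse_ole10native and the cp1252 default encoding are my own choices for illustration.

```python
import struct


def parse_ole10native(data, encoding="cp1252"):
    """Parse an Ole10Native payload held in memory (layout as described above)."""

    def read_cstring(pos):
        # Null-terminated string starting at pos; returns (text, next position)
        end = data.index(b"\0", pos)
        return data[pos:end].decode(encoding), end + 1

    def read_uint32(pos):
        # 4-byte little-endian unsigned integer; returns (value, next position)
        (value,) = struct.unpack_from("<I", data, pos)
        return value, pos + 4

    pos = 6  # the original filename starts at byte 7
    filename, pos = read_cstring(pos)
    filepath, pos = read_cstring(pos)
    pos += 4  # 4 unused bytes
    temp_size, pos = read_uint32(pos)
    temp_path = data[pos:pos + temp_size].decode(encoding).rstrip("\0")
    pos += temp_size
    size, pos = read_uint32(pos)
    return {
        "original_filename": filename,
        "original_filepath": filepath,
        "temporary_filepath": temp_path,
        "size": size,
        "contents": data[pos:pos + size],
    }
```

If the whole stream has been read into memory first, the same dictionary as in the code above comes out of a single call, and the parsing step no longer depends on Windows.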
Consider using the Windows temp directory, which temporarily stores the OLE object's source file when it is embedded in a workbook. No clipboard is used in this solution, only physical files.
With this approach, you will need to retrieve the current user's name and iterate through all files of the temp directory: C:\Documents and Settings\{username}\Local Settings\Temp (standard Excel dump folder for Windows Vista/7/8/10). Also, a conditional like-name search with the in operator is used, matching any file name that contains the original file's basename, since multiple versions with number suffixes (1), (2), (3), ... may exist depending on how many times the script runs. You could even try a regex search here.
Finally, the routine below uses a try...except...finally block to cleanly exit the Excel objects regardless of errors, while still printing any exception message. Do note that this is a Windows-only solution using a text file.
import win32com.client as win32
import os, shutil
from tkinter import messagebox

# (0) Setup
dir_path = cd = os.path.dirname(os.path.abspath(__file__))
print(dir_path)

# Predefine the COM references so the finally block is safe even if setup fails
excel = wb = ws = objs = ole = None

try:
    excel = win32.gencache.EnsureDispatch('Excel.Application')
    wb = excel.Workbooks.Open(os.path.join(dir_path, "test_excel.xlsx"))
    ws = wb.Worksheets(1)
    objs = ws.OLEObjects()

    # (1) Embed file
    f = os.path.join(dir_path, "test_txt.txt")
    name = "test_txt_ole.txt"
    objs.Add(Filename=f, IconLabel=name).Name = 'Test'

    # (2) Open file from temporary folder
    ole = ws.OLEObjects(1)
    ole.Activate()

    # (3) Grab the recent like-named file
    user = os.environ.get('USERNAME')
    outfile = os.path.join(dir_path, "test_txt_out.txt")
    tempfolder = r"C:\Documents and Settings\{}\Local Settings\Temp".format(user)

    temp_path = None
    for subdir, dirs, files in os.walk(tempfolder):
        for file in sorted(files, reverse=True):
            if 'test_txt' in file:
                temp_path = os.path.join(subdir, file)
                break
        if temp_path is not None:
            break  # stop walking once a match is found
    if temp_path is None:
        raise FileNotFoundError("No file containing 'test_txt' found in " + tempfolder)

    shutil.copyfile(temp_path, outfile)

    # (4) Read text content
    with open(outfile, 'r') as f:
        content = f.readlines()

    # (5) Output message with content
    messagebox.showinfo(title="test_txt_ole.txt", message="".join(content))

except Exception as e:
    print(e)

finally:
    if wb is not None:
        wb.Close(True)      # CLOSES AND SAVES WORKBOOK
    if excel is not None:
        excel.Quit()        # QUITS EXCEL APP

    # RELEASES COM RESOURCES
    ws = None; wb = None; objs = None; ole = None; excel = None
Tkinter Messagebox
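As a sketch of the regex search suggested above, the like-name match could pick the copy with the highest number suffix instead of relying on sort order. The exact naming of the temporary copies is an assumption here (original basename, optionally followed by a number in parentheses), and pick_latest_copy is a made-up helper name.

```python
import os
import re
import tempfile


def pick_latest_copy(filenames, original_name):
    """Return the like-named file with the highest "(n)" suffix, or None."""
    stem, ext = os.path.splitext(original_name)
    # Matches "test_txt.txt", "test_txt (1).txt", "test_txt(2).txt", ...
    pattern = re.compile(
        r"^" + re.escape(stem) + r"(?:\s*\((\d+)\))?" + re.escape(ext) + r"$",
        re.IGNORECASE,
    )
    best_name, best_number = None, -1
    for name in filenames:
        match = pattern.match(name)
        if match:
            number = int(match.group(1) or 0)  # no suffix counts as 0
            if number > best_number:
                best_name, best_number = name, number
    return best_name


# tempfile.gettempdir() returns the current user's temp folder
print(pick_latest_copy(os.listdir(tempfile.gettempdir()), "test_txt.txt"))
```

Comparing the suffix numerically avoids the problem that a plain reverse sort of the names would rank "(9)" above "(10)". Note that os.listdir only looks at the top level of the folder, whereas the routine above walks subfolders as well.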
I built a Python module to do exactly this; check it out here: https://pypi.org/project/AttachmentsExtractor/ The module can also be run on any OS.
After installing the library, use the following code snippet:
from AttachmentsExtractor import extractor
abs_path_to_file='Please provide absolute path here '
path_to_destination_directory = 'Please provide path of the directory where the extracted attachments should be stored'
extractor.extract(abs_path_to_file,path_to_destination_directory) # returns true if one or more attachments are found else returns false.
I've recently attempted to answer a similar question: can I extract embedded Word documents from an Excel file and save them to disk?
Adapting the answers on this page (and making use of the knowledge that an Excel file is a zipped collection of files, mainly XML), this can be easily performed:
Here is a snippet which does the above:
import zipfile
import tempfile
import os
import glob
import shutil


def extract_embedded_files(file_path,
                           save_path,
                           sub_dir='xl'):
    """
    Extracts embedded files from Excel documents, it takes advantage of
    excel being a zipped collection of files. It creates a temporary folder,
    extracts all the contents of the excel folder there and then copies the
    embedded files to the requested save_path.

    Parameters:
    ----------
    file_path : str,
        The path to the excel file to extract embedded files from.
    save_path : str,
        Path to save the extracted files to.
    sub_dir : str,
        one of 'xl' (for excel), 'word' , or 'ppt'.
    """
    # make a temporary directory
    temp_dir = tempfile.mkdtemp()
    try:
        # extract contents of the excel file to temporary dir
        with zipfile.ZipFile(file_path) as zip_file:
            zip_file.extractall(temp_dir)

        # find all embedded files and copy to save_path
        embeddings_dir = os.path.join(temp_dir, sub_dir, 'embeddings')
        for file in glob.glob(os.path.join(embeddings_dir, '*')):
            shutil.copy(file, save_path)
    finally:
        # remove the temporary directory again
        shutil.rmtree(temp_dir)
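Since the document is just a zip archive, the temporary folder can probably be skipped altogether by reading the archive members directly. This is a minimal sketch of that variation, not the snippet above; read_embedded_files is a made-up name, and it returns the raw embedded files (for OLE packages, the .bin containers) as bytes.

```python
import io
import posixpath
import zipfile


def read_embedded_files(file_path_or_obj, sub_dir="xl"):
    """Return {file name: bytes} for everything under <sub_dir>/embeddings/."""
    prefix = f"{sub_dir}/embeddings/"
    with zipfile.ZipFile(file_path_or_obj) as archive:
        return {
            posixpath.basename(name): archive.read(name)
            for name in archive.namelist()
            if name.startswith(prefix) and not name.endswith("/")
        }


# Demo on an in-memory archive that mimics the layout of an .xlsx file
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("xl/workbook.xml", b"<workbook/>")
    archive.writestr("xl/embeddings/oleObject1.bin", b"raw OLE data")
print(read_embedded_files(demo))
```

Member names inside a zip archive always use forward slashes, so the prefix check works the same on every OS.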