How to extract a plain text file from Word?

I have a Word file (.docx) with some embedded plain text files. How can I extract them along with their file names?

I have searched, and there are some ideas.

  1. Using VBA, which I'm not good at:
Sub ExtractAndSaveEmbeddedFiles()
  Dim objEmbeddedShape As InlineShape
  Dim strShapeType As String, strEmbeddedDocName As String
  Dim objEmbeddedDoc As Object

  With ActiveDocument
    For Each objEmbeddedShape In .InlineShapes

      ' Find and open the embedded doc.
      strShapeType = objEmbeddedShape.OLEFormat.ClassType
      'objEmbeddedShape.OLEFormat.Open

      ' An embedded plain text file does not expose .Object, so the next line fails.
      Set objEmbeddedDoc = objEmbeddedShape.OLEFormat.Object

      ' Save each embedded file under the name shown on its icon label.
      strEmbeddedDocName = objEmbeddedShape.OLEFormat.IconLabel
      objEmbeddedDoc.SaveAs "D:\ChromeDownload\test\" & strEmbeddedDocName
      objEmbeddedDoc.Close

      Set objEmbeddedDoc = Nothing

    Next objEmbeddedShape
  End With
End Sub

  2. Rename it to .zip: all embedded files are stored under word/embeddings, but with a .bin extension instead of .txt, and you cannot read them directly.
  3. POI: there is a class, ZipPackagePart, that can read the .bin file from #2, but I still don't know how to extract the plain text from it.
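The rename-to-zip idea can be sketched without renaming anything, since a .docx is already a ZIP archive. The following is a minimal sketch in Python (not a language used elsewhere in this thread), using only the standard `zipfile` module; the function name `extract_embedded_parts` and its parameters are hypothetical. It only copies the .bin parts out of the archive. They are still OLE containers, so getting at the plain text inside is a separate step.

```python
import zipfile
from pathlib import Path


def extract_embedded_parts(docx_path, out_dir):
    """Copy every part under word/embeddings/ out of a .docx (a ZIP archive).

    Returns the list of files written. The parts are still OLE containers
    (.bin), not plain text; unwrapping them is a separate step.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Skip directory entries; keep only files in the embeddings folder.
            if name.startswith("word/embeddings/") and not name.endswith("/"):
                target = out_dir / Path(name).name
                target.write_bytes(archive.read(name))
                written.append(target)
    return written
```

Calling `extract_embedded_parts("report.docx", "out")` (both paths are placeholders) would leave files such as `out/oleObject1.bin`, which is the same material you see after renaming the document to .zip by hand.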

Is there any way to extract the plain text files embedded in a Word document?

I'm actually quite proud of this one:

    Option Explicit
    Sub ExtractFromMSWordEmbed()
    
        Dim FSO As Object           'File System Object
        Dim FileDir As Variant      'Original File Directory
        Dim FileTemp As Variant     'Temp file name; changes to follow the file's progression
        Dim oFile As Object         'Each embedded file
        Dim oFolder As Object       'Folder of embedded files
        Dim FileIndex As Integer    '.txt file reference number
        Dim MSWordTEXT As String    'Text from embedded file
        
        Set FSO = CreateObject("scripting.filesystemobject")
        
        ' > Here you specify the docx file you want to extract embedded files from, 
        '   Filetemp should be in the same folder, it is what we're going to name the
        '   copy of your target file
        FileDir = "C:\Users\ccritchlow\Documents\Text Embed and Extract.docx"
        FileTemp = "C:\Users\ccritchlow\Documents\TempExtract.docx"
        
        ' >>> Create containing folder for zip contents
        '     (FolderExists also handles a folder that exists but is empty)
        If Not FSO.FolderExists(Replace(FileTemp, ".docx", "\")) Then
            MkDir Replace(FileTemp, ".docx", "\")
        End If
        
        ' >>> Copy file and change to .zip
        With FSO
            .CopyFile FileDir, FileTemp
            .movefile FileTemp, Replace(FileTemp, ".docx", ".zip")
            FileTemp = Replace(FileTemp, ".docx", ".zip")
            Call UnZipFile(FileTemp, Replace(FileTemp, ".zip", "\"))
            .DeleteFile FileTemp
            FileTemp = Replace(FileTemp, ".zip", "\word\embeddings")
            Set oFolder = .GetFolder(FileTemp)
            For Each oFile In oFolder.Files
            
                    ' *** \/ \/ \/ here is your file text. *** '
                Debug.Print ExtractFromMSWord(oFile.Path)
                    ' *** /\ /\ /\ here is your file text, do with it what  you will. *** '
                    
            Next oFile
        End With
        
    End Sub
    Sub UnZipFile(sZipDir As Variant, sUnZipTo As Variant)
    
        Dim ShellApp As Object
        Set ShellApp = CreateObject("Shell.Application")
        
        ShellApp.Namespace(sUnZipTo).CopyHere ShellApp.Namespace(sZipDir).Items
        
    End Sub
    Function ExtractFromMSWord(DocxDir As String) As Variant
    
        Dim Doc As Document
        Set Doc = Documents.Open(DocxDir)
        
        ExtractFromMSWord = Doc.Content.Text
        Doc.Close
        
    End Function

Make sure to add references to:

  • MS Word 16.0
  • Shell Controls (Microsoft Shell Controls And Automation)

Assuming the objects you are finding actually have ClassType "Package", they are almost certainly OLE (Object Linking and Embedding) objects. Specifically, in the case of a "Package", the text file is encoded inside an "OLE1" format object (OLE1 is a very old version of OLE), which in turn is embedded inside an OLE2 object that is encoded in a format called CFB (Compound File Binary). That's a hard format to work with from VB(A). There's an example of how to do it using C# here.

NB: for short text files, you would typically be able to open the .bin CFB, find the text near the bottom, and copy/paste it elsewhere. But for longer files, e.g. longer than 512 bytes, which is the length of a standard CFB sector, the file will be split over more than one sector and you might have to work rather harder than that.
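To make the sector remark concrete, here is a minimal sketch in Python (standard library only, not part of the original answer) that checks whether a .bin file starts with the CFB signature and reads the sector size from its header. The function names `cfb_sector_size` and `sectors_needed` are hypothetical. It relies on the header layout in the published CFB specification: an 8-byte signature, then a little-endian 16-bit "sector shift" at byte offset 30, with the sector size being 2 to that power. It does not walk the sector chains, and it ignores the separate mini stream that CFB uses for small streams, so treat it as an inspection aid rather than an extractor.

```python
import struct

# The 8-byte magic number at the start of every CFB (OLE2) file.
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def cfb_sector_size(header):
    """Return the sector size declared in a CFB header, or None if not CFB.

    `header` must hold at least the first 32 bytes of the .bin file.
    """
    if len(header) < 32 or header[:8] != CFB_SIGNATURE:
        return None
    # The sector shift is a little-endian uint16 at byte offset 30.
    (sector_shift,) = struct.unpack_from("<H", header, 30)
    return 1 << sector_shift


def sectors_needed(payload_length, sector_size):
    """Number of sectors a payload of this length spans (ceiling division)."""
    return -(-payload_length // sector_size)
```

With a 512-byte sector size, `sectors_needed(513, 512)` gives 2, which is the point at which copying the text out by hand stops being a single contiguous selection.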

So, to avoid all that, it seems to be possible to copy the relevant object to the clipboard, then use the Windows Shell to paste it into a folder, at which point the clipboard helpfully seems to strip the OLE wrappers off. There are lots of examples both on SO and "out there", e.g. here. Of course it's a kludge, but it does actually seem to work OK with the test text files I have here.

To try it, you will need to create or choose two folders. One is a temp folder into which the files are pasted from the clipboard. The example I use is "c:\temp". Please delete everything from it before running this code.

The second stores the renamed output files. I have called mine c:\target.

You will also need to add a reference (VB Editor: Tools->References) to the Microsoft Shell Controls And Automation library.

Then you could use code along the following lines:

Sub ExtractAndSaveEmbeddedFiles()

  ' The OLE ClassType we're looking for
  Const OLEClassType As String = "Package"

  ' These strings actually have to be Variants
  ' for the Shell calls to work
  Const vFolderTemp As Variant = "c:\temp\"
  Const vFolderTarget As Variant = "c:\target\"
  Const vVerbPaste As Variant = "Paste"

  Dim i As Long
  
  Dim objEmbeddedShape As InlineShape
  Dim objFolderTemp As Shell32.Folder
  Dim objFolderTarget As Shell32.Folder
  Dim objShell As Shell32.Shell
  Dim objShellFolderItem As Shell32.ShellFolderItem
  Dim objTempItem As Shell32.FolderItem
  
  i = 0
  
  ' Set up various Shell objects
  Set objShell = New Shell32.Shell
  Set objFolderTemp = objShell.Namespace(vFolderTemp)
  Set objShellFolderItem = objShell.Namespace(vFolderTemp).Self
  Set objFolderTarget = objShell.Namespace(vFolderTarget)
  
  With ActiveDocument
    For Each objEmbeddedShape In .InlineShapes
      If objEmbeddedShape.OLEFormat.ClassType = OLEClassType Then
        
        ' Copy the object to the Clipboard
        objEmbeddedShape.Range.Copy
        
        ' Extract to the temp folder. I don't see a reliable way either
        ' to get the name that the Paste operation will use
        ' (OLEFormat.IconLabel etc. do not do anything useful here)
        ' or set it, although it would be great if .InvokeVerbEx could do it
        objShellFolderItem.InvokeVerb vVerbPaste
        
        ' Change the name to something unique (and perhaps more useful)
        ' We can't use a numeric index into the .Folder's items and even
        ' if we could use the name, we don't know it. So iterate and
        ' (optional) exit when we have dealt with the one item
        For Each objTempItem In objFolderTemp.Items
          ' We can change the name, but we can't move the file
          ' by changing the path/name
          i = i + 1
          objTempItem.Name = "text object " & CStr(i) & ".txt"
          
          ' now use the target folder object to move the file
          ' These don't appear to *have* to be variants but...
          ' See https://docs.microsoft.com/en-us/windows/win32/shell/folder-movehere
          ' for the cvar(20) parameter
          objFolderTarget.MoveHere CVar(objTempItem), CVar(20)
          Exit For
        Next objTempItem
      End If
    Next objEmbeddedShape
  End With
End Sub
