
How to extract a plain text file from Word?

I have a Word file (.docx) that contains some embedded plain text files. How can I extract them together with their file names?

I searched around and found a few ideas.

  1. Use VBA, which I am not good at.
Sub ExtractAndSaveEmbeddedFiles()
  Dim objEmbeddedShape As InlineShape
  Dim strShapeType As String, strEmbeddedDocName As String
  Dim objEmbeddedDoc As Object

  With ActiveDocument
    For Each objEmbeddedShape In .InlineShapes

      ' Find and open the embedded doc.
      strShapeType = objEmbeddedShape.OLEFormat.ClassType
      'objEmbeddedShape.OLEFormat.Open

      ' Plain text file doesn't have Object method, it'll fail
      Set objEmbeddedDoc = objEmbeddedShape.OLEFormat.Object

      ' Save embedded files with names as same as those of icon label.
      strEmbeddedDocName = objEmbeddedShape.OLEFormat.IconLabel
      objEmbeddedDoc.SaveAs "D:\ChromeDownload\test\" & strEmbeddedDocName
      objEmbeddedDoc.Close

      Set objEmbeddedDoc = Nothing

    Next objEmbeddedShape
  End With
End Sub

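  2. Rename it to .zip. All embedded files are stored in word/embeddings, but with a .bin extension instead of .txt, so you cannot read them directly.
  3. POI: there is a class, ZipPackagePart, that can read the .bin files from #2, but I still don't know how to extract the plain text from them.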
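The rename-to-zip idea above can also be scripted outside Word. The following is a minimal sketch in Python (my choice, not something from the question, which only mentions VBA and POI) that uses the standard-library `zipfile` module to read every part stored under word/embeddings. The names `read_embedded_parts` and `make_sample_docx` are hypothetical, and the sample archive is a stand-in rather than a real Word document. Each blob returned is still an OLE container, not the plain text itself.

```python
import io
import zipfile


def read_embedded_parts(docx_source):
    """Return {part name: raw bytes} for every part under word/embeddings/.

    docx_source may be a path or a binary file-like object, since
    zipfile.ZipFile accepts either.
    """
    parts = {}
    with zipfile.ZipFile(docx_source) as archive:
        for name in archive.namelist():
            # A .docx is a zip archive; embedded objects live in this folder.
            if name.startswith("word/embeddings/") and not name.endswith("/"):
                parts[name] = archive.read(name)
    return parts


def make_sample_docx():
    """Build a tiny in-memory stand-in for a .docx (hypothetical content)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<w:document/>")
        archive.writestr("word/embeddings/oleObject1.bin", b"\xd0\xcf\x11\xe0 fake")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name, blob in read_embedded_parts(make_sample_docx()).items():
        print(name, len(blob), "bytes")
```

With a real document you would pass its path instead of the sample buffer. No renaming to .zip is needed, because `zipfile` does not care about the file extension.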

Is there any way to extract the plain text files embedded in a Word document?

I'm actually proud of this one:

    Option Explicit
    Sub ExtractFromMSWordEmbed()
    
        Dim FSO As Object           'File System Object
        Dim FileDir As Variant      'Original File Directory
        Dim FileTemp As Variant     'Temp file name, changes to follow file progression
        Dim oFile As Object         'Each embedded file
        Dim oFolder As Object       'Folder of embedded files
        Dim FileIndex As Integer    '.txt file reference number
        Dim MSWordTEXT As String    'Text from embedded file
        
        Set FSO = CreateObject("scripting.filesystemobject")
        
        ' > Here you specify the docx file you want to extract embedded files from, 
        '   Filetemp should be in the same folder, it is what we're going to name the
        '   copy of your target file
        FileDir = "C:\Users\ccritchlow\Documents\Text Embed and Extract.docx"
        FileTemp = "C:\Users\ccritchlow\Documents\TempExtract.docx"
        
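        ' >>> Create containing folder for zip contents
        '     (vbDirectory makes Dir test for the folder itself, so an
        '     existing but empty folder is not created a second time)
        If Dir(Replace(FileTemp, ".docx", "\"), vbDirectory) = "" Then
            MkDir Replace(FileTemp, ".docx", "\")
        End If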
        
        ' >>> Copy file and change to .zip
        With FSO
            .CopyFile FileDir, FileTemp
            .movefile FileTemp, Replace(FileTemp, ".docx", ".zip")
            FileTemp = Replace(FileTemp, ".docx", ".zip")
            Call UnZipFile(FileTemp, Replace(FileTemp, ".zip", "\"))
            .DeleteFile FileTemp
            FileTemp = Replace(FileTemp, ".zip", "\word\embeddings")
            Set oFolder = .GetFolder(FileTemp)
            For Each oFile In oFolder.Files
            
                    ' *** \/ \/ \/ here is your file text. *** '
                Debug.Print ExtractFromMSWord(oFile.Path)
                    ' *** /\ /\ /\ here is your file text, do with it what  you will. *** '
                    
            Next oFile
        End With
        
    End Sub
    Sub UnZipFile(sZipDir As Variant, sUnZipTo As Variant)
    
        Dim ShellApp As Object
        Set ShellApp = CreateObject("Shell.Application")
        
        ShellApp.Namespace(sUnZipTo).CopyHere ShellApp.Namespace(sZipDir).Items
        
    End Sub
    Function ExtractFromMSWord(DocxDir As String) As Variant
    
        Dim Doc As Document
        Set Doc = Documents.Open(DocxDir)
        
        ExtractFromMSWord = Doc.Content.Text
        ' Close without saving so Word does not prompt about the temporary .bin file
        Doc.Close SaveChanges:=False
        
    End Function

Make sure to add references to:

  • Microsoft Word 16.0 Object Library
  • Microsoft Shell Controls And Automation

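Assuming that the objects you have found actually have the OLE ClassType "Package", they are almost certainly OLE (Object Linking and Embedding) objects. In the "Package" case specifically, the text file is encoded in an "OLE1"-format object (OLE1 is a very old version of OLE), which is then embedded in an OLE2 object, and that in turn is encoded as a CFB (Compound File Binary) file. This is a difficult format to work with in VB(A). There is an example of how to do it using C# here.

Note that for short text files you can usually open the .bin CFB file, find the text near the bottom, and copy/paste it somewhere else. But for longer files, for example anything over 512 bytes (the length of a standard CFB sector), the file will be split across multiple sectors and you may have to work rather harder.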
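To make the "Package" layering more concrete, here is a hedged Python sketch that decodes the inner OLE1 payload once it has already been read out of the CFB container. It assumes the payload is the stream usually named `\x01Ole10Native`, and that it follows the layout commonly described by open-source parsers: a 4-byte size, a 2-byte flags field, two null-terminated strings (label and source path), 8 bytes of undocumented data, a null-terminated temp path, a 4-byte data length, then the file bytes. That layout is an assumption on my part, not something stated in this answer. Reading the stream out of the CFB still needs a CFB reader and is not shown. The names `parse_ole10native` and `build_sample_stream` are hypothetical.

```python
import struct


def _read_cstring(data, pos):
    """Read a null-terminated byte string at pos; return (text, next_pos)."""
    end = data.index(b"\x00", pos)
    return data[pos:end].decode("latin-1"), end + 1


def parse_ole10native(stream):
    """Return (label, payload bytes) from an Ole10Native stream (assumed layout)."""
    pos = 4                                    # skip the 4-byte total size
    pos += 2                                   # skip the 2-byte flags field
    label, pos = _read_cstring(stream, pos)    # display name of the file
    _source_path, pos = _read_cstring(stream, pos)
    pos += 8                                   # 8 bytes of undocumented data
    _temp_path, pos = _read_cstring(stream, pos)
    (size,) = struct.unpack_from("<I", stream, pos)
    pos += 4
    payload = stream[pos:pos + size]
    if len(payload) != size:
        raise ValueError("stream is truncated")
    return label, payload


def build_sample_stream(label, path, payload):
    """Build a synthetic stream in the assumed layout, for demonstration only."""
    body = (
        struct.pack("<H", 2)
        + label.encode("latin-1") + b"\x00"
        + path.encode("latin-1") + b"\x00"
        + b"\x00" * 8
        + path.encode("latin-1") + b"\x00"
        + struct.pack("<I", len(payload))
        + payload
    )
    return struct.pack("<I", len(body)) + body


if __name__ == "__main__":
    sample = build_sample_stream("notes.txt", "C:\\temp\\notes.txt", b"hello world")
    name, data = parse_ole10native(sample)
    print(name, data)
```

Because the payload length is stored explicitly, this approach would not depend on the text fitting in a single sector, provided the CFB reader has already reassembled the stream.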

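So, to avoid all that, it seems to be possible to copy the relevant object to the clipboard and then use the Windows Shell to paste it into a folder, at which point the wrapper helpfully seems to get stripped off. There are plenty of examples on SO and "out there", for example here. It's a kludge, of course, but it does actually seem to work with the test text files I have here.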

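To try it, you need to create or choose two folders. One is a temporary folder into which the file is pasted from the clipboard. The example I use is "c:\temp". Please delete everything from it before running this code.

The second stores the renamed output files. I called mine c:\target.

You also need to add a reference (VB Editor Tools->References) to the Microsoft Shell Controls And Automation library.

Then you can use the following code: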

Sub ExtractAndSaveEmbeddedFiles()

  ' The OLE ClassType we're looking for
  Const OLEClassType As String = "Package"

  ' These strings actually have to be Variants
  ' to make the Shell calls work
  Const vFolderTemp As Variant = "c:\temp\"
  Const vFolderTarget As Variant = "c:\target\"
  Const vVerbPaste As Variant = "Paste"

  Dim i As Long
  
  Dim objEmbeddedShape As InlineShape
  Dim objFolderTemp As Shell32.Folder
  Dim objFolderTarget As Shell32.Folder
  Dim objShell As Shell32.Shell
  Dim objShellFolderItem As Shell32.ShellFolderItem
  Dim objTempItem As Shell32.FolderItem
  
  i = 0
  
  ' Set up various Shell objects
  Set objShell = New Shell32.Shell
  Set objFolderTemp = objShell.Namespace(vFolderTemp)
  Set objShellFolderItem = objShell.Namespace(vFolderTemp).Self
  Set objFolderTarget = objShell.Namespace(vFolderTarget)
  
  With ActiveDocument
    For Each objEmbeddedShape In .InlineShapes
      If objEmbeddedShape.OLEFormat.ClassType = OLEClassType Then
        
        ' Copy the object to the Clipboard
        objEmbeddedShape.Range.Copy
        
        ' Extract to the temp folder. I don't see a reliable way either
        ' to get the name that the Paste operation will use
        ' (OLEFormat.IconLabel etc. do not do anything useful here)
        ' or set it, although it would be great if .InvokeVerbEx could do it
        objShellFolderItem.InvokeVerb vVerbPaste
        
        ' Change the name to something unique (and perhaps more useful)
        ' We can't use a numeric index into the .Folder's items and even
        ' if we could use the name, we don't know it. So iterate and
        ' (optional) exit when we have dealt with the one item
        For Each objTempItem In objFolderTemp.Items
          ' We can change the name, but we can't move the file
          ' by changing the path/name
          i = i + 1
          objTempItem.Name = "text object " & CStr(i) & ".txt"
          
          ' now use the target folder object to move the file
          ' These don't appear to *have* to be variants but...
          ' See https://docs.microsoft.com/en-us/windows/win32/shell/folder-movehere
          ' for the cvar(20) parameter
          objFolderTarget.MoveHere CVar(objTempItem), CVar(20)
          Exit For
        Next objTempItem
      End If
    Next objEmbeddedShape
  End With
End Sub

