簡體   English   中英

使用 VBA 和 Adobe Acrobat XI 標准將 PDF 轉換為文本文件

[英]Convert PDF to text file using VBA and Adobe Acrobat XI standard

上一篇文章的第 3 部分。

任務:我正在嘗試遍歷 excel 中提供的一系列 URL,並為每個 URL 生成完整的文本文件。

至此:上一篇中的VBA解決方案使用Word打開PDF Url,由此生成文本文件。 這比使用電源查詢要好得多。 但是,我遇到了一個失敗的實例,因為它將第一頁上的文本識別為圖像; 因此,在生成的文本文件中省略了該文本。

作為生成的截斷文本文件的示例: https://hpvchemicals.oecd.org/ui/handler.axd?id=b4b38713-7580-4843-86fd-614821a6f72b是一個示例,其中生成的 ZBCD1B68640 缺少 textAB59A8DFC1ZFF0 文件

按照建議,我已經下載了 Adobe Acrobat(Adobe Acrobat XI Standard)的一個版本,它啟用了Adobe Acrobat 10.0 Type Library參考(以及其他一些)。 我希望這將實現更准確的文本轉換,但需要幫助來開發此腳本。

所以,對於這個問題,我想修改下面的代碼,把 Word 換成 Abode 來執行轉換。 然而,我不確定的是,Adobe 是否可以像 word 一樣從 URL 打開 PDF。 我想我可能需要介紹一些步驟來首先在文件夾中下載 PDF,然后為每個轉換這些。 理想情況下,它會像上面的代碼一樣工作,但如果這是必要的,那就這樣吧。

上一個答案的 M 代碼:

Sub Tester()

    Dim filePath As String
    Dim fso As FileSystemObject, url
    Dim fileStream As TextStream, ws As Worksheet
    Dim oWd As Object, oDoc As Object, c As Range, fileRoot As String
    
    Set fso = New FileSystemObject
    Set oWd = CreateObject("word.application")
    
    Set ws = Worksheets("Data")     'use a specific worksheet reference
    fileRoot = ws.Range("D2").Value 'read this once
    If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
    
    For Each c In ws.Range("B2:B" & ws.Cells(Rows.Count, "B").End(xlUp).row).Cells
        url = Trim(c.Value)
        If LCase(url) Like "http?:*" Then  'has a URL
            Set oDoc = Nothing
            On Error Resume Next 'ignore error if no document...
            Set oDoc = oWd.Documents.Open(url)
            On Error GoTo 0      'stop ignoring errors
            If Not oDoc Is Nothing Then
                filePath = fileRoot & c.Offset(0, -1).Value & ".txt" 'filename from ColA
                Debug.Print filePath
                'open text stream as unicode
                Set fileStream = fso.CreateTextFile(filePath, overwrite:=True, Unicode:=True)
                fileStream.Write oDoc.Range.Text
                fileStream.Close
                oDoc.Close
                c.Interior.Color = vbGreen 'flag OK
            Else
                c.Interior.Color = vbRed   'flag problem
            End If
        End If 'have url
    Next c
    
    oWd.Quit
End Sub

至於開發這個腳本,我從網上找的有以下,現在用Acrobat成功的對之前比較難的PDF進行了轉換,所以很好的開始。

子轉換pdf2()

    Dim AcroXApp As Acrobat.AcroApp
    Dim AcroXAVDoc As Acrobat.AcroAVDoc
    Dim AcroXPDDoc As Acrobat.AcroPDDoc
    Dim Filename As String
    Dim jsObj As Object
    Dim NewFileName As String

    Filename = "C:\temp\name1.pdf"
    NewFileName = "C:\temp\name1.txt"

    Set AcroXApp = CreateObject("AcroExch.App")
    'AcroXApp.Show

    Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
    AcroXAVDoc.Open Filename, "Acrobat"

    Set AcroXPDDoc = AcroXAVDoc.GetPDDoc


    Set jsObj = AcroXPDDoc.GetJSObject


    jsObj.SaveAs NewFileName, "com.adobe.acrobat.plain-text"


    AcroXAVDoc.Close False
    AcroXApp.Hide
    AcroXApp.Exit

    End Sub

其中 name1 是下載的 PDF。

所以,本質上我試圖:

  • 從 URL 列表下載 PDF/打開 PDF 到指定文件夾(如有必要)
  • 使用 VBA 將其轉換為文本

我想如果我能做到這一點,那么我將能夠弄清楚如何將它與上一篇文章合並,以便它可以循環通過給定數量的 URL 來生成文本文件以及處理非 unicofe 字符。

如果我有任何進展會更新

Excel 中的設置: 在此處輸入圖像描述

網址:

https://hpvchemicals.oecd.org/ui/handler.axd?id=e19d2799-0c16-496d-a607-b09330dd28a7
https://hpvchemicals.oecd.org/ui/handler.axd?id=40da06b1-a855-4c0c-bc21-bbc856dca725
https://hpvchemicals.oecd.org/ui/handler.axd?id=c4967546-1f5e-472a-b629-a2998323735b
https://hpvchemicals.oecd.org/ui/handler.axd?id=bde5e625-83ee-423d-aa70-eb0e453088e4
https://hpvchemicals.oecd.org/ui/handler.axd?id=621c4f55-ef3c-4b99-bb98-e6aaf3f436dd
https://hpvchemicals.oecd.org/ui/handler.axd?id=26e1420d-f9b7-4768-b6fa-d345f54e7683
https://hpvchemicals.oecd.org/ui/handler.axd?id=263f3491-90c7-4c3a-b43e-4c4e9395bcea
https://hpvchemicals.oecd.org/ui/handler.axd?id=b78d39a9-26c2-48ff-aadc-cb056a89f08b
https://hpvchemicals.oecd.org/ui/handler.axd?id=97a7b56f-ebaf-4416-8b4b-88b19ca3bd16
https://hpvchemicals.oecd.org/ui/handler.axd?id=c6c3b7c1-9239-40d9-b51a-85a15e2411d6

更新:

在此處輸入圖像描述

測試:

#If VBA7 Then
    Private Declare PtrSafe Function URLDownloadToFile Lib "urlmon" _
      Alias "URLDownloadToFileA" (ByVal pCaller As LongPtr, ByVal szURL As String, _
        ByVal szFileName As String, ByVal dwReserved As Long, ByVal lpfnCB As LongPtr) As Long
#Else
    Private Declare Function URLDownloadToFile Lib "urlmon" _
      Alias "URLDownloadToFileA" (ByVal pCaller As Long, ByVal szURL As String, _
        ByVal szFileName As String, ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long
#End If

Sub Tester()
    Dim filePath As String
    Dim fso As FileSystemObject, url
    Dim fileStream As TextStream, ws As Worksheet
    Dim c As Range, fileRoot As String, pdfPath As String
    
    Set ws = Worksheets("Data")     'use a specific worksheet reference
    fileRoot = ws.Range("D2").Value 'read this once
    If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
    
    For Each c In ws.Range("B2:B" & ws.Cells(Rows.Count, "B").End(xlUp).row).Cells
        url = Trim(c.Value)
        If LCase(url) Like "http?:*" Then  'has a URL
            
            pdfPath = fileRoot & "PDF_" & c.Offset(0, -1).Value & ".pdf"
            DownloadFile url, pdfPath
            ConvertPdf2 pdfPath, fileRoot & c.Offset(0, -1).Value & ".txt"
            
        End If 'have url
    Next c
End Sub

Function DownloadFile(sURL, sSaveAs) As Boolean
    DownloadFile = (URLDownloadToFile(0, sURL, sSaveAs, 0, 0) = 0)
End Function

Sub ConvertPdf2(pdfPath As String, textPath As String)
    Dim AcroXApp As Acrobat.AcroApp
    Dim AcroXAVDoc As Acrobat.AcroAVDoc
    Dim AcroXPDDoc As Acrobat.AcroPDDoc
    Dim jsObj As Object

    Set AcroXApp = CreateObject("AcroExch.App")
    Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
    AcroXAVDoc.Open pdfPath, "Acrobat"
    Set AcroXPDDoc = AcroXAVDoc.GetPDDoc
    Set jsObj = AcroXPDDoc.GetJSObject
    jsObj.SaveAs textPath, "com.adobe.acrobat.plain-text"
    AcroXAVDoc.Close False
    AcroXApp.Hide
    AcroXApp.Exit
End Sub

我不在 position 中幫助自動化 acrobat,但這是我用來下載 PDF 的程序。 將 URL 提供給 PDF 文件和保存文件的本地路徑。 僅適用於 Excel 的 windows 版本。

Sub saveFile(url As String, Optional path As String)

    Dim http As Object
    Dim objfso As Object
    Dim objADOStream As Object
    
    If path = "" Then path = ThisWorkbook.path & "\file.pdf"
    
    Set http = CreateObject("MSXML2.serverXMLHTTP")
    http.Open "GET", url, False
    http.Send
 
    If http.Status = 200 Then
      Set objADOStream = CreateObject("ADODB.Stream")
      objADOStream.Open
      objADOStream.Type = 1 'adTypeBinary
 
      objADOStream.Write http.responsebody
      objADOStream.Position = 0    'Set the stream position to the start
 
      Set objfso = CreateObject("Scripting.FileSystemObject")
        If objfso.fileexists(path) Then objfso.DeleteFile path
      Set objfso = Nothing
 
      objADOStream.SaveToFile path
      objADOStream.Close
      Set objADOStream = Nothing
    End If
 
End Sub

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM