[英]How to convert pdf to text file using vba but not using acrobat library
[英]Convert PDF to text file using VBA and Adobe Acrobat XI standard
上一篇文章的第 3 部分。
任務:我正在嘗試遍歷 excel 中提供的一系列 URL,並為每個 URL 生成完整的文本文件。
至此:上一篇中的VBA解決方案使用Word打開PDF Url,由此生成文本文件。 這比使用電源查詢要好得多。 但是,我遇到了一個失敗的實例,因為它將第一頁上的文本識別為圖像; 因此,在生成的文本文件中省略了該文本。
作為生成的截斷文本文件的示例: https://hpvchemicals.oecd.org/ui/handler.axd?id=b4b38713-7580-4843-86fd-614821a6f72b是一個示例,其中生成的 ZBCD1B68640 缺少 textAB59A8DFC1ZFF0 文件
按照建議,我已經下載了 Adobe Acrobat(Adobe Acrobat XI Standard)的一個版本,它啟用了Adobe Acrobat 10.0 Type Library
參考(以及其他一些)。 我希望這將實現更准確的文本轉換,但需要幫助來開發此腳本。
所以,對於這個問題,我想修改下面的代碼,把 Word 換成 Abode 來執行轉換。 然而,我不確定的是,Adobe 是否可以像 word 一樣從 URL 打開 PDF。 我想我可能需要介紹一些步驟來首先在文件夾中下載 PDF,然后為每個轉換這些。 理想情況下,它會像上面的代碼一樣工作,但如果這是必要的,那就這樣吧。
上一個答案的 M 代碼:
Sub Tester()
Dim filePath As String
Dim fso As FileSystemObject, url
Dim fileStream As TextStream, ws As Worksheet
Dim oWd As Object, oDoc As Object, c As Range, fileRoot As String
Set fso = New FileSystemObject
Set oWd = CreateObject("word.application")
Set ws = Worksheets("Data") 'use a specific worksheet reference
fileRoot = ws.Range("D2").Value 'read this once
If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
For Each c In ws.Range("B2:B" & ws.Cells(Rows.Count, "B").End(xlUp).row).Cells
url = Trim(c.Value)
If LCase(url) Like "http?:*" Then 'has a URL
Set oDoc = Nothing
On Error Resume Next 'ignore error if no document...
Set oDoc = oWd.Documents.Open(url)
On Error GoTo 0 'stop ignoring errors
If Not oDoc Is Nothing Then
filePath = fileRoot & c.Offset(0, -1).Value & ".txt" 'filename from ColA
Debug.Print filePath
'open text stream as unicode
Set fileStream = fso.CreateTextFile(filePath, overwrite:=True, Unicode:=True)
fileStream.Write oDoc.Range.Text
fileStream.Close
oDoc.Close
c.Interior.Color = vbGreen 'flag OK
Else
c.Interior.Color = vbRed 'flag problem
End If
End If 'have url
Next c
oWd.Quit
End Sub
至於開發這個腳本,我從網上找的有以下,現在用Acrobat成功的對之前比較難的PDF進行了轉換,所以很好的開始。
子轉換pdf2()
Dim AcroXApp As Acrobat.AcroApp
Dim AcroXAVDoc As Acrobat.AcroAVDoc
Dim AcroXPDDoc As Acrobat.AcroPDDoc
Dim Filename As String
Dim jsObj As Object
Dim NewFileName As String
Filename = "C:\temp\name1.pdf"
NewFileName = "C:\temp\name1.txt"
Set AcroXApp = CreateObject("AcroExch.App")
'AcroXApp.Show
Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
AcroXAVDoc.Open Filename, "Acrobat"
Set AcroXPDDoc = AcroXAVDoc.GetPDDoc
Set jsObj = AcroXPDDoc.GetJSObject
jsObj.SaveAs NewFileName, "com.adobe.acrobat.plain-text"
AcroXAVDoc.Close False
AcroXApp.Hide
AcroXApp.Exit
End Sub
其中 name1 是下載的 PDF。
所以,本質上我試圖:
我想如果我能做到這一點,那么我將能夠弄清楚如何將它與上一篇文章合並,以便它可以循環通過給定數量的 URL 來生成文本文件以及處理非 unicofe 字符。
如果我有任何進展會更新
網址:
https://hpvchemicals.oecd.org/ui/handler.axd?id=e19d2799-0c16-496d-a607-b09330dd28a7
https://hpvchemicals.oecd.org/ui/handler.axd?id=40da06b1-a855-4c0c-bc21-bbc856dca725
https://hpvchemicals.oecd.org/ui/handler.axd?id=c4967546-1f5e-472a-b629-a2998323735b
https://hpvchemicals.oecd.org/ui/handler.axd?id=bde5e625-83ee-423d-aa70-eb0e453088e4
https://hpvchemicals.oecd.org/ui/handler.axd?id=621c4f55-ef3c-4b99-bb98-e6aaf3f436dd
https://hpvchemicals.oecd.org/ui/handler.axd?id=26e1420d-f9b7-4768-b6fa-d345f54e7683
https://hpvchemicals.oecd.org/ui/handler.axd?id=263f3491-90c7-4c3a-b43e-4c4e9395bcea
https://hpvchemicals.oecd.org/ui/handler.axd?id=b78d39a9-26c2-48ff-aadc-cb056a89f08b
https://hpvchemicals.oecd.org/ui/handler.axd?id=97a7b56f-ebaf-4416-8b4b-88b19ca3bd16
https://hpvchemicals.oecd.org/ui/handler.axd?id=c6c3b7c1-9239-40d9-b51a-85a15e2411d6
更新:
測試:
#If VBA7 Then
Private Declare PtrSafe Function URLDownloadToFile Lib "urlmon" _
Alias "URLDownloadToFileA" (ByVal pCaller As LongPtr, ByVal szURL As String, _
ByVal szFileName As String, ByVal dwReserved As Long, ByVal lpfnCB As LongPtr) As Long
#Else
Private Declare Function URLDownloadToFile Lib "urlmon" _
Alias "URLDownloadToFileA" (ByVal pCaller As Long, ByVal szURL As String, _
ByVal szFileName As String, ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long
#End If
Sub Tester()
Dim filePath As String
Dim fso As FileSystemObject, url
Dim fileStream As TextStream, ws As Worksheet
Dim c As Range, fileRoot As String, pdfPath As String
Set ws = Worksheets("Data") 'use a specific worksheet reference
fileRoot = ws.Range("D2").Value 'read this once
If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
For Each c In ws.Range("B2:B" & ws.Cells(Rows.Count, "B").End(xlUp).row).Cells
url = Trim(c.Value)
If LCase(url) Like "http?:*" Then 'has a URL
pdfPath = fileRoot & "PDF_" & c.Offset(0, -1).Value & ".pdf"
DownloadFile url, pdfPath
ConvertPdf2 pdfPath, fileRoot & c.Offset(0, -1).Value & ".txt"
End If 'have url
Next c
End Sub
Function DownloadFile(sURL, sSaveAs) As Boolean
DownloadFile = (URLDownloadToFile(0, sURL, sSaveAs, 0, 0) = 0)
End Function
Sub ConvertPdf2(pdfPath As String, textPath As String)
Dim AcroXApp As Acrobat.AcroApp
Dim AcroXAVDoc As Acrobat.AcroAVDoc
Dim AcroXPDDoc As Acrobat.AcroPDDoc
Dim jsObj As Object
Set AcroXApp = CreateObject("AcroExch.App")
Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
AcroXAVDoc.Open pdfPath, "Acrobat"
Set AcroXPDDoc = AcroXAVDoc.GetPDDoc
Set jsObj = AcroXPDDoc.GetJSObject
jsObj.SaveAs textPath, "com.adobe.acrobat.plain-text"
AcroXAVDoc.Close False
AcroXApp.Hide
AcroXApp.Exit
End Sub
我不在 position 中幫助自動化 acrobat,但這是我用來下載 PDF 的程序。 將 URL 提供給 PDF 文件和保存文件的本地路徑。 僅適用於 Excel 的 windows 版本。
Sub saveFile(url As String, Optional path As String)
Dim http As Object
Dim objfso As Object
Dim objADOStream As Object
If path = "" Then path = ThisWorkbook.path & "\file.pdf"
Set http = CreateObject("MSXML2.serverXMLHTTP")
http.Open "GET", url, False
http.Send
If http.Status = 200 Then
Set objADOStream = CreateObject("ADODB.Stream")
objADOStream.Open
objADOStream.Type = 1 'adTypeBinary
objADOStream.Write http.responsebody
objADOStream.Position = 0 'Set the stream position to the start
Set objfso = CreateObject("Scripting.FileSystemObject")
If objfso.fileexists(path) Then objfso.DeleteFile path
Set objfso = Nothing
objADOStream.SaveToFile path
objADOStream.Close
Set objADOStream = Nothing
End If
End Sub
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.