
PDF to plain text: some difficult pages encountered in Adobe Acrobat XI

Basic Problem: For this PDF: https://1drv.ms/u/s?AsrLaUgt0KCLhXtP-jYDd4Z0ujKQ?e=xSu2ZR

I am unable to convert this PDF to plain text with Adobe Acrobat XI Standard, either manually via Save As or with the batch conversion script below. The generated file is blank.

Full problem: As part of my attempts to batch convert PDFs to text, I have run into a strange error where Acrobat XI returns the following:

[image: screenshot of the Acrobat XI error message]

Disappointingly, clicking OK produces a blank text file.

The following script loops through PDF files and converts them to text files using Acrobat. It works fine for most PDFs, except ones with figures like the one above.

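Sub LoopThroughFiles()
    Dim StrFile As String
    Dim pdfPath As String
    Dim fileRoot As String
    Dim success As Boolean

    fileRoot = "C:\temp\PDFs\"
    If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \

    'only pick up PDFs, so the .txt files written below are not processed as input
    StrFile = Dir(fileRoot & "*.pdf")

    'keep going if one PDF fails (note: this hides run-time errors)
    On Error Resume Next

    Do While Len(StrFile) > 0
        pdfPath = fileRoot & StrFile
        Debug.Print pdfPath

        success = False 'reset so a swallowed error is not reported as success
        success = ConvertPdf2(pdfPath, pdfPath & ".txt")
        If Not success Then Debug.Print "Conversion failed: " & pdfPath

        StrFile = Dir 'next matching file
    Loop
End Sub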


'returns True if the PDF could be opened (based on whether `Open` succeeded or not);
'it does not check whether the saved text file is empty
Function ConvertPdf2(pdfPath As String, textPath As String) As Boolean
    Dim AcroXApp As Acrobat.AcroApp
    Dim AcroXAVDoc As Acrobat.AcroAVDoc
    Dim AcroXPDDoc As Acrobat.AcroPDDoc
    Dim jsObj As Object, success As Boolean

    Set AcroXApp = CreateObject("AcroExch.App")
    Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
    success = AcroXAVDoc.Open(pdfPath, "Acrobat") '<<< returns False if it fails
    If success Then
        'give Acrobat a moment to load the document; without this pause the PC can freeze
        Application.Wait (Now + TimeValue("0:00:02"))

        Set AcroXPDDoc = AcroXAVDoc.GetPDDoc
        Set jsObj = AcroXPDDoc.GetJSObject
        jsObj.SaveAs textPath, "com.adobe.acrobat.plain-text"
        AcroXAVDoc.Close False
    End If
    AcroXApp.Hide
    AcroXApp.Exit
    ConvertPdf2 = success 'report success/failure
End Function

The error appears to come from the line jsObj.SaveAs textPath, "com.adobe.acrobat.plain-text". If I instead use jsObj.SaveAs textPath, "com.adobe.acrobat.accesstext", the text file is generated, but for my needs it is important that the generated file is in the plain-text format.

The reason can be seen below with a different PDF. These are the two types of text file generated:

Plain text (sentences extend in the horizontal direction; this is what I need): [image: plain-text output]

Accessible text (creates more of a body of text; sentences are split by carriage returns, which is problematic): [image: accessible-text output]

I reckon this is a lost cause for these sorts of PDFs, which is disappointing, as many of the PDFs I need to convert are in this format. I seem to have been plagued with issues trying to solve this one.

Anyway, I just wondered whether it is possible to disable the popup message, and whether that might allow the plain-text write to occur.

Alternatively, I can't think of much else.

It looks like your Acrobat XI has an issue, since this "works for me", although I am using the older Reader 9. However, its export as plain text is going to be what you would get from pdftotext, e.g. left-aligned single lines. I am unsure whether 10 Pro or a 20## release would be good enough; when did Adobe start massaging the natural PDF output into something richer?
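Since pdftotext output is said above to be comparable, here is a minimal sketch of calling it from the same VBA loop instead of Acrobat. It assumes pdftotext (from Xpdf or Poppler) is installed and on the PATH. The names BuildPdfToTextCommand and ConvertWithPdfToText are hypothetical helpers, not part of the original script, and I have not checked how pdftotext copes with the PDF in the question.

```vba
' Hypothetical helper: builds the command line, quoting both paths so that
' spaces in folder or file names survive.
Function BuildPdfToTextCommand(pdfPath As String, textPath As String) As String
    BuildPdfToTextCommand = "pdftotext """ & pdfPath & """ """ & textPath & """"
End Function

' Hypothetical helper: runs pdftotext and waits for it to finish.
' Assumption: pdftotext is on the PATH; Run raises an error if it cannot be found.
Function ConvertWithPdfToText(pdfPath As String, textPath As String) As Boolean
    Dim sh As Object
    Dim exitCode As Long

    Set sh = CreateObject("WScript.Shell")
    ' 0 = hidden window, True = wait for the process and return its exit code
    exitCode = sh.Run(BuildPdfToTextCommand(pdfPath, textPath), 0, True)
    ConvertWithPdfToText = (exitCode = 0)
End Function
```

In the loop, ConvertWithPdfToText(pdfPath, pdfPath & ".txt") could stand in for the ConvertPdf2 call.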

Reader 9 export as plain text: [image]

[image]

Opening in other viewers works well enough to save as Word or WordPad: [image] [image]

Or edit the PDF before saving as DOCX or converting to text: [image] [image]

The best bet is likely just to accept the accessible text and transform it into something that resembles plain text using logical rules.
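As one possible "logical rule", the sketch below joins the hard-wrapped lines of the accessible-text output into one line per paragraph. It assumes that a blank line marks a paragraph break, which may not hold for every PDF. UnwrapAccessText is a hypothetical helper name, and real files may need extra rules (for example, for headings or hyphenated words).

```vba
' Hypothetical helper: joins hard-wrapped lines into one line per paragraph.
' Assumption: a blank line in the accessible-text output marks a paragraph break.
Function UnwrapAccessText(ByVal rawText As String) As String
    Dim textLines() As String
    Dim result As String, current As String, part As String
    Dim i As Long

    ' Normalise line endings so Split sees a single delimiter
    rawText = Replace(rawText, vbCrLf, vbLf)
    rawText = Replace(rawText, vbCr, vbLf)
    textLines = Split(rawText, vbLf)

    For i = LBound(textLines) To UBound(textLines)
        part = Trim$(textLines(i))
        If Len(part) = 0 Then
            ' Blank line: flush the paragraph collected so far
            If Len(current) > 0 Then
                If Len(result) > 0 Then result = result & vbCrLf
                result = result & current
                current = ""
            End If
        ElseIf Len(current) = 0 Then
            current = part
        Else
            ' Continuation line: append to the current paragraph
            current = current & " " & part
        End If
    Next i

    ' Flush the last paragraph
    If Len(current) > 0 Then
        If Len(result) > 0 Then result = result & vbCrLf
        result = result & current
    End If

    UnwrapAccessText = result
End Function
```

The function works on a string, so the .txt file written with "com.adobe.acrobat.accesstext" would have to be read into a string first and the result written back out.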


 