Extracting text from a PDF (PDF-to-text conversion) with Google Apps Script

I have a script which gets (searchable) PDF attachments from certain Gmail messages.

Now I need to extract some string data from these PDFs.

Is there some way to add it to Google Drive with OCR-conversion enabled and to extract the text from that file? Or is there even a better way to solve my problem?

You say you start with "searchable" PDF attachments. I assume you mean they don't actually have text-type content, but are instead scanned documents with the text embedded in the page image. Google will automatically perform OCR on them if you store them in Drive. However, the OCR result is not stored as part of the file content; it is only used to index the document so that it can later be found with Drive search (i.e. it is internal to Drive and not exposed).

However, you might want to try the DocsList API method https://developers.google.com/apps-script/reference/docs-list/file#getContentAsString(), which could work on your PDFs if they actually contain text (and not text-on-image). Note that the DocsList service has since been deprecated in favor of DriveApp.

The pdfToText() utility from "Get pdf-attachments from Gmail as text" uses the advanced Drive service and DocumentApp to convert a PDF to a Google Doc and then to text. You can get the OCR'd text this way, or save it directly to a .txt file in any folder on your Drive.
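The conversion path described above (PDF to Google Doc to text) can be sketched roughly as follows. This is not the actual pdfToText() utility, only a minimal illustration; the function name pdfBlobToText and its keepDoc parameter are hypothetical, and the sketch assumes the Drive advanced service is enabled with Drive API v2, where Drive.Files.insert accepts the ocr option.

```javascript
// Hypothetical helper, not the original pdfToText(): a minimal sketch that
// assumes the Drive advanced service (Drive API v2) is enabled for the script.
function pdfBlobToText(pdfBlob, keepDoc) {
  // File metadata for the upload; the title is taken from the blob itself.
  var resource = {
    title: pdfBlob.getName(),
    mimeType: pdfBlob.getContentType()
  };
  // With ocr: true, Drive converts the uploaded PDF into a Google Doc.
  var docFile = Drive.Files.insert(resource, pdfBlob, { ocr: true });
  // Read the recognized text back out of the converted document.
  var text = DocumentApp.openById(docFile.id).getBody().getText();
  // Unless the caller asks to keep it, delete the intermediate Google Doc.
  if (!keepDoc) {
    Drive.Files.remove(docFile.id);
  }
  return text;
}
```

A caller could pass an attachment obtained from a Gmail message, for example pdfBlobToText(message.getAttachments()[0], false), and then search the returned string for the data it needs.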

Here is one solution. You must first enable the Drive API in the Developers Console, and the Drive advanced service in the script editor.

Script to convert an attachment to text

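function uploadFile() {
  var search = "label:inbox";
  // Look at the first two threads that match the search.
  var threads = GmailApp.search(search, 0, 2);
  for (var i = 0; i < threads.length; i++) {
    var messages = GmailApp.getMessagesForThread(threads[i]);
    for (var j = 0; j < messages.length; j++) {
      var email = messages[j];
      var sujet = email.getSubject();
      // Only the first attachment of each message is processed.
      var data = email.getAttachments()[0];
      if (data) {
        var file = {
          title: sujet,
          // Use the attachment's real type instead of a hard-coded 'image/png'.
          mimeType: data.getContentType()
        };
        // Drive API v2 (advanced service): ocr: true converts the upload to a Google Doc.
        file = Drive.Files.insert(file, data, {ocr: true});
        var body = DocumentApp.openById(file.id).getBody();
        // Remove the page images so only the recognized text remains.
        // Use a separate index (k) so the outer thread loop's i is not overwritten.
        var imgs = body.getImages();
        for (var k = 0; k < imgs.length; k++) {
          imgs[k].removeFromParent();
        }
        // The extracted text is available here; logging it is just one option.
        Logger.log(body.getText());
      }
    }
  }
}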

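Script to convert an external file to text

function uploadExternalFile() {
  // Fetch the PDF from a URL as a blob.
  var pdf = UrlFetchApp.fetch('http://web.engr.oregonstate.edu/~dambrobr/classes/cs532/muggleton94inductive.pdf').getBlob();
  // The fetched file is a PDF, so declare it as one rather than as 'image/png'.
  var file = {title: 'IA', mimeType: 'application/pdf'};
  // Drive API v2 (advanced service): ocr: true converts the upload to a Google Doc.
  file = Drive.Files.insert(file, pdf, {ocr: true});
  var body = DocumentApp.openById(file.id).getBody();
  // Remove the page images so only the recognized text remains.
  var imgs = body.getImages();
  for (var i = 0; i < imgs.length; i++) {
    imgs[i].removeFromParent();
  }
  // The extracted text is available here; logging it is just one option.
  Logger.log(body.getText());
}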
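Once the converted document's text is available (for instance from getBody().getText() on the converted Google Doc), the string data the question asks about can be pulled out with a regular expression. The sketch below is only an illustration: the helper name extractField, the "label: value" layout, and the sample text are all assumptions, since the question does not say what the PDFs contain.

```javascript
// Hypothetical helper: returns the value that follows "label:" on the same
// line of the text, or null when the label is not found.
function extractField(text, label) {
  // Escape regex metacharacters so the label is matched literally.
  var escaped = label.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // Capture everything after the colon up to the end of that line.
  var match = new RegExp(escaped + '\\s*:\\s*(.+)').exec(text);
  return match ? match[1].trim() : null;
}

// Made-up sample standing in for the text of a converted PDF.
var sampleText = 'Invoice No: 12345\nTotal: 99.50 EUR';
var invoiceNo = extractField(sampleText, 'Invoice No');
var dueDate = extractField(sampleText, 'Due date');
```

Real OCR output is often noisier than this sample (extra spaces, misread characters), so the pattern would likely need adjusting to the actual documents.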
