
Extract images from PDF file with JavaScript

I want to write JavaScript code to extract all image files from a PDF file, perhaps getting them as JPG or some other image format. There is already some JavaScript code for reading a PDF file, for example in the PDF viewer pdf-js.

window.addEventListener('change', function webViewerChange(evt) {
  var files = evt.target.files;
  if (!files || files.length === 0)
    return;

  // Read the local file into a Uint8Array.
  var fileReader = new FileReader();
  fileReader.onload = function webViewerChangeFileReaderOnload(evt) {
    var buffer = evt.target.result;
    var uint8Array = new Uint8Array(buffer);
    PDFView.open(uint8Array, 0);
  };

  var file = files[0];
  fileReader.readAsArrayBuffer(file);
  PDFView.setTitleUsingUrl(file.name);
  ........

Can I use this code to help read and extract the image files?

If you open a page with pdf.js, for example

PDFJS.getDocument({url: <pdf file>}).then(function (doc) {
    doc.getPage(1).then(function (page) {
        window.page = page;
    })
})

you can then use getOperatorList to search for paintJpegXObject objects and grab the resources.

window.objs = []
page.getOperatorList().then(function (ops) {
    for (var i=0; i < ops.fnArray.length; i++) {
        if (ops.fnArray[i] == PDFJS.OPS.paintJpegXObject) {
            window.objs.push(ops.argsArray[i][0])
        }
    }
})

Now window.objs will have a list of the resources from that page that you need to fetch.

console.log(window.objs.map(function (a) { return page.objs.get(a) }))

should print to the console a bunch of <img /> objects with data-uri src= attributes. These can be directly inserted into the page, or you can do more scripting to get at the raw data.
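As a rough sketch of the "more scripting" step, a data URI can be decoded into raw bytes along these lines. The helper name dataUriToBytes is hypothetical and not part of pdf.js; the sketch assumes the src is a base64 data URI and that atob is available (browsers and recent Node.js versions).

```javascript
// Hypothetical helper (not part of pdf.js): decode a base64 data URI,
// such as the src of an <img>, into its MIME type and raw bytes.
function dataUriToBytes(dataUri) {
  var comma = dataUri.indexOf(',');
  if (dataUri.slice(0, 5) !== 'data:' || comma === -1) {
    throw new Error('Not a data URI');
  }
  // Header looks like "image/jpeg;base64".
  var parts = dataUri.slice(5, comma).split(';');
  if (parts.indexOf('base64') === -1) {
    throw new Error('Only base64 data URIs are handled in this sketch');
  }
  // atob yields a binary string: one character per byte.
  var binary = atob(dataUri.slice(comma + 1));
  var bytes = new Uint8Array(binary.length);
  for (var i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return { mimeType: parts[0], bytes: bytes };
}
```

In a browser, the result could then be wrapped as new Blob([result.bytes], {type: result.mimeType}) to offer the image as a download.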

It only works for embedded JPEG objects, but it's a start!
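To look for more than one kind of paint operator, the loop above could be generalized roughly as follows. The function name collectObjectNames is hypothetical; it only assumes that the operator list has the parallel fnArray and argsArray arrays used above, and it takes the operator codes as a parameter rather than assuming which ones a given pdf.js version defines.

```javascript
// Hypothetical helper: collect the first argument (the object name) of every
// operator whose code appears in opCodes, skipping duplicates.
// ops is assumed to have the shape shown above: { fnArray, argsArray }.
function collectObjectNames(ops, opCodes) {
  var names = [];
  for (var i = 0; i < ops.fnArray.length; i++) {
    if (opCodes.indexOf(ops.fnArray[i]) === -1) {
      continue; // not one of the operators we are looking for
    }
    var args = ops.argsArray[i];
    if (args && names.indexOf(args[0]) === -1) {
      names.push(args[0]);
    }
  }
  return names;
}
```

It might be called as collectObjectNames(ops, [PDFJS.OPS.paintJpegXObject, PDFJS.OPS.paintImageXObject]). I believe pdf.js also defines paintImageXObject for non-JPEG images, but treat that as an assumption to check against your version; objects fetched for it are likely raw pixel data rather than <img> elements.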

Here is a link to a working example of getting images from a PDF and adding an alpha channel to the Uint8ClampedArray so that it can be displayed. It displays the images in a canvas.

const canvas = document.createElement('canvas');
canvas.width = imageWidth;
canvas.height = imageHeight;
const ctx = canvas.getContext('2d');
ctx!.putImageData(imageData, 0, 0);
const dataURL = canvas.toDataURL();
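The alpha-channel step is not shown in the snippet above. A minimal sketch of it, assuming the extracted pixel data is tightly packed 8-bit RGB (3 bytes per pixel), might look like this. The function name rgbToRgba is hypothetical.

```javascript
// Hypothetical helper: expand packed RGB pixels to RGBA with full opacity,
// which is the layout that canvas ImageData expects.
function rgbToRgba(rgb, width, height) {
  if (rgb.length !== width * height * 3) {
    throw new Error('Expected ' + (width * height * 3) + ' bytes of RGB data');
  }
  var rgba = new Uint8ClampedArray(width * height * 4);
  for (var src = 0, dst = 0; src < rgb.length; src += 3, dst += 4) {
    rgba[dst] = rgb[src];         // red
    rgba[dst + 1] = rgb[src + 1]; // green
    rgba[dst + 2] = rgb[src + 2]; // blue
    rgba[dst + 3] = 255;          // alpha: fully opaque
  }
  return rgba;
}
```

In a browser, the result can be wrapped as new ImageData(rgbToRgba(data, imageWidth, imageHeight), imageWidth, imageHeight) and passed to putImageData as above. Data that is already RGBA, or grayscale, would need different handling.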

I've created a package for extracting images from a PDF. If you need to get images from a PDF, you can use this package; it returns the images in base64 format.

https://www.npmjs.com/package/pdf-pages-to-base64-images
