
Interact with files client side in JavaScript

I need to extract the text of a PDF using only client side JavaScript.

I have this JSFiddle http://jsfiddle.net/go279m0h/

   document.getElementById('file').addEventListener('change', readFile, false);

 /** This works
 * Extract text from PDFs with PDF.js
 * https://mozilla.github.io/pdf.js/getting_started/
 */
pdfToText = function(data) {

    PDFJS.workerSrc = "{{ url_for('static', filename='js/pdf.worker.js') }}";
    PDFJS.cMapUrl = "{{ url_for('static', filename='cmaps') }}";
    PDFJS.cMapPacked = true;

    return PDFJS.getDocument(data).then(function(pdf) {
        var pages = [];
        for (var i = 0; i < pdf.numPages; i++) {
            pages.push(i);
        }
        return Promise.all(pages.map(function(pageNumber) {
            return pdf.getPage(pageNumber + 1).then(function(page) {
                return page.getTextContent().then(function(textContent) {
                    return textContent.items.map(function(item) {
                        return item.str;
                    }).join(' ');
                });
            });
        })).then(function(pages) {
            return pages.join("\r\n");
        });
    });
}



    // this function should get the text of a pdf file and print it to the console.  
   function readFile (evt) {
       var files = evt.target.files;
       var file = files[0];

       // following from https://stackoverflow.com/questions/1554280/extract-text-from-pdf-in-javascript
       // using PDFJS function 
       self.pdfToText(files[0].path).then(function(result) {
           console.log("PDF done!", result);
       })


       /*
       var reader = new FileReader();
       reader.onload = function() {
         console.log(this.result);            
       }
       //reader.readAsText(file)
       */
    }

The PDF.js function currently extracts text when I give it a server-side file path, but I can't get it to accept files[0], the file the user selects.

The error I keep getting when I run this is "Uncaught Error: Invalid parameter in getDocument, need either Uint8Array, string or a parameter object"

I got the function from the second answer from the bottom of "Extract text from PDF in JavaScript" (https://stackoverflow.com/questions/1554280/extract-text-from-pdf-in-javascript), and it works for text extraction.

JavaScript in the browser is made safer by running it in a "sandbox", a restricted environment that limits or outright denies access to the host filesystem. Most, if not all, browsers use this approach. That said, a page is allowed to read the contents of a file the user explicitly selects through a file input, so security shouldn't be the problem here.

Looking at the definition of 'pdfToText', it passes its argument straight to 'PDFJS.getDocument', which, going by the error message, wants a Uint8Array (the file's raw bytes, properly called 'octets'), a string, or a parameter object.

Looking at the error message and then at the call to 'pdfToText', you are passing 'files[0].path' rather than a raw buffer. In a browser, the File objects that come from a file input normally have no 'path' property (as far as I know, only environments such as Electron add one), so that expression is most likely undefined, which is none of the accepted types. You will need to read the selected File object as a raw stream of bytes and then feed those bytes, as a Uint8Array, to 'pdfToText'. That should fix it.

https://developer.mozilla.org/en-US/docs/Web/API/FileReader/readAsBinaryString

The commented-out chunk at the bottom is a good start: you could replace 'readAsText(file)' with 'readAsBinaryString(file)', or probably better 'readAsArrayBuffer(file)', since the error message asks for a Uint8Array and a bare string would, I believe, be treated as a URL. Reading is asynchronous, so you need an "it's done reading" handler ('onload'); when it fires, the reader's 'result' attribute contains the buffer, which you can wrap with 'new Uint8Array(...)' and pass to 'pdfToText'. So you'll have to rearrange things so that the 'pdfToText' call happens inside the handler that is called when the file has been read. Comment if you get stuck.
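The rearrangement described above might look roughly like the sketch below. It assumes the 'pdfToText' function from the question is already defined on the page, and that 'PDFJS.getDocument' accepts a Uint8Array, as the error message indicates. The helper name 'readFileAsBytes' is made up for illustration and is not part of PDF.js.

```javascript
// Hypothetical helper (not part of PDF.js): read a user-selected File
// into a Uint8Array, resolving once the FileReader has finished.
function readFileAsBytes(file) {
  return new Promise(function (resolve, reject) {
    var reader = new FileReader();
    reader.onload = function () {
      // reader.result is an ArrayBuffer here; wrap it as raw bytes.
      resolve(new Uint8Array(reader.result));
    };
    reader.onerror = function () {
      reject(reader.error);
    };
    reader.readAsArrayBuffer(file);
  });
}

// Replacement for the question's readFile handler. It assumes the
// question's pdfToText(data) is defined elsewhere on the page.
function readFile(evt) {
  var file = evt.target.files[0];
  if (!file) {
    return Promise.resolve(); // nothing was selected
  }
  return readFileAsBytes(file)
    .then(function (bytes) {
      return pdfToText(bytes);
    })
    .then(function (text) {
      console.log("PDF done!", text);
      return text;
    });
}
```

The existing 'addEventListener('change', readFile, false)' line from the question can stay as it is; only the body of the handler changes.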

The examples at https://mozilla.github.io/pdf.js/examples/ suggest that it should be possible to pass a string, but that string is a URL the page can fetch rather than a path on the user's disk, which the aforementioned sandbox does not expose.
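If you would rather keep passing a string, one option that may work is to give the selected file a temporary 'blob:' URL with the standard 'URL.createObjectURL' and pass that instead. This is only a sketch: it assumes your PDF.js version can fetch 'blob:' URLs and that the question's 'pdfToText' is defined on the page. The function name 'pdfTextFromFileViaUrl' is made up for illustration.

```javascript
// Hypothetical helper: hand pdfToText a temporary blob: URL for the
// selected File, then release the URL once extraction has finished.
function pdfTextFromFileViaUrl(file) {
  var url = URL.createObjectURL(file);
  return Promise.resolve(pdfToText(url)).then(
    function (text) {
      URL.revokeObjectURL(url); // free the temporary URL on success
      return text;
    },
    function (err) {
      URL.revokeObjectURL(url); // and on failure
      throw err;
    }
  );
}
```

Inside the change handler this would be called as 'pdfTextFromFileViaUrl(evt.target.files[0])'.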

I'm still pretty new to JavaScript, and I welcome corrections on the various claims I've made. :-)
