
Searching through PDF text with Node.js

I have thousands of searchable PDFs, some of which are up to 1 GB with over 2000 pages. I need to be able to search for a text string in these files using a Node.js app.

Right now, files are stored in a Google Cloud Storage bucket.

What's the best way to do this?

Some options:

  • Read the text from the PDF files into MySQL using something like the NPM package pdf-text-extract. Then use MySQL queries to search for text strings.
  • Search the PDF files directly using some NPM package.

Am I completely off? Is there a better way?

There are dedicated text search libraries out there, like this one, or this. Most likely you'd need to extract the plain text from each PDF, then save and index it. After that you'll be able to run search queries against the index. Setting up a database for this particular task may be overkill.
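To make the "extract, index, then query" idea concrete, here is a minimal sketch of an in-memory inverted index in plain Node.js. It assumes the text has already been extracted page by page (for example with a package such as pdf-text-extract); the names `tokenize`, `PageIndex`, `addPage` and `search` are made up for illustration and do not come from any library. It matches pages that contain every word of the query, which only approximates an exact string search, and a real setup with thousands of large PDFs would probably persist the index rather than hold it in memory.

```javascript
// Split text into lowercase word tokens (letters and digits, Unicode-aware).
function tokenize(text) {
  return text.toLowerCase().match(/[\p{L}\p{N}]+/gu) || [];
}

// Hypothetical helper: maps each token to the set of pages that contain it.
class PageIndex {
  constructor() {
    // token -> Set of "docId#pageNumber" keys
    this.postings = new Map();
  }

  // Index one page of already-extracted text.
  addPage(docId, pageNumber, text) {
    const key = `${docId}#${pageNumber}`;
    for (const token of new Set(tokenize(text))) {
      if (!this.postings.has(token)) {
        this.postings.set(token, new Set());
      }
      this.postings.get(token).add(key);
    }
  }

  // Return every page that contains all words of the query (logical AND).
  search(query) {
    const tokens = tokenize(query);
    if (tokens.length === 0) return [];

    // Start from the rarest token so the intersection stays small.
    const lists = tokens
      .map((token) => this.postings.get(token) || new Set())
      .sort((a, b) => a.size - b.size);

    let candidates = [...lists[0]];
    for (const list of lists.slice(1)) {
      candidates = candidates.filter((key) => list.has(key));
    }

    // Turn "docId#pageNumber" keys back into objects.
    return candidates.map((key) => {
      const sep = key.lastIndexOf('#');
      return { docId: key.slice(0, sep), page: Number(key.slice(sep + 1)) };
    });
  }
}

// Example usage with made-up page text.
const index = new PageIndex();
index.addPage('a.pdf', 1, 'Annual report for 2019');
index.addPage('a.pdf', 2, 'Invoice total: 42 USD');
index.addPage('b.pdf', 7, 'Total invoice count');
console.log(index.search('invoice total'));
```

Because each hit carries a document name and page number, the app can fetch just the matching file from the storage bucket instead of scanning every PDF on each query.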

