
C# Searching PDFs

I'm using iTextSharp to get the content out of a PDF. I want to allow the user to search for PDFs, much like they do on any search engine. The search should return the most relevant results. I have written a library that runs the TF-IDF algorithm over the documents to return relevant results. While this works, I feel like I may be reinventing the wheel.

The user should be able to search well over 50,000 PDFs, so there are a lot of them. I don't want to store the full content of each PDF in my database, as I feel that would be SUPER expensive. To mitigate this, I've written my library so that it accepts a precomputed term frequency distribution when calculating TF-IDF. This allows me to read each PDF once, when it's added to the system, instead of every time a search is performed.
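To make the idea concrete, here is a minimal sketch of scoring a document from stored term counts alone, without re-reading the PDF. It is not my actual library: the `TfIdfSearch` class, the `Score` method, and the dictionary shapes are hypothetical, and it assumes the plain `tf * log(N / df)` weighting rather than any smoothed variant.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: ranks a document using only term counts stored at ingest time.
public static class TfIdfSearch
{
    // termCounts:   term -> occurrences in this document (saved when the PDF was added)
    // docFrequency: term -> number of documents in the corpus that contain the term
    // totalDocs:    number of documents in the corpus
    public static double Score(
        IEnumerable<string> queryTerms,
        IReadOnlyDictionary<string, int> termCounts,
        IReadOnlyDictionary<string, int> docFrequency,
        int totalDocs)
    {
        double totalTerms = termCounts.Values.Sum();
        if (totalTerms == 0) return 0.0;

        double score = 0.0;
        foreach (var term in queryTerms.Distinct())
        {
            // Terms missing from the document or the corpus contribute nothing.
            if (!termCounts.TryGetValue(term, out int count)) continue;
            if (!docFrequency.TryGetValue(term, out int df) || df == 0) continue;

            double tf = count / totalTerms;                 // relative frequency in this document
            double idf = Math.Log((double)totalDocs / df);  // rarer terms weigh more
            score += tf * idf;
        }
        return score;
    }
}
```

With this shape, only the per-document term counts and the corpus-wide document frequencies need to be persisted, which is much smaller than the extracted text.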

Do libraries exist that already do this sort of thing?

Lucene.NET will do what you need.
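As a rough illustration, indexing and searching extracted PDF text might look like the sketch below. It assumes Lucene.NET 4.8 with the `Lucene.Net`, `Lucene.Net.Analysis.Common`, and `Lucene.Net.QueryParser` packages. The field names `path` and `content`, the sample path, and the sample text are made up, and an in-memory `RAMDirectory` stands in for a real on-disk index.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class PdfIndexSketch
{
    private const LuceneVersion AppLuceneVersion = LuceneVersion.LUCENE_48;

    public static void Main()
    {
        using var dir = new RAMDirectory();                      // in-memory index, for illustration only
        using var analyzer = new StandardAnalyzer(AppLuceneVersion);

        using (var writer = new IndexWriter(dir, new IndexWriterConfig(AppLuceneVersion, analyzer)))
        {
            var doc = new Document
            {
                // Stored, so the file can be located from a search hit.
                new StringField("path", "reports/example.pdf", Field.Store.YES),
                // Indexed for search, but the original text is not kept in the index.
                new TextField("content", "text extracted from the PDF", Field.Store.NO)
            };
            writer.AddDocument(doc);
            writer.Commit();
        }

        using var reader = DirectoryReader.Open(dir);
        var searcher = new IndexSearcher(reader);
        var query = new QueryParser(AppLuceneVersion, "content", analyzer).Parse("extracted");

        // Hits come back ordered by relevance score, highest first.
        foreach (var hit in searcher.Search(query, 10).ScoreDocs)
        {
            Console.WriteLine($"{searcher.Doc(hit.Doc).Get("path")} (score {hit.Score})");
        }
    }
}
```

Using `Field.Store.NO` for the content field speaks to the storage concern in the question: the index keeps the terms needed for ranking, not a copy of the full text. The text itself would still come from iTextSharp at the time the PDF is added.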

There are also commercial ones, like our 'SearchUnit'.


 