
C# Searching PDFs

I'm using iTextSharp to get the content out of PDFs. I want to let the user search those PDFs, much like they would on any search engine, and the search should return the most relevant results. I have written a library that runs the TF-IDF algorithm over the documents to rank the results. While this works, I feel like I may be reinventing the wheel.

The user should be able to search well over 50,000 PDFs, so there are a lot of them. I don't want to store the full content of each PDF in my database, as I feel that would be very expensive. To mitigate this, I've written my library so that it accepts a precomputed frequency distribution when calculating TF-IDF. This lets me read each PDF once, when it's added to the system, instead of every time a search is performed.
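To illustrate the idea, here is a minimal sketch of ranking from stored term counts rather than full text. It is not my actual library: `TfIdfSketch`, `TermFrequencies`, and `Rank` are hypothetical names, the tokenizer is deliberately naive, and the weighting (term count divided by document length, times `log(N / df)`) is just one common TF-IDF variant.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TfIdfSketch
{
    private static readonly char[] Separators =
        { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '(', ')' };

    // Ingest step (run once per PDF): reduce already-extracted text to term counts.
    // Only this dictionary needs to be persisted, not the full text.
    public static Dictionary<string, int> TermFrequencies(string text)
    {
        var counts = new Dictionary<string, int>();
        foreach (var token in text.ToLowerInvariant()
                                  .Split(Separators, StringSplitOptions.RemoveEmptyEntries))
        {
            counts[token] = counts.TryGetValue(token, out var n) ? n + 1 : 1;
        }
        return counts;
    }

    // Search step: score every document from its stored counts.
    // Returns (document id, score) pairs, best match first; non-matching documents are omitted.
    public static List<KeyValuePair<string, double>> Rank(
        IReadOnlyDictionary<string, Dictionary<string, int>> corpus, string query)
    {
        // Inverse document frequency for each query term that occurs in the corpus.
        var idf = new Dictionary<string, double>();
        foreach (var term in TermFrequencies(query).Keys)
        {
            int df = corpus.Values.Count(d => d.ContainsKey(term));
            if (df > 0) idf[term] = Math.Log((double)corpus.Count / df);
        }

        var scores = new List<KeyValuePair<string, double>>();
        foreach (var doc in corpus)
        {
            double length = doc.Value.Values.Sum();
            double score = 0;
            foreach (var term in idf)
            {
                if (doc.Value.TryGetValue(term.Key, out var count))
                    score += (count / length) * term.Value;
            }
            if (score > 0) scores.Add(new KeyValuePair<string, double>(doc.Key, score));
        }
        return scores.OrderByDescending(s => s.Value).ToList();
    }
}
```

In this sketch the text from iTextSharp is passed to `TermFrequencies` when a PDF is added, and only the resulting counts are kept. Note that a term appearing in every document gets an IDF of zero with this variant, so it contributes nothing to the score.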

Do libraries exist that already do this sort of thing?

Lucene.NET will do what you need.
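As a rough sketch of what that could look like, assuming Lucene.NET 4.8 (NuGet packages `Lucene.Net`, `Lucene.Net.Analysis.Common`, and `Lucene.Net.QueryParser`) and text already extracted with iTextSharp. The class `PdfIndexSketch` and the field names `path` and `content` are made up for illustration. The content field uses `Field.Store.NO`, so the text is indexed for searching but the full text itself is not stored in the index.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Util;
using LuceneDirectory = Lucene.Net.Store.Directory;

public static class PdfIndexSketch
{
    private const LuceneVersion Version = LuceneVersion.LUCENE_48;

    // Add one PDF's extracted text to the index (run once, when the PDF enters the system).
    public static void Add(LuceneDirectory dir, string path, string extractedText)
    {
        using (var analyzer = new StandardAnalyzer(Version))
        using (var writer = new IndexWriter(dir, new IndexWriterConfig(Version, analyzer)))
        {
            var doc = new Document();
            // Stored as-is so a hit can be mapped back to the file.
            doc.Add(new StringField("path", path, Field.Store.YES));
            // Tokenized and indexed, but the original text is not kept.
            doc.Add(new TextField("content", extractedText, Field.Store.NO));
            writer.AddDocument(doc);
            writer.Commit();
        }
    }

    // Return the paths of the best-matching PDFs, most relevant first.
    public static List<string> Search(LuceneDirectory dir, string userQuery, int maxHits)
    {
        using (var analyzer = new StandardAnalyzer(Version))
        using (var reader = DirectoryReader.Open(dir))
        {
            var searcher = new IndexSearcher(reader);
            var query = new QueryParser(Version, "content", analyzer).Parse(userQuery);
            var paths = new List<string>();
            foreach (var hit in searcher.Search(query, maxHits).ScoreDocs)
            {
                paths.Add(searcher.Doc(hit.Doc).Get("path"));
            }
            return paths;
        }
    }
}
```

For an on-disk index you would pass something like `FSDirectory.Open("index")` as `dir`. Opening a new `IndexWriter` per document keeps the sketch short; for 50,000 PDFs you would more likely keep one writer open and add documents in a batch. As far as I know, the default scoring in the 4.8 line is TF-IDF based, so relevance ranking comes without extra work.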

And there are commercial ones like our 'SearchUnit'.


 