
How to search through thousands of files for text efficiently in real time

I'm working on refactoring a document storage service's site to move from a proprietary storage system to SQL. Everything is going fairly well, but I need a way to search our repository for specific strings of text. We use a multitude of file types (.xls, .xlsx, .doc, .txt, etc.). They're displayed to the user by first converting them to PDF, rebuilt line by line using PDFSharp.

Speed isn't a concern for viewing or searching a single file, but I have concerns about scalability. I was able to build a functioning text search by copying our conversion process and hooking into it, but I'm fairly sure this won't work for searching a customer's entire document list (thousands and thousands of documents). If they were all a uniform file type it might be easier, but they aren't.

Is there an efficient way to do this of which I am unaware?

EDIT: The documents are stored on the server and referenced via document URLs in the DB

My recommendation is to build an index, either in SQL or in a file, that maps each file to all the search terms of interest it contains. Build it once, when a document is uploaded or converted, so a search becomes a lookup instead of a scan of the whole corpus.
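
A minimal sketch of that idea in C#, assuming the plain text is captured during your existing file-to-PDF conversion pass (the DocumentText type, its field names, and the tokenizer below are hypothetical placeholders, not anything from your pipeline):

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;

    // Hypothetical record: one per stored document, holding the text
    // already extracted during the file-to-PDF conversion step.
    public record DocumentText(int DocumentId, string Text);

    public class InvertedIndex
    {
        // term -> set of IDs of documents containing that term
        private readonly Dictionary<string, HashSet<int>> _index =
            new(StringComparer.OrdinalIgnoreCase);

        // Naive tokenizer: split on non-word characters and drop
        // one-character fragments; adjust for your content.
        private static IEnumerable<string> Tokenize(string text) =>
            Regex.Split(text, @"\W+").Where(t => t.Length > 1);

        // Called once per document at upload/conversion time,
        // not at search time.
        public void Add(DocumentText doc)
        {
            foreach (var term in Tokenize(doc.Text))
            {
                if (!_index.TryGetValue(term, out var ids))
                    _index[term] = ids = new HashSet<int>();
                ids.Add(doc.DocumentId);
            }
        }

        // A search is now a dictionary lookup, so its cost does not
        // grow with the number of documents in the corpus.
        public IReadOnlyCollection<int> Search(string term) =>
            _index.TryGetValue(term, out var ids)
                ? (IReadOnlyCollection<int>)ids
                : Array.Empty<int>();
    }

At query time, index.Search("invoice") returns document IDs you can resolve to URLs through the DB. To survive restarts, the same (term, document ID) pairs can be persisted to a two-column SQL table with an index on the term column, which also lets you join hits directly against your document-URL table.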
