
What “big-data” algorithms can I use to analyze similarities between text files?

I would like to build a system that receives a lot of text files (a few new files arrive every 2 minutes) and finds the ones that have at least one sentence in common. What algorithms can I use to do that?

Thank you.

One very simple way would be to parse each text file as you get it and build a database that records every sentence and which documents contain it. That is, you'd have something like:

Sentences table
Key - a unique sequential integer
Hash - a 32-bit or 64-bit hash code created from the sentence
Text - The full sentence text

Files table
Key - a unique sequential integer
Name - the file's name

Associations table
FileKey
SentenceKey

So when you parse a sentence, compute its hash code and query the database for all sentences that have that hash code. There could be more than one, because different sentences can hash to the same value. If no sentence is found, or if you hit a hash collision (i.e., the hashes match but none of the stored sentence texts matches), create a new entry in the Sentences table. Either way, make an entry in the Associations table, saying "this file contains this sentence."
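Here is a minimal sketch of that parse-and-record step, assuming SQLite through Python's built-in sqlite3 module. The sentence splitter, the choice of BLAKE2b for the 64-bit hash, and the function names are my own assumptions for illustration, and I named the key columns SentenceKey and FileKey instead of plain Key to stay clear of SQL keywords.

```python
import hashlib
import re
import sqlite3

def sentence_hash(sentence):
    """64-bit hash of the sentence, as a signed int so it fits an SQLite INTEGER."""
    digest = hashlib.blake2b(sentence.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)

def split_sentences(text):
    """Naive splitter (an assumption): break after . ! or ? and collapse whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [" ".join(p.split()) for p in parts if p.strip()]

def open_db(path=":memory:"):
    """Create the three tables described above if they do not exist yet."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS Sentences (
            SentenceKey INTEGER PRIMARY KEY,
            Hash        INTEGER NOT NULL,
            Text        TEXT NOT NULL
        );
        CREATE INDEX IF NOT EXISTS idx_sentences_hash ON Sentences(Hash);
        CREATE TABLE IF NOT EXISTS Files (
            FileKey INTEGER PRIMARY KEY,
            Name    TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS Associations (
            FileKey     INTEGER NOT NULL,
            SentenceKey INTEGER NOT NULL,
            PRIMARY KEY (FileKey, SentenceKey)
        );
    """)
    return conn

def add_file(conn, name, text):
    """Record one file and return the names of earlier files sharing a sentence."""
    file_key = conn.execute(
        "INSERT INTO Files (Name) VALUES (?)", (name,)
    ).lastrowid
    matches = set()
    for sentence in set(split_sentences(text)):
        h = sentence_hash(sentence)
        # Look up by hash, then compare the text to rule out hash collisions.
        row = conn.execute(
            "SELECT SentenceKey FROM Sentences WHERE Hash = ? AND Text = ?",
            (h, sentence),
        ).fetchone()
        if row is None:
            sentence_key = conn.execute(
                "INSERT INTO Sentences (Hash, Text) VALUES (?, ?)", (h, sentence)
            ).lastrowid
        else:
            sentence_key = row[0]
            # The sentence was seen before: collect the files that contain it.
            others = conn.execute(
                "SELECT f.Name FROM Associations a "
                "JOIN Files f ON f.FileKey = a.FileKey "
                "WHERE a.SentenceKey = ?",
                (sentence_key,),
            )
            matches.update(n for (n,) in others)
        conn.execute(
            "INSERT INTO Associations (FileKey, SentenceKey) VALUES (?, ?)",
            (file_key, sentence_key),
        )
    conn.commit()
    return matches
```

You'd call add_file once per incoming file. The index on Hash is what keeps the lookup cheap; the text comparison only ever runs against the few rows that share a hash.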

You could build your list of files that contain common sentences at the same time you're parsing. All you'd have to do is output the common files any time you find a match.

If you want to query the data later, you can sort the Associations table by SentenceKey, eliminate the sentences that occur in only one file, and you end up with the duplicates.
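As a rough sketch of that later query, again assuming SQLite: rather than sorting and filtering by hand, you can let a GROUP BY ... HAVING clause drop the sentences that occur in only one file. The function name and the sample rows are made up for illustration.

```python
import sqlite3

def shared_sentences(conn):
    """Map each SentenceKey that occurs in 2+ files to its sorted FileKeys."""
    rows = conn.execute("""
        SELECT DISTINCT SentenceKey, FileKey
        FROM Associations
        WHERE SentenceKey IN (
            SELECT SentenceKey
            FROM Associations
            GROUP BY SentenceKey
            HAVING COUNT(DISTINCT FileKey) > 1
        )
        ORDER BY SentenceKey, FileKey
    """)
    result = {}
    for sentence_key, file_key in rows:
        result.setdefault(sentence_key, []).append(file_key)
    return result

# Tiny made-up data set: (FileKey, SentenceKey) pairs.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Associations (FileKey INTEGER, SentenceKey INTEGER)")
demo.executemany(
    "INSERT INTO Associations VALUES (?, ?)",
    [(1, 10), (1, 11), (2, 11), (3, 12), (3, 11), (2, 13), (3, 13)],
)
print(shared_sentences(demo))  # {11: [1, 2, 3], 13: [2, 3]}
```

Sentences 10 and 12 each appear in a single file, so they drop out of the result.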

That's the broad strokes. I glossed over a few implementation details, but there's no real heavy lifting involved.

Also, you don't have to use a DBMS to do this. If you have enough memory, you could do it with in-memory data structures. But the database is quite convenient in that it persists the information and it's designed to do stuff like this.
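For the in-memory route, a minimal sketch could be a dictionary keyed by sentence text. Python's dict already hashes the key and falls back to comparing the full string on a collision, so it stands in for both the Hash column and the text check. The class name and the naive sentence splitter are assumptions of mine.

```python
import re
from collections import defaultdict

def split_sentences(text):
    """Naive splitter (an assumption): break after . ! or ? and collapse whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [" ".join(p.split()) for p in parts if p.strip()]

class SentenceIndex:
    """In-memory stand-in for the Sentences and Associations tables."""

    def __init__(self):
        # sentence text -> set of names of the files that contain it
        self._files_by_sentence = defaultdict(set)

    def add_file(self, name, text):
        """Record a file and return the earlier files that share a sentence with it."""
        matches = set()
        for sentence in set(split_sentences(text)):
            owners = self._files_by_sentence[sentence]
            matches |= owners   # files already known to contain this sentence
            owners.add(name)
        matches.discard(name)   # in case the same file name is added twice
        return matches

index = SentenceIndex()
index.add_file("a.txt", "One. Two.")
print(index.add_file("b.txt", "Two. Three."))  # {'a.txt'}
```

Nothing here survives a restart, which is the trade-off against the database version.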
