简体   繁体   中英

Compare documents by sequence vector

I'm trying to classify documents by sequence vector. Basically, I have a vocabulary (more than 5000 words). Each document is converted to a vector of integer numbers so that each element in the vector corresponds the position of the word in the vocabulary.

For example, if the vocab is [hello, how, are, you, today] and the document is "hello you" then I'll have the vector: [1 4] .
Another document of "how are you" will result in [2 3 4] .

Now what I want is to assess the similarity between the first and the second vector. Here you can see these vectors don't have the same length. Furthermore, comparing directly them may not make sense because they represent sequence of words. This case is different from binary (bag-of-word) vector, which considers the appearance of a word in the document (1 if appear, otherwise 0), and also frequency (word count) vector, which considers frequency of a word in the document with the given vocabulary.
Can you give me a suggestion?

The Jaccard similarity is normally used to compare the similarity of sets (in your case, text). The text is n-grammed (shingled), and then locality sensitive hashing is used to determine their Jaccard similarity.

There is a whole field dedicated to this - Google is your friend!

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM