
How to search for closest tag set match in JavaScript?

I have a set of documents, each annotated with a set of tags, which may contain spaces. The user supplies a set of possibly misspelled tags, and I want to find the documents with the highest number of matching tags (optionally weighted).

There are several thousand documents and tags but at most 100 tags per document.

I am looking for a lightweight and performant solution where the search runs entirely on the client side in JavaScript, although some preprocessing of the index with node.js is possible.

My idea is to build two structures in a node.js preprocessing step and serialize them as JSON files: an inverted index that maps tags to documents, and a fuzzy index that finds the correct spelling of a misspelled tag. In the search step, for each item of the query set I first consult the fuzzy index to get the most likely correct tag and, if one exists, look it up in the inverted index and add the resulting documents to a bag (a multiset that counts occurrences). After doing this for all input tags, the contents of the bag, sorted by count in descending order, should give the best-matching documents.
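To make the idea concrete, here is a minimal sketch of that pipeline in plain JavaScript. The sample documents, the function names (`buildInvertedIndex`, `closestTag`, `search`) and the maximum edit distance of 2 are all assumptions made up for illustration, and the fuzzy step is a naive linear scan over all known tags rather than a real fuzzy index.

```javascript
// Hypothetical sample data; ids and tags are invented for illustration.
const documents = [
  { id: "a", tags: ["machine learning", "python", "statistics"] },
  { id: "b", tags: ["python", "web development"] },
  { id: "c", tags: ["statistics", "r language"] },
];

// Preprocessing (could run in node.js and be serialized as JSON):
// inverted index mapping each tag to the ids of the documents carrying it.
function buildInvertedIndex(docs) {
  const index = Object.create(null); // no prototype, so any tag is a safe key
  for (const doc of docs) {
    for (const tag of doc.tags) {
      (index[tag] = index[tag] || []).push(doc.id);
    }
  }
  return index;
}

// Plain Levenshtein edit distance using two rolling rows.
function editDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Fuzzy step: closest known tag within maxDistance edits, or null if none.
function closestTag(query, knownTags, maxDistance) {
  let best = null;
  let bestDistance = maxDistance + 1;
  for (const tag of knownTags) {
    const d = editDistance(query, tag);
    if (d < bestDistance) {
      best = tag;
      bestDistance = d;
    }
  }
  return best;
}

// Search step: correct each query tag, then count hits per document in a bag.
function search(queryTags, index, maxDistance = 2) {
  const knownTags = Object.keys(index);
  const bag = new Map(); // document id -> number of matched query tags
  for (const q of queryTags) {
    const tag = closestTag(q, knownTags, maxDistance);
    if (tag === null) continue; // no plausible correction for this query tag
    for (const id of index[tag]) {
      bag.set(id, (bag.get(id) || 0) + 1);
    }
  }
  return [...bag.entries()]
    .map(([id, count]) => ({ id, count }))
    .sort((x, y) => y.count - x.count);
}

const invertedIndex = buildInvertedIndex(documents);
const results = search(["pyhton", "statistcs"], invertedIndex);
console.log(results);
```

Weighting could be added by incrementing the bag by a per-tag weight instead of 1. The linear scan over several thousand tags is probably acceptable for a handful of query tags, but that is a guess and would need measuring.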

My Questions

  1. This seems like a common problem. Is there already an implementation for it that I can reuse? I looked at lunr.js and fuse.js, but they seem to have a different focus.
  2. Is this a sensible approach to the problem? Do you see any obvious improvements?
  3. Is it better to keep the fuzzy step separate from the inverted index, or is there a way to combine them?

You should be able to achieve what you want using Lunr. Here is a simplified example (and a jsfiddle):

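// Each document carries an id and an array of tags.
var documents = [{
  id: 1, tags: ["foo", "bar"]
}, {
  id: 2, tags: ["hurp", "durp"]
}]

// Build the index: 'id' identifies a document, 'tags' is the searchable field.
var idx = lunr(function (builder) {
  builder.ref('id')
  builder.field('tags')

  documents.forEach(function (doc) {
    builder.add(doc)
  })
})

// Fuzzy queries: the number after the tilde is the maximum edit distance.
console.log(idx.search("fob~1"))
console.log(idx.search("hurd~2"))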

This takes advantage of a couple of features in Lunr:

  1. If a document field is an array, Lunr assumes the elements are already tokenised. This allows you to index tags that include spaces as-is, i.e. "foo bar" would be treated as a single tag (if this is what you wanted; it wasn't clear from the question).
  2. Fuzzy search is supported, here using the query string format. The number after the tilde is the maximum edit distance. There is some more documentation that goes into the details.

The results will be sorted by how well each document matches the query. In simple terms, documents that contain more matching tags will rank higher.

Is it better to keep the fuzzy step separate from the inverted index or is there a way to combine them?

As ever, it depends. Lunr maintains two data structures, an inverted index and a graph. The graph is used for doing the wildcard and fuzzy matching. It keeps separate data structures to facilitate storing extra information about a term in the inverted index that is unrelated to matching.
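To illustrate the split between the two structures, here is a rough sketch in which a trie stands in for the graph and handles fuzzy matching, while a separate inverted index holds the per-term data. This is not Lunr's actual implementation; the names `buildTrie` and `fuzzyLookup` and the shape of the `invertedIndex` entries are assumptions for illustration only.

```javascript
// Build a trie (a simple term graph) from the list of known terms.
function buildTrie(terms) {
  const root = { children: {}, term: null };
  for (const term of terms) {
    let node = root;
    for (const ch of term) {
      node = node.children[ch] || (node.children[ch] = { children: {}, term: null });
    }
    node.term = term; // marks the end of a complete term
  }
  return root;
}

// Collect every term within maxDistance edits of the query by walking the trie
// and carrying one row of the Levenshtein table per visited node.
function fuzzyLookup(root, query, maxDistance) {
  const q = Array.from(query); // split by code point, like the trie edges
  const matches = [];
  const firstRow = Array.from({ length: q.length + 1 }, (_, j) => j);

  function walk(node, ch, prevRow) {
    const row = [prevRow[0] + 1];
    for (let j = 1; j <= q.length; j++) {
      const cost = q[j - 1] === ch ? 0 : 1;
      row[j] = Math.min(row[j - 1] + 1, prevRow[j] + 1, prevRow[j - 1] + cost);
    }
    if (node.term !== null && row[q.length] <= maxDistance) {
      matches.push({ term: node.term, distance: row[q.length] });
    }
    // Prune: if every cell exceeds the limit, no descendant can match.
    if (Math.min(...row) <= maxDistance) {
      for (const next of Object.keys(node.children)) {
        walk(node.children[next], next, row);
      }
    }
  }

  for (const ch of Object.keys(root.children)) {
    walk(root.children[ch], ch, firstRow);
  }
  return matches;
}

// The inverted index lives separately and can hold arbitrary per-term data.
const invertedIndex = {
  foo: { docs: [1] },
  bar: { docs: [1] },
  hurp: { docs: [2] },
  durp: { docs: [2] },
};

const trie = buildTrie(Object.keys(invertedIndex));
const hits = fuzzyLookup(trie, "hurd", 1);
// Resolve the matched terms against the inverted index to get documents.
const docs = hits.map(function (hit) { return invertedIndex[hit.term].docs; });
console.log(hits, docs);
```

Because the trie only stores terms, the inverted index entries can grow to include extra fields without affecting the matching code.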

Depending on your use case, it would be possible to combine the two. An interesting approach would be a finite state transducer, so long as the data you want to store is simple, e.g. an integer (think document id). There is an excellent article about this data structure, which is similar to what is used in Lunr: http://blog.burntsushi.net/transducers/
