Which DB to choose for finding best matching records?

Question

I'm storing an object in a database described by a lot of integer attributes. The real object is a little bit more complex, but for now let's assume that I'm storing cars in my database. Each car has a lot of integer attributes to describe the car (ie. maximum speed, wheelbase, maximum power etc.) and these are searchable by the user. The user defines a preferred range for each of the objects and since there are a lot of attributes there most likely won't be any car matching all the attribute ranges. Therefore the query has to return a number of cars sorted by the best match.

At the moment I implemented this in MySQL using the following query:

SELECT *, SQRT( POW((a < min_a)*(min_a - a) + (a > max_a)*(a - max_a), 2) +
                POW((b < min_b)*(min_b - b) + (b > max_b)*(b - max_b), 2) +
                ... ) AS match
WHERE a < (min_a - max_allowable_deviation) AND a > (max_a + max_allowable_deviation) AND ...
ORDER BY match ASC

where a and b are attributes of the object and min_a, max_a, min_b and max_b are user defined values. Basically the match is the square root of the sum of the squared differences between the desired range and the real value of the attribute. A value of 0 meaning a perfect match.

The table contains a couple of million records and the WHERE clausule is only introduced to limit the number of records the calculation is performed on. An index is placed on all of the queryable records and the query takes like 500ms. I'd like to improve this number and I'm looking into ways to improve this query.

Furthermore I am wondering whether there would be a different database better suited to perform this job. Moreover I'd very much like to change to a NoSQL database, because of its more flexible data scheme options. I've been looking into MongoDB, but couldn't find a way to solve this problem efficiently (fast).

Is there any database better suited for this job than MySQL?

Answer 1

Take a look at R-trees . (The pages on specific variants go in to a lot more detail and present pseudo code). These data structures allow you to query by a bounding rectangle, which is what your problem of searching by ranges on each attribute is.

Consider your cars as points in n-dimensional space, where n is the number of attributes that describe your car. Then given an ranges, each describing an attribute, the problem is the find all the points contained in that n-dimensional hyperrectangle. R-trees support this query efficiently. MySQL implements R-trees for their spatial data types, but MySQL only supports two-dimensional space, which is insufficient for you. I'm not aware of any common databases that support n-dimensional R-trees off the shelf, but you can take some database with good support for user-defined tree data structures and implement R-trees yourself on top of that. For example, you can define a structure for an R-tree node in MongoDB, with child pointers. You will then implement the R-tree algorithms in your own code while letting MongoDB take care of storing the data.

Also, there's this C++ header file implementing of an R-tree, but currently it's only an in-memory structure. Though if your data set is only a few million rows, it seems feasible to just load this memory structure upon startup and update it whenever new cars are added (which I assume is infrequent).

Answer 2

Text search engines, such as Lucene , meet your requirements very well. They allow you to "boost" hits depending on how they were matched, eg you can define engine size to be considered a "better match" than wheel base. Using lucene is really easy and above all, it's SUPER FAST . Way faster than mysql.

Mysql offer a plugin to provide text-based searching, but I prefer to use it separately, that way it's easily scalable (being read-only, you can have multiple lucene engines), and easily manageable.

Also check out Solr , which sits on top of lucene and allows you to store, retrieve and search for simple java object (Lists, arrays etc).

Answer 3

Likely, your indexes aren't helping much, and I can't think of another database technology that's going to be significantly better. A few things to try with MySQL....

I'd try putting a copy of the data in a memory table. At least the table scans will be in memory.... http://dev.mysql.com/doc/refman/5.0/en/memory-storage-engine.html

If that doesn't work for you or help much, you could also try a User Defined Function to optimize the calculation of the matching. Basically, this means executing the range testing in a C library you provide:

http://dev.mysql.com/doc/refman/5.0/en/adding-functions.html

Which DB to choose for finding best matching records?

Question

3 answers

solution1
4 ACCPTED 2011-07-24 02:10:02

solution2
2 2011-07-24 01:30:34

solution3
1 2011-07-24 03:32:37

Which DB to choose for finding best matching records?

Question

3 answers

solution1 4 ACCPTED 2011-07-24 02:10:02

solution2 2 2011-07-24 01:30:34

solution3 1 2011-07-24 03:32:37

solution1
4 ACCPTED 2011-07-24 02:10:02

solution2
2 2011-07-24 01:30:34

solution3
1 2011-07-24 03:32:37