
Which DB to choose for finding best matching records?

I'm storing an object in a database described by a lot of integer attributes. The real object is a bit more complex, but for now let's assume that I'm storing cars in my database. Each car has many integer attributes that describe it (e.g. maximum speed, wheelbase, maximum power) and these are searchable by the user. The user defines a preferred range for each of the attributes, and since there are a lot of attributes, most likely no car will match all the attribute ranges. Therefore the query has to return a number of cars sorted by best match.

At the moment I have implemented this in MySQL using the following query:

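SELECT *, SQRT( POW((a < min_a)*(min_a - a) + (a > max_a)*(a - max_a), 2) +
                POW((b < min_b)*(min_b - b) + (b > max_b)*(b - max_b), 2) +
                ... ) AS `match`  -- MATCH is a reserved word in MySQL, so the alias is quoted
FROM cars  -- table name assumed; the original query omitted the FROM clause
-- keep only rows that lie within max_allowable_deviation of each preferred range
WHERE a > (min_a - max_allowable_deviation) AND a < (max_a + max_allowable_deviation) AND ...
ORDER BY `match` ASC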

where a and b are attributes of the object and min_a, max_a, min_b and max_b are user-defined values. Basically, the match is the square root of the sum of the squared differences between the desired range and the real value of the attribute. A value of 0 means a perfect match.
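To make the scoring concrete, here is a minimal Python sketch of the same calculation outside the database. The function names, attribute names and sample values are made up for illustration; only the formula comes from the query above.

```python
import math

def range_distance(value, lo, hi):
    """Distance from value to the closed range [lo, hi]; 0 if inside."""
    if value < lo:
        return lo - value
    if value > hi:
        return value - hi
    return 0

def match_score(car, ranges):
    """Square root of the summed squared distances to each preferred range."""
    return math.sqrt(sum(
        range_distance(car[attr], lo, hi) ** 2
        for attr, (lo, hi) in ranges.items()
    ))

# Hypothetical data, for illustration only.
ranges = {"max_speed": (180, 220), "wheelbase": (2600, 2800)}
inside = {"max_speed": 200, "wheelbase": 2700}
outside = {"max_speed": 177, "wheelbase": 2804}

print(match_score(inside, ranges))   # 0.0 (perfect match)
print(match_score(outside, ranges))  # 5.0 (misses by 3 and 4)
```

Geometrically, the score is the Euclidean distance from the car to the nearest point of the preferred hyperrectangle.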

The table contains a couple of million records, and the WHERE clause is only there to limit the number of records the calculation is performed on. An index is placed on all of the queryable attributes and the query takes about 500 ms. I'd like to improve this number and I'm looking into ways to improve this query.
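The filter-then-rank idea can be sketched in plain Python as follows. This assumes the WHERE clause is meant to keep only records whose attributes lie within max_allowable_deviation of the preferred range; all names and sample records are hypothetical.

```python
import heapq
import math

def within_deviation(car, ranges, max_dev):
    """Cheap pre-filter: every attribute lies inside its widened range."""
    return all(lo - max_dev < car[a] < hi + max_dev
               for a, (lo, hi) in ranges.items())

def score(car, ranges):
    """Distance to the preferred ranges; 0 means a perfect match."""
    return math.sqrt(sum(
        max(lo - car[a], 0, car[a] - hi) ** 2
        for a, (lo, hi) in ranges.items()
    ))

def best_matches(cars, ranges, max_dev, k):
    """Score only the pre-filtered candidates and keep the k best."""
    candidates = (c for c in cars if within_deviation(c, ranges, max_dev))
    return heapq.nsmallest(k, candidates, key=lambda c: score(c, ranges))

# Hypothetical records, for illustration only.
cars = [
    {"id": 1, "a": 10, "b": 50},   # inside both ranges
    {"id": 2, "a": 14, "b": 50},   # misses a by 2
    {"id": 3, "a": 40, "b": 50},   # too far off, filtered out
    {"id": 4, "a": 11, "b": 58},   # misses b by 3
]
ranges = {"a": (8, 12), "b": (45, 55)}
top = best_matches(cars, ranges, max_dev=5, k=2)
print([c["id"] for c in top])  # [1, 2]
```

Using heapq.nsmallest avoids sorting every candidate when only the top few results are needed.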

Furthermore, I am wondering whether a different database would be better suited to this job. I'd very much like to change to a NoSQL database because of its more flexible data schema options. I've been looking into MongoDB, but couldn't find a way to solve this problem efficiently (fast).

Is there any database better suited for this job than MySQL?

Take a look at R-trees. (The pages on specific variants go into a lot more detail and present pseudocode.) These data structures allow you to query by a bounding rectangle, which is exactly what your problem of searching by ranges on each attribute is.

Consider your cars as points in n-dimensional space, where n is the number of attributes that describe your car. Then, given n ranges, each describing an attribute, the problem is to find all the points contained in that n-dimensional hyperrectangle. R-trees support this query efficiently. MySQL implements R-trees for its spatial data types, but MySQL only supports two-dimensional space, which is insufficient for you. I'm not aware of any common databases that support n-dimensional R-trees off the shelf, but you can take a database with good support for user-defined tree data structures and implement R-trees yourself on top of that. For example, you can define a structure for an R-tree node in MongoDB, with child pointers. You will then implement the R-tree algorithms in your own code while letting MongoDB take care of storing the data.
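As a rough illustration of how such a tree prunes the search, here is a simplified Python sketch. It is not a full R-tree: there is no insertion or node-splitting logic, and the tree is bulk-loaded once by halving the points along one dimension per level. Each node stores the minimum bounding rectangle of its subtree, and the search skips any node whose rectangle does not intersect the query. The node layout and all names are assumptions made for this example.

```python
def bbox(points):
    """Minimum bounding rectangle of a list of points, as (lows, highs)."""
    dims = range(len(points[0]))
    return (tuple(min(p[d] for p in points) for d in dims),
            tuple(max(p[d] for p in points) for d in dims))

def build(points, leaf_size=4, depth=0):
    """Bulk-load a static tree, splitting on one dimension per level."""
    node = {"box": bbox(points), "points": None, "children": None}
    if len(points) <= leaf_size:
        node["points"] = points          # leaf: store the points directly
        return node
    d = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[d])
    mid = len(ordered) // 2
    node["children"] = [build(ordered[:mid], leaf_size, depth + 1),
                        build(ordered[mid:], leaf_size, depth + 1)]
    return node

def intersects(box, lows, highs):
    """True if the node's rectangle overlaps the query rectangle."""
    return all(bl <= h and bh >= l
               for bl, bh, l, h in zip(box[0], box[1], lows, highs))

def search(node, lows, highs):
    """Return all points inside the query hyperrectangle [lows, highs]."""
    if not intersects(node["box"], lows, highs):
        return []                        # prune: nothing below can match
    if node["points"] is not None:
        return [p for p in node["points"]
                if all(l <= x <= h for x, l, h in zip(p, lows, highs))]
    found = []
    for child in node["children"]:
        found.extend(search(child, lows, highs))
    return found

# Hypothetical 3-attribute cars laid out on a grid, for illustration only.
points = [(x, y, z) for x in range(0, 50, 10)
                    for y in range(0, 50, 10)
                    for z in range(0, 50, 10)]
tree = build(points)
hits = sorted(search(tree, lows=(10, 10, 10), highs=(20, 25, 10)))
print(hits)  # [(10, 10, 10), (10, 20, 10), (20, 10, 10), (20, 20, 10)]
```

To get best matches rather than exact matches, one option is to widen the query rectangle by the maximum allowable deviation and rank the returned points by distance.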

Also, there's this C++ header file implementing an R-tree, but currently it's only an in-memory structure. Though if your data set is only a few million rows, it seems feasible to just load this in-memory structure upon startup and update it whenever new cars are added (which I assume is infrequent).

Text search engines, such as Lucene, meet your requirements very well. They allow you to "boost" hits depending on how they were matched, e.g. you can define engine size to be considered a "better match" than wheelbase. Using Lucene is really easy and, above all, it's SUPER FAST. Way faster than MySQL.
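The effect of boosting can be mimicked in the distance score from the question by giving each attribute a weight. The sketch below is plain Python and does not use Lucene's API; the weights, attribute names and sample cars are hypothetical.

```python
import math

def weighted_score(car, ranges, weights):
    """Distance to the preferred ranges, with per-attribute weights."""
    total = 0.0
    for attr, (lo, hi) in ranges.items():
        miss = max(lo - car[attr], 0, car[attr] - hi)
        total += weights.get(attr, 1.0) * miss ** 2   # default weight is 1.0
    return math.sqrt(total)

# Hypothetical data, for illustration only.
ranges = {"engine_size": (1800, 2200), "wheelbase": (2600, 2800)}
weights = {"engine_size": 4.0}   # missing on engine size costs more
car_x = {"engine_size": 2210, "wheelbase": 2700}   # misses engine size by 10
car_y = {"engine_size": 2000, "wheelbase": 2810}   # misses wheelbase by 10

print(weighted_score(car_x, ranges, weights))  # 20.0
print(weighted_score(car_y, ranges, weights))  # 10.0
```

Both cars miss one range by the same amount, but the car that misses on engine size ranks lower because that attribute carries more weight.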

MySQL offers a plugin to provide text-based searching, but I prefer to use it separately; that way it's easily scalable (being read-only, you can have multiple Lucene engines) and easily manageable.

Also check out Solr, which sits on top of Lucene and allows you to store, retrieve and search for simple Java objects (lists, arrays etc.).

Likely, your indexes aren't helping much, and I can't think of another database technology that's going to be significantly better. A few things to try with MySQL:

I'd try putting a copy of the data in a memory table. At least the table scans will be in memory: http://dev.mysql.com/doc/refman/5.0/en/memory-storage-engine.html

If that doesn't work for you or doesn't help much, you could also try a user-defined function to optimize the calculation of the match. Basically, this means executing the range testing in a C library you provide:

http://dev.mysql.com/doc/refman/5.0/en/adding-functions.html

