
PHP/MySQL - find items that have similar or matching properties

Original post: 2011-04-22 07:50:27. Tags: php, mysql, compare

I'm trying to develop a way of taking an entity with a number of properties and searching the database for similar entities (matching as many of the properties as possible, in the correct order). The idea is that it would then return a percentage showing how similar each entity is.

The order of the properties should also be taken into account: properties at the beginning are more important than the ones at the end.

For example:

Item 1 - A, B, C, D, E

Item 2 - A, B, C, D, E

Would be a 100% match.

Item 1 - A, B, C, D, E

Item 2 - B, C, A, D, E

This wouldn't be a perfect match, as the properties are in a different order.

Item 1 - A, B, C, D, E

Item 2 - F, G, H, I, A

Would be a low match, as only one property is the same and it is in position 5.

This algorithm will run over thousands and thousands of records, so it needs to be fast and efficient. Any thoughts as to how I could do this in PHP/MySQL?

I was considering levenshtein, but as far as I can tell that would also measure the spelling distance between two completely different words. It doesn't appear to be ideal for this scenario, unless I'm just using it in the wrong way.

It might be possible to do this solely in MySQL, perhaps using a full text search or something.

This seems like a nice solution, though not designed for this scenario. Perhaps binary comparison could be used in some way?

2 Answers

What I'd do is encode the order and property value into a number. Numbers have the advantage of fast comparisons.

This is a general idea and may still need some work, but I hope it helps in some way.

Calculate a number (some form of hash) for each property and multiply it by a number representing the position at which the property appears in the item.

Say item1 has 3 properties: A, B and C.

hash(A) = 123, hash(B) = 345, hash(C) = 456

Then multiply each hash by a factor for its order of appearance, given that we have a known number of properties:

(hash(A) * 1,000,000) + (hash(B) * 1,000) + (hash(C) * 1) = someval

With the hashes above, someval is 123,345,456. The magnitude of the multiplier can be tweaked to reflect your data set. You'll have to identify a suitable hash function. soundex maybe?
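The packing step above can be sketched roughly as follows. This is only an illustration, written in Python for brevity (the logic should port to PHP almost line for line). `prop_hash`, `encode_item` and `BASE` are hypothetical names, CRC32 is just a stand-in for whatever hash you settle on, and the sketch assumes every property hash is smaller than the multiplier step.

```python
import zlib

BASE = 1000  # assumed multiplier step; every property hash must be < BASE


def prop_hash(prop):
    """Hypothetical property hash: CRC32 reduced into the range 0..BASE-1."""
    return zlib.crc32(prop.encode("utf-8")) % BASE


def encode_item(props, hash_fn=prop_hash, base=BASE):
    """Pack the ordered property hashes into one integer.

    The first property ends up in the most significant slot, so it
    dominates a plain numeric comparison.
    """
    code = 0
    for prop in props:
        h = hash_fn(prop)
        if not 0 <= h < base:
            raise ValueError("hash does not fit into one slot")
        code = code * base + h
    return code


# Toy hash table taken from the example: hash(A)=123, hash(B)=345, hash(C)=456
toy = {"A": 123, "B": 345, "C": 456}
someval = encode_item(["A", "B", "C"], hash_fn=toy.__getitem__)
```

Python integers are unbounded, so the sketch never overflows. A PHP or MySQL version would have to keep the number of properties times the digits per slot within the native integer range.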

The problem is now reduced to a question of uniqueness because of hash collisions, but we can be pretty sure about properties that don't match: two different hashes always mean two different properties.

Also, this makes it relatively easy to check whether a property appears in another item in a different order, by using the magnitude of the multiplier to extract the hash value from the generated number.
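Extracting the per-property hashes again is repeated division by the multiplier step. A minimal sketch, assuming the item code was packed with a fixed step `base` and a known property count (`decode_item` and `position_of` are hypothetical helper names, shown in Python for brevity):

```python
def decode_item(code, count, base=1000):
    """Split a packed code back into `count` hashes, first position first."""
    hashes = []
    for _ in range(count):
        # The remainder is the least significant slot, i.e. the last property.
        code, h = divmod(code, base)
        hashes.append(h)
    hashes.reverse()
    return hashes


def position_of(code, count, target_hash, base=1000):
    """Return the 0-based position of target_hash in the item, or -1 if absent."""
    hashes = decode_item(code, count, base)
    return hashes.index(target_hash) if target_hash in hashes else -1
```

For example, `decode_item(123345456, 3)` gives back the three slot values 123, 345 and 456, and `position_of` then tells you where a given hash sits in the other item.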

HTH.

Edit: example for checking matches

Given item1(abc) and item2(abc), the computed hashes of the items would be equal. This is the best-case scenario; no further computation is required.

Given item1(abc) and item2(dea), the computed hashes of the items are not equal, so proceed to breaking down the property hashes.

Say the hash table for the properties is a = 1, b = 2, c = 3, d = 4, e = 5, with 10^n as the multiplier. The computed hash for item1 is 123 and for item2 is 451. Break each computed hash down into its per-property hashes, so item1 becomes item1(1 2 3) and item2 becomes item2(4 5 1), compare every property of item1 against every property of item2, and then compute the score.

Another way of looking at it: you are still comparing the properties one by one, except this time you're working with numbers instead of the actual string values.
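The answer leaves the scoring formula open, so the sketch below fills it in with one assumed choice, not a definitive one: position i of n carries weight n - i (earlier properties count more), and a property found at a different position j keeps only the fraction 1 - |i - j| / n of its weight. `decode` and `similarity` are hypothetical names, properties are assumed to be unique within an item, and Python is used for brevity.

```python
def decode(code, count, base=10):
    """Split a packed item code into its per-position hashes."""
    hashes = []
    for _ in range(count):
        code, h = divmod(code, base)
        hashes.append(h)
    return hashes[::-1]


def similarity(code1, code2, count, base=10):
    """Percentage similarity of two packed items (assumed weighting scheme)."""
    if count <= 0:
        raise ValueError("count must be positive")
    if code1 == code2:
        return 100.0  # best case: identical codes, nothing to break down
    first = decode(code1, count, base)
    second = decode(code2, count, base)
    where = {h: j for j, h in enumerate(second)}  # hash -> position in item 2
    earned = 0.0
    total = 0.0
    for i, h in enumerate(first):
        weight = count - i  # earlier positions weigh more
        total += weight
        j = where.get(h)
        if j is not None:
            # Full weight when in place, less the further the property moved.
            earned += weight * (1 - abs(i - j) / count)
    return 100.0 * earned / total
```

With the hash table above, `similarity(123, 451, 3)` comes out at roughly 16.7, because only `a` is shared and it moved from the first slot to the last. The dictionary lookup keeps the comparison linear in the number of properties instead of trying every combination.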

You can draw inspiration (or flat-out algorithms) from various sequence alignment algorithms like Smith-Waterman. Indeed, what you're describing very much seems to be sequence alignment. I am, however, uncertain whether it's even possible to do this as an SQL query.
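To make the suggestion concrete, here is a minimal sketch of Smith-Waterman local alignment applied to property lists instead of characters, in Python for brevity. The scoring values (match +2, mismatch -1, gap -1) and the normalisation to a percentage are assumptions for illustration, and the function names are hypothetical. Note that plain Smith-Waterman rewards properties that keep their relative order but does not by itself weight earlier positions more heavily.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best Smith-Waterman local alignment score between two sequences."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(
                0,                      # start a fresh local alignment
                h[i - 1][j - 1] + step, # align a[i-1] with b[j-1]
                h[i - 1][j] + gap,      # skip an element of a
                h[i][j - 1] + gap,      # skip an element of b
            )
            best = max(best, h[i][j])
    return best


def alignment_percent(a, b, match=2):
    """Score relative to a perfect match of the longer list (assumed scale)."""
    if not a or not b:
        return 0.0
    return 100.0 * local_alignment_score(a, b, match=match) / (match * max(len(a), len(b)))
```

On the three examples from the question this gives 100 for the identical lists, 70 for B, C, A, D, E and 20 for F, G, H, I, A. The table costs time proportional to the product of the two list lengths per pair, so over thousands of records it is probably better used to rank candidates that a cheaper SQL filter has already narrowed down.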