
What mechanism can be used to quantify similarity between non-numeric lists?

Original post: 2017-06-16 16:05:46 | Tags: cluster-analysis, phash, ssim, bigdata, nosql

I have a database of recipes, each of which is essentially structured as a list of ingredients and their associated quantities. Given a recipe, how would you identify similar recipes while allowing for variations and omissions? For example, using milk instead of water, or honey instead of sugar, or entirely omitting something that is only there for flavour.

The current strategy is to do multiple inner joins for combinations of the main ingredients, but this can be exceedingly slow with a large database. Is there another way to do this? Something equivalent to perceptual hashing would be ideal!

1 Answer

How about cosine similarity?

This technique is commonly used in machine learning as a similarity measure for text. With it, you can calculate how close two texts are (actually, how close any two vectors are) from the angle between them, which can be interpreted as how alike those texts are: the smaller the angle, and so the closer the cosine is to 1, the more alike they are. For recipes, each recipe can be treated as a vector with one dimension per ingredient and the quantity as the value.
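As a minimal sketch of how this could apply to recipes, the snippet below treats each recipe as a sparse vector stored in a dict mapping ingredient name to quantity. The recipe data, the ingredient names, and the helper name `cosine_similarity` are all made up for illustration, and the quantities are assumed to be in a common unit.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    # Only ingredients present in both recipes contribute to the dot product.
    dot = sum(qty * b[name] for name, qty in a.items() if name in b)
    norm_a = math.sqrt(sum(q * q for q in a.values()))
    norm_b = math.sqrt(sum(q * q for q in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty recipe is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical recipes: ingredient -> quantity (assumed to be in grams).
pancakes = {"flour": 200, "milk": 300, "egg": 100, "sugar": 30}
crepes = {"flour": 200, "water": 300, "egg": 100, "honey": 30}
salad = {"lettuce": 150, "tomato": 100, "olive oil": 20}

for name, other in [("crepes", crepes), ("salad", salad)]:
    print(name, round(cosine_similarity(pancakes, other), 3))
```

In this sketch the substituted ingredients (milk and water, sugar and honey) count as unrelated dimensions, so they lower the score rather than being recognised as swaps. One possible refinement, again an assumption rather than part of the answer, is to map interchangeable ingredients to a shared category before building the vectors.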

Take a look at this great question that explains cosine similarity in a simple way. In general, you could use any similarity measure to obtain a distance for comparing your recipes. This article discusses different similarity measures; you can check it out if you wish to know more.
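To illustrate the point that other measures can be swapped in, here is a small sketch using Jaccard similarity, which is not named in the answer and is chosen here only as an example. It ignores quantities and compares only which ingredients are present. The function name and the ingredient lists are hypothetical.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty lists are considered identical
    return len(a & b) / len(a | b)


# Hypothetical ingredient lists, quantities omitted.
recipe_a = ["flour", "milk", "egg", "sugar"]
recipe_b = ["flour", "water", "egg", "honey"]

# Two shared ingredients out of six distinct ones.
print(jaccard_similarity(recipe_a, recipe_b))
```

Because it works on sets, this measure tolerates omissions naturally, but it cannot tell a pinch of sugar from a cup of it, which is where a quantity-aware measure such as cosine similarity may fit better.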