Suppose I have 2 vectors. What algorithms can I use to compare them?

Question

Company 1 has this vector:

['books','video','photography','food','toothpaste','burgers'] ... ...

Company 2 has this vector:

['video','processor','photography','LCD','power supply', 'books'] ... ...

Suppose this is a frequency distribution (I could make it a tuple, but that is too much to type).
As you can see, these vectors have items that overlap. "video" and "photography" seem to be "similar" between the two vectors because they are in similar positions. And "books" is obviously a strong point for company 1. Ordering and positioning do matter, as this is a frequency distribution.

What algorithms could you use to play around with this? What algorithms could provide valuable data for these companies, using these vectors?

I am new to text mining and information retrieval. Could someone guide me on those topics in relation to this question?

Answer 1

I would suggest a book called Programming Collective Intelligence.
It's a very nice book on how you can retrieve information from simple data like this. There are code examples included (in Python :)

Edit: Just replying to gbjbaanb: This is Python!

a = ['books','video','photography','food','toothpaste','burgers']
b = ['video','processor','photography','LCD','power supply', 'books']
a = set(a)
b = set(b)

# Items both companies share (intersection is symmetric)
a.intersection(b)
    # {'photography', 'books', 'video'}

b.intersection(a)
    # {'photography', 'books', 'video'}

# Items only company 2 has
b.difference(a)
    # {'LCD', 'power supply', 'processor'}

# Items only company 1 has
a.difference(b)
    # {'food', 'toothpaste', 'burgers'}

Answer 2

If position is very important, as you emphasize, then the crucial metric will be based on the difference in position between the same items in the two vectors (you can, for example, sum the absolute values of the differences, or their squares). The big issue that needs to be solved is how much to weigh an item that is present (say it's the N-th one) in one vector and completely absent from the other. Is that a relatively minor issue, as if the missing item were actually present right after the actual ones, for example, or a really, really big deal? That's impossible to say without more understanding of the actual application area. You can try various ways to deal with this issue and see what results they give on example cases you care about!

For example, suppose "a missing item is roughly the same as if it were present, right after the actual ones". Then you can preprocess each input vector into a dict mapping item to position (a crucial optimization if you have to compare many pairs of input vectors!):

def makedict(avector):
  return dict((item, i) for i, item in enumerate(avector))

and then, to compare two such dicts:

def comparedicts(d1, d2):
  allitems = set(d1) | set(d2)      
  distances = [d1.get(x, len(d1)) - d2.get(x, len(d2)) for x in allitems]
  return sum(d * d for d in distances)

(or abs(d) instead of the squaring in the last statement). To make missing items weigh more (so that the dicts, i.e. the vectors, are considered further apart), you could use twice the lengths instead of just the lengths, or some large constant such as 100, in an otherwise similarly structured program.
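A minimal sketch of how the missing-item weighting described above might be made adjustable. The names `make_positions`, `position_distance`, `missing_factor` and `power` are illustrative assumptions and not part of the answer's code; with `missing_factor=1` and `power=2` the result matches `comparedicts`.

```python
def make_positions(vector):
    """Map each item to its 0-based position in the vector."""
    return {item: i for i, item in enumerate(vector)}


def position_distance(d1, d2, missing_factor=1, power=2):
    """Sum of |position difference| ** power over all items in either dict.

    An item missing from a dict is treated as sitting at position
    missing_factor * len(dict). That default is an assumption; choose
    whatever penalty suits your data.
    """
    default1 = missing_factor * len(d1)
    default2 = missing_factor * len(d2)
    all_items = set(d1) | set(d2)
    return sum(
        abs(d1.get(x, default1) - d2.get(x, default2)) ** power
        for x in all_items
    )


a = make_positions(['books', 'video', 'photography', 'food', 'toothpaste', 'burgers'])
b = make_positions(['video', 'processor', 'photography', 'LCD', 'power supply', 'books'])

print(position_distance(a, b))                    # squared differences: 78
print(position_distance(a, b, power=1))           # absolute differences: 22
print(position_distance(a, b, missing_factor=2))  # missing items pushed further away: 486
```

Trying a few values of `missing_factor` on pairs you already have an opinion about is one way to see which penalty matches your intuition.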

Answer 3

Take a look at the Hamming distance.

Answer 4

As mbg mentioned, the Hamming distance is a good start. It basically assigns a bitmask over every possible item, with each bit marking whether the item is contained in the company's vector.

E.g. toothpaste is 1 for company A, but 0 for company B. You then count the bits which differ between the companies. The Jaccard coefficient is related to this.
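As a rough illustration of both measures on the vectors from the question, here is a small sketch. The function names are made up for this example, and treating each company as a plain set ignores the ordering, which is exactly what a bitmask does.

```python
def hamming_distance(items1, items2):
    """Number of items present in exactly one of the two collections.

    This equals the count of differing bits when each company is encoded
    as a 0/1 vector over all possible items.
    """
    return len(set(items1) ^ set(items2))


def jaccard_coefficient(items1, items2):
    """Shared items divided by all distinct items (1.0 means identical sets)."""
    s1, s2 = set(items1), set(items2)
    union = s1 | s2
    if not union:
        return 1.0  # convention chosen here for two empty sets
    return len(s1 & s2) / len(union)


company1 = ['books', 'video', 'photography', 'food', 'toothpaste', 'burgers']
company2 = ['video', 'processor', 'photography', 'LCD', 'power supply', 'books']

print(hamming_distance(company1, company2))     # 6 items differ
print(jaccard_coefficient(company1, company2))  # 3 shared / 9 distinct
```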

Hamming distance will not actually be able to capture similarity between things like "video" and "photography". Obviously, a company that sells one is more likely to also sell the other than a company that sells toothpaste is.

For this, you can use something like LSI (it's also used for dimensionality reduction) or factorial codes (e.g. neural network approaches such as Restricted Boltzmann Machines, Autoencoders or Predictability Minimization) to get more compact representations, which you can then compare using the Euclidean distance.

Answer 5

Pick the rank of each entry (higher rank is better) and take the sum of geometric means between matches.

For two vectors:

sum(sqrt(vector_multiply(x,y)))  //multiply matches

The sum of ranks over each vector should be the same for every vector (preferably 1). That way you can make comparisons between more than 2 vectors.

If you apply ikkebr's method, you can find how similar a is to b.

In that case just use:

sum( b( b.intersection(a) ))
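One possible reading of the two formulas above, sketched in Python. The linear rank weights (n, n-1, ..., 1, scaled to sum to 1) are an assumption made for this example, as are the names `rank_weights`, `geometric_mean_similarity` and `shared_weight`; the answer does not say how ranks should be assigned.

```python
from math import sqrt


def rank_weights(vector):
    """Give earlier items larger weights, normalized so the weights sum to 1.

    Linear weights (n, n-1, ..., 1) are an assumption of this sketch.
    """
    n = len(vector)
    total = n * (n + 1) / 2
    return {item: (n - i) / total for i, item in enumerate(vector)}


def geometric_mean_similarity(w1, w2):
    """Sum of sqrt(w1[x] * w2[x]) over items present in both vectors."""
    return sum(sqrt(w1[x] * w2[x]) for x in set(w1) & set(w2))


def shared_weight(w, other):
    """Total weight in w carried by items that also appear in other."""
    return sum(w[x] for x in set(w) & set(other))


a = rank_weights(['books', 'video', 'photography', 'food', 'toothpaste', 'burgers'])
b = rank_weights(['video', 'processor', 'photography', 'LCD', 'power supply', 'books'])

print(geometric_mean_similarity(a, b))  # between 0 (no overlap) and 1 (identical)
print(shared_weight(b, a))              # share of b's weight on items a also has
print(shared_weight(a, b))              # share of a's weight on items b also has
```

Because each vector's weights sum to 1, the similarity of a vector with itself is 1, which is what makes scores comparable across more than two vectors.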

Answer 6

You could use the set_intersection algorithm. The 2 vectors must be sorted first (use a sort call), then pass in 4 iterators and you'll get a collection back with the common elements inserted into it. There are a few others that operate similarly.
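The answer appears to refer to the C++ standard library's set_intersection. To stay with the Python used elsewhere in this thread, here is a sketch of the same sorted-merge idea; `sorted_intersection` is a name invented for this example, not a library function.

```python
def sorted_intersection(xs, ys):
    """Return the common elements of two sorted lists, in sorted order.

    Walks both lists once with two indices, advancing whichever side
    currently holds the smaller element.
    """
    result = []
    i = j = 0
    while i < len(xs) and j < len(ys):
        if xs[i] < ys[j]:
            i += 1
        elif ys[j] < xs[i]:
            j += 1
        else:
            result.append(xs[i])
            i += 1
            j += 1
    return result


a = sorted(['books', 'video', 'photography', 'food', 'toothpaste', 'burgers'])
b = sorted(['video', 'processor', 'photography', 'LCD', 'power supply', 'books'])

print(sorted_intersection(a, b))  # ['books', 'photography', 'video']
```

Note that sorting throws away the original positions, so this only tells you which items overlap, not where they rank.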

6 answers

Answer 1
3 2009-11-26 22:48:30

Answer 2
3 Accepted 2009-11-27 00:18:47

Answer 3
2 2009-11-26 22:44:21

Answer 4
0 2009-11-26 22:59:01

Answer 5
0 2009-11-27 00:32:07

Answer 6
-1 2009-11-26 22:48:09