Pairwise distance calculation (multidimensional matrix) for feature similarity
OK, here is the formula in MATLAB:
function D = dumDistance(X,Y)
    n1 = size(X,2);
    n2 = size(Y,2);
    D = zeros(n1,n2);
    for i = 1:n1
        for j = 1:n2
            D(i,j) = sum((X(:,i)-Y(:,j)).^2);
        end
    end
end
Credits here (I know it's not a fast implementation, but it illustrates the basic algorithm).
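For illustration only (this is not part of the original post), the same loop-based computation can be sketched in NumPy, keeping the column-per-word layout:

```python
import numpy as np

def dum_distance(X, Y):
    """Squared Euclidean distance between every column of X and every column of Y."""
    n1 = X.shape[1]
    n2 = Y.shape[1]
    D = np.zeros((n1, n2))
    for i in range(n1):
        for j in range(n2):
            # Same formula as the MATLAB version: sum of squared component differences.
            D[i, j] = np.sum((X[:, i] - Y[:, j]) ** 2)
    return D

# Example with the dimensions from the question: 140-dim words,
# 40 on the page, 100 in the dictionary (random data for illustration).
rng = np.random.default_rng(0)
page = rng.standard_normal((140, 40))
dictionary = rng.standard_normal((140, 100))
D = dum_distance(page, dictionary)
print(D.shape)  # (40, 100)
```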
Now here is my understanding problem:
Say that we have a matrix dictionary=140x100 of words, and a matrix page=140x40 of words. Each column represents a word in the 140-dimensional space.
Now, if I use dumDistance(page,dictionary), it will return a 40x100 matrix with the distances.
What I want to achieve is to find how close each word of the page matrix is to the dictionary matrix, in order to represent the page in terms of the dictionary, with a histogram let's say.
I know that if I take the min over this 40x100 matrix, I'll get a matrix with the locations of the minimum values to represent my histogram.
What I really can't understand here is this 40x100 matrix. What data does this matrix represent anyway? I can't visualize it in my mind.
Minor comment before I start: you should really use pdist2 instead. It is much faster and you'll get the same results as dumDistance. In other words, you would call it like this:
D = pdist2(page.', dictionary.');
You need to transpose page and dictionary because pdist2 assumes that each row is an observation, while each column corresponds to a variable / feature. Your data is structured such that each column is an observation. This will return a 40 x 100 matrix like what you see in dumDistance. However, pdist2 does not use for loops.
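For readers following along in Python rather than MATLAB, SciPy's cdist plays the same role as pdist2, including the one-observation-per-row convention (this sketch is an illustration, not part of the original answer):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Random stand-ins for the question's data: columns are 140-dim words.
rng = np.random.default_rng(2)
page = rng.standard_normal((140, 40))
dictionary = rng.standard_normal((140, 100))

# Like pdist2, cdist expects one observation per ROW, so transpose first.
D = cdist(page.T, dictionary.T)  # Euclidean distances
print(D.shape)  # (40, 100)
```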
Now onto your question:
D(i,j) represents the Euclidean squared distance between word i from your page and word j from your dictionary. You have 40 words on your page and 100 words in your dictionary. Each word is represented by a 140-dimensional feature vector, and so the rows of D index the words of page while the columns of D index the words of dictionary.
What I mean here by "distance" is distance in the feature space. Each word from your page and dictionary is represented as a vector of length 140. For each entry (i,j) of D, the i-th vector from page and the j-th vector from dictionary have their corresponding components subtracted and squared, and the results are summed up. This output is then stored in D(i,j).
This gives you the dissimilarity between word i from your page and word j from your dictionary at D(i,j). The higher the value, the more dissimilar the two words are.
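To make one entry concrete, here is a toy example in 3 dimensions instead of 140 (made-up numbers, purely illustrative):

```python
import numpy as np

# Toy 3-dimensional "words" so a single entry of D is traceable by hand.
page_word = np.array([1.0, 2.0, 3.0])   # the i-th column of page
dict_word = np.array([0.0, 2.0, 5.0])   # the j-th column of dictionary

# Subtract component-wise, square, and sum -- this single number is D(i,j).
d_ij = np.sum((page_word - dict_word) ** 2)
print(d_ij)  # 5.0  ->  (1-0)^2 + (2-2)^2 + (3-5)^2 = 1 + 0 + 4
```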
Minor note: pdist2 computes the Euclidean distance, while dumDistance computes the Euclidean squared distance. If you want the same result as dumDistance, simply square every element of the D returned by pdist2; in other words, compute D.^2.
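That relationship can be checked numerically; the sketch below uses SciPy's cdist as a stand-in for pdist2 (an illustration, not part of the original answer):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Random stand-ins for the question's data: columns are 140-dim words.
rng = np.random.default_rng(3)
page = rng.standard_normal((140, 40))
dictionary = rng.standard_normal((140, 100))

D_euclidean = cdist(page.T, dictionary.T)                      # like pdist2
D_squared = cdist(page.T, dictionary.T, metric='sqeuclidean')  # like dumDistance

# Squaring the Euclidean distances recovers the squared distances.
print(np.allclose(D_euclidean ** 2, D_squared))  # True
```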
Hope this helps. Good luck!
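Putting the pieces together, the histogram representation the question aims for could be sketched like this (an illustrative NumPy version; note the minimum is taken along the dictionary axis, so each of the 40 page words votes for its single nearest dictionary word):

```python
import numpy as np

# Random stand-ins for the question's data: columns are 140-dim words.
rng = np.random.default_rng(1)
page = rng.standard_normal((140, 40))
dictionary = rng.standard_normal((140, 100))

# Pairwise squared distances via broadcasting: D[i, j] = ||page[:, i] - dictionary[:, j]||^2
D = np.sum((page[:, :, None] - dictionary[:, None, :]) ** 2, axis=0)  # shape (40, 100)

# For each of the 40 page words, the index of its nearest dictionary word.
nearest = np.argmin(D, axis=1)               # shape (40,)

# Histogram over the 100 dictionary words: how often each is the nearest match.
hist = np.bincount(nearest, minlength=100)   # shape (100,)
print(hist.sum())  # 40 -- one vote per page word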