
Pairwise distance calculation (multidimensional matrix) for feature similarity

OK, here is the formula in MATLAB:

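function D = dumDistance(X,Y)
% Squared Euclidean distance between every column of X and every column of Y.
n1 = size(X,2);   % number of columns (observations) in X
n2 = size(Y,2);   % number of columns (observations) in Y
D = zeros(n1,n2); % D(i,j) pairs column i of X with column j of Y
for i = 1:n1
    for j = 1:n2
        D(i,j) = sum((X(:,i)-Y(:,j)).^2);
    end
end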

Credits here (I know it's not a fast implementation, but it shows the basic algorithm).

Now, here is what I am having trouble understanding:

Say that we have a matrix dictionary of size 140x100 (100 words) and a matrix page of size 140x40 (40 words). Each column represents a word in the 140-dimensional space.

Now, if I use dumDistance(page,dictionary), it will return a 40x100 matrix of distances.

What I want to achieve is to find how close each word of the page matrix is to the words of the dictionary matrix, in order to represent the page in terms of the dictionary, with a histogram, let's say.

I know that if I take the min of the 40x100 matrix, I'll get a 1x100 matrix with the locations of the minimum values to represent my histogram.

What I really can't understand here is this 40x100 matrix. What data does this matrix represent anyway? I can't visualize it in my mind.

Minor comment before I start:

You should really use pdist2 instead. It is much faster, and it gives you the same information as dumDistance (up to the squaring discussed in the minor note at the end). In other words, you would call it like this:

D = pdist2(page.', dictionary.');

You need to transpose page and dictionary because pdist2 assumes that each row is an observation and each column is a variable / feature, whereas your data is structured so that each column is an observation. This returns a 40 x 100 matrix, just like the one you get from dumDistance. However, pdist2 does not use for loops.
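If you are curious how the double loop can be avoided at all, here is a minimal sketch that needs no toolbox. It relies on the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2*x'*y. This is one common way to vectorize the computation, not necessarily what pdist2 does internally, and the random page and dictionary matrices are stand-ins for your real data.

```matlab
% Stand-in data with the same layout as in the question: one word per column.
rng(0);                          % fixed seed so the run is repeatable
page       = rand(140, 40);      % 40 page words, 140 features each
dictionary = rand(140, 100);     % 100 dictionary words, 140 features each

% Reference result: the same double loop as dumDistance.
D_loop = zeros(40, 100);
for i = 1:40
    for j = 1:100
        D_loop(i,j) = sum((page(:,i) - dictionary(:,j)).^2);
    end
end

% Loop-free version: ||x - y||^2 = ||x||^2 + ||y||^2 - 2*x'*y for every pair.
pp = sum(page.^2, 1).';          % 40 x 1 squared norms of the page words
dd = sum(dictionary.^2, 1);      % 1 x 100 squared norms of the dictionary words
D_vec = bsxfun(@plus, pp, dd) - 2 * (page.' * dictionary);   % 40 x 100

disp(size(D_vec));                       % rows: page words, columns: dictionary words
disp(max(abs(D_vec(:) - D_loop(:))));    % largest difference; only round-off is expected
```

The two results agree up to floating-point round-off, so in practice you may want to clamp tiny negative values in D_vec to zero before taking square roots.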


Now onto your question:

D(i,j) represents the squared Euclidean distance between word i from your page and word j from your dictionary. You have 40 words on your page and 100 words in your dictionary. Each word is represented by a 140-dimensional feature vector, so the rows of D index the words of page while the columns of D index the words of dictionary.
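To make the row / column meaning concrete, here is a tiny example you can check by hand. The data is made up for illustration: 2-dimensional "words" instead of 140-dimensional ones, with 2 page words and 3 dictionary words.

```matlab
% Made-up toy data: one word per column, 2 features per word.
% Page words:       p1 = (0,0), p2 = (1,1)
% Dictionary words: d1 = (0,0), d2 = (3,4), d3 = (1,0)
page       = [0 1;
              0 1];
dictionary = [0 3 1;
              0 4 0];

% Same double loop as dumDistance, written inline.
D = zeros(size(page,2), size(dictionary,2));
for i = 1:size(page,2)
    for j = 1:size(dictionary,2)
        D(i,j) = sum((page(:,i) - dictionary(:,j)).^2);
    end
end

disp(D);
% Row 1 holds the distances from p1: [0 25 1]
% Row 2 holds the distances from p2: [2 13 1]
```

For instance, D(1,2) = (0-3)^2 + (0-4)^2 = 25 compares page word 1 with dictionary word 2. With your real data the same layout simply grows to 40 rows and 100 columns.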

What I mean by "distance" here is distance in the feature space. Each word from your page and your dictionary is represented as a vector of length 140. Each entry (i,j) of D takes the i-th vector from page and the j-th vector from dictionary, subtracts their corresponding components, squares the differences, and sums them up. The result is stored in D(i,j), which gives you the dissimilarity between word i from your page and word j from your dictionary. The higher the value, the more dissimilar the two words are.
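Since a small value means "similar", one way to get the histogram you describe is to find, for each page word (each row of D), the dictionary word with the smallest distance, and then count how often each dictionary word wins. This is only a sketch of that idea under my reading of your goal; the small D below is made up, and the variable names nearest and counts are mine.

```matlab
% Made-up distance matrix: 3 page words (rows) x 4 dictionary words (columns).
D = [0 25 1   9;
     2 13 1   7;
     5  4 0.5 6];

% For each page word, the smallest distance and which dictionary word gives it.
[minDist, nearest] = min(D, [], 2);     % minimum along each row

% Histogram over the dictionary: how many page words chose each dictionary word.
counts = accumarray(nearest, 1, [size(D,2) 1]).';

disp(nearest.');   % 1 3 3
disp(counts);      % 1 0 2 0
```

Note that the minimum is taken along each row, so you get one entry per page word, and the histogram then has one bin per dictionary word.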

Minor note: pdist2 computes the Euclidean distance, while dumDistance computes the squared Euclidean distance. If you want the same values as dumDistance, simply square every element of the D returned by pdist2. In other words, compute D.^2.
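If you want to convince yourself of this, a quick check along the following lines should do. It assumes you have the Statistics and Machine Learning Toolbox (which provides pdist2), and it uses random stand-in data rather than your real matrices.

```matlab
% Requires the Statistics and Machine Learning Toolbox for pdist2.
rng(0);                          % fixed seed so the run is repeatable
page       = rand(140, 40);      % stand-in for your page matrix
dictionary = rand(140, 100);     % stand-in for your dictionary matrix

% Squared Euclidean distances, computed the dumDistance way.
D_loop = zeros(40, 100);
for i = 1:40
    for j = 1:100
        D_loop(i,j) = sum((page(:,i) - dictionary(:,j)).^2);
    end
end

% Euclidean distances from pdist2, squared element-wise afterwards.
D_euc = pdist2(page.', dictionary.');   % 40 x 100
D_sq  = D_euc.^2;

disp(max(abs(D_sq(:) - D_loop(:))));    % largest difference; only round-off is expected
```

Because squaring preserves the ordering of non-negative numbers, the nearest dictionary word for each page word is the same whether or not you square.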

Hope this helps. Good luck!
