
Implementing a proximity matrix for clustering

I am fairly new to this field, so pardon me if the question sounds trivial or basic.

I have a dataset (a bag of words, to be specific) and I need to generate a proximity matrix from the pairwise edit distances between the words.

However, I am unsure how to keep track of my data/strings in the matrix. I need the proximity matrix for the purpose of clustering.

Or, more generally, how do you approach this kind of problem in the field? I am using Perl and R to implement this.

Here is typical Perl code I have written that reads from a text file containing my bag of words:

use strict;
use warnings;
use Text::Levenshtein qw(distance);
main(@ARGV);
   sub main
   {    
    my @TokenDistances ;
    my $Tokenfile  = 'TokenDistinct.txt';
    my @Token ;
    my $AppendingCount  = 0 ; 
    my @Tokencompare ;  
    my %Levcount  = ();
    open (FH ,"< $Tokenfile" ) or die ("Error opening file . $!");
     while(<FH>)
     {
        chomp $_;
        next if $_ =~ /^\s*$/;    # skip blank lines instead of storing empty tokens
        push (@Token , $_ ); 
     }
    close(FH); 
     @Tokencompare = @Token ; 


     foreach my $tokenWord(@Tokencompare)
     { 
        my $lengthoffile =  scalar @Tokencompare;
        my $i = 0 ;
        chomp $tokenWord ;

        #@TokenDistances = levDistance($tokenWord , \@Tokencompare );
        for($i = 0 ; $i < $lengthoffile ;$i++)
        {
            if(scalar @TokenDistances ==  scalar @Tokencompare)
            {
                print "Yipeeeeeeeeeeeeeeeeeeeee\n";
            }
            chomp $tokenWord   ;
            chomp $Tokencompare[$i];
            #print   $tokenWord. "   {$Tokencompare[$i]}  " . "      $TokenDistances[$i] " . "\n";
            #$Levcount{$tokenWord}{$Tokencompare[$i]} = $TokenDistances[$i];
            $Levcount{$tokenWord}{$Tokencompare[$i]} = levDistance($tokenWord , $Tokencompare[$i] );

        }

        StoreSortedValues ( \%Levcount ,\$tokenWord , \$AppendingCount);
        $AppendingCount++;
        %Levcount = () ;

     } 
    # %Levcount  = (); 
}

sub levDistance
{
    my $string1 = shift ;
    #my @StringList = @{(shift)};
    my $string2 =  shift ;
    return distance($string1 , $string2);
}


sub StoreSortedValues {


    my $Levcount  = shift;
    my $tokenWordTopMost = ${(shift)} ; 
    my $j = ${(shift)};
    my @ListToken;
    my $Tokenfile = 'LevResult.txt';

    if($j == 0 )
    {
        open (FH ,"> $Tokenfile" ) or die ("Error opening file . $!");
    }
    else
    {
        open (FH ,">> $Tokenfile" ) or die ("Error opening file . $!");
    }

                print $tokenWordTopMost; 
                my %tokenWordMaster = %{$Levcount->{$tokenWordTopMost}};
                # Distances are numbers, so sort numerically (<=>), not as strings (cmp).
                @ListToken = sort { $tokenWordMaster{$a} <=> $tokenWordMaster{$b} }   keys %tokenWordMaster;
            #@ListToken = keys %tokenWordMaster;

        print FH "-------------------------- " . $tokenWordTopMost . "-------------------------------------\n";
        #print FH  map {"$_  \t=>  $tokenWordMaster{$_} \n "}   @ListToken;
        foreach my $tokey (@ListToken)
        {
            print FH  "$tokey=>\t" . $tokenWordMaster{$tokey} . "\n" 

        }

        close(FH) or  die ("Error Closing File.  $!");

}

The problem is: how can I represent the proximity matrix from this and still keep track of which comparison corresponds to which entry in the matrix?

In the RecordLinkage package there is the levenshteinDist function, which is one way of calculating an edit distance between strings.

install.packages("RecordLinkage")
library(RecordLinkage)

Set up some data:

fruit <- c("Apple", "Apricot", "Avocado", "Banana", "Bilberry", "Blackberry", 
    "Blackcurrant", "Blueberry", "Currant", "Cherry")

Now create a matrix of zeros to reserve memory for the distance table, then use nested for loops to calculate the individual distances. We end up with a matrix that has one row and one column for each fruit, so we can name the rows and columns after the original vector. Those names are what let you keep track of which two strings each cell compares.

fdist <- matrix(rep(0, length(fruit)^2), ncol=length(fruit))
for(i in seq_along(fruit)){
  for(j in seq_along(fruit)){
    fdist[i, j] <- levenshteinDist(fruit[i], fruit[j])
  }
}
rownames(fdist) <- colnames(fdist) <- fruit

The results:

fdist

             Apple Apricot Avocado Banana Bilberry Blackberry Blackcurrant
Apple            0       5       6      6        7          9           12
Apricot          5       0       6      7        8         10           10
Avocado          6       6       0      6        8          9           10
Banana           6       7       6      0        7          8            8
Bilberry         7       8       8      7        0          4            9
Blackberry       9      10       9      8        4          0            5
Blackcurrant    12      10      10      8        9          5            0
Blueberry        8       9       9      8        3          3            8
Currant          7       5       6      5        8         10            6
Cherry           6       7       7      6        4          6           10
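Since the question asks for the matrix in order to cluster, here is a minimal sketch of one way to feed such a labelled matrix into hierarchical clustering. It assumes base R's adist() as a stand-in for levenshteinDist() so that no extra package is needed; the linkage method ("average") and the number of groups (k = 3) are arbitrary choices for illustration, not recommendations.

```r
# Base R only: adist() (utils) computes Levenshtein distances and stands in
# here for levenshteinDist(), so the sketch runs without extra packages.
fruit <- c("Apple", "Apricot", "Avocado", "Banana", "Bilberry", "Blackberry",
           "Blackcurrant", "Blueberry", "Currant", "Cherry")

fdist <- adist(fruit)
rownames(fdist) <- colnames(fdist) <- fruit

# hclust() expects a "dist" object; as.dist() carries the row names along as labels.
hc <- hclust(as.dist(fdist), method = "average")

# Cut the tree into k groups (k = 3 is an assumption made for this example).
groups <- cutree(hc, k = 3)

# groups is a named integer vector, so every cluster id maps back to its string.
print(split(names(groups), groups))
```

Because the labels travel with the distance object, the cluster assignments come back keyed by the original strings, and plot(hc) would draw the dendrogram with the same labels.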

The proximity (similarity or dissimilarity) matrix is just a table that stores a score for each pair of objects. So, if you have N objects, the R code can be simMat <- matrix(nrow = N, ncol = N), and each entry (i, j) of simMat indicates the similarity between item i and item j.
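To make the bookkeeping concrete, the sketch below gives simMat row and column names so an entry can be looked up either by position or by the strings themselves. The three sample words are hypothetical, and base R's adist() is assumed as the edit-distance function.

```r
words <- c("kitten", "sitting", "mitten")   # hypothetical sample data
N <- length(words)

# Name both dimensions after the strings so each cell is self-describing.
simMat <- matrix(NA_real_, nrow = N, ncol = N,
                 dimnames = list(words, words))

for (i in seq_len(N)) {
  for (j in seq_len(N)) {
    # adist() returns a 1 x 1 matrix for two single strings; take its only cell.
    simMat[i, j] <- adist(words[i], words[j])[1, 1]
  }
}

# The same cell, addressed by name and by position.
by_name     <- simMat["kitten", "sitting"]
by_position <- simMat[1, 2]
print(simMat)
```

Strictly speaking this holds distances (a dissimilarity), so smaller values mean more alike; the indexing works the same way for a similarity score.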

In R, you can use several packages, including vwr, to calculate the Levenshtein edit distance.
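If installing a package is not an option, base R's adist() is another way to get the same kind of matrix in a single call. The sketch below uses it (not vwr) on hypothetical sample words, and also flattens the matrix into a long table with one row per pair, which is sometimes easier to inspect or export than a square matrix.

```r
words <- c("kitten", "sitting", "mitten")   # hypothetical sample data

# adist() compares every element with every other element in one call.
d <- adist(words)
dimnames(d) <- list(words, words)

# Long format: one row per (from, to) combination with its distance.
pairs <- as.data.frame(as.table(d), stringsAsFactors = FALSE)
names(pairs) <- c("from", "to", "distance")

# The matrix is symmetric, so keep each unordered pair only once.
pairs <- pairs[pairs$from < pairs$to, ]
print(pairs)
```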

You may also find this Wikibook of interest: http://en.wikibooks.org/wiki/R_Programming/Text_Processing
