How to get identical output from R apcluster and Sandia Cognitive Foundry AffinityPropagation

I am migrating an R script to Java. The R script uses the apcluster library. I am trying to recreate the same output using the Sandia Cognitive Foundry AffinityPropagation class. But I am finding it difficult to tune the selfDivergence value appropriately.

Here is my R and Java code.

library(apcluster)

NgramAdjMatrix <- matrix(
  c(0.0, 0.0,  0.0,
    0.0, 1.0,  2.0,
    0.0, 2.0,  4.0,
    0.0, 3.0,  6.0,
    0.0, 4.0,  8.0,
    0.0, 5.0, 10.0,
    0.0, 6.0, 12.0),
  nrow = 7,
  ncol = 3,
  byrow = TRUE)

LatentClusters <- apcluster(negDistMat(r=2),NgramAdjMatrix,seed=1234)
representatives <- LatentClusters@exemplars
clustMembers <- LatentClusters@clusters
FinalNgramMatrix <- NgramAdjMatrix[representatives,]

The R script above gives this output:

    [,1] [,2] [,3]
[1,]   0    1    2
[2,]   0    4    8

Here is my Java code:

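import java.util.Arrays;
import java.util.Collection;

import gov.sandia.cognition.learning.algorithm.clustering.AffinityPropagation;
import gov.sandia.cognition.learning.algorithm.clustering.cluster.CentroidCluster;
import gov.sandia.cognition.learning.function.distance.EuclideanDistanceSquaredMetric;
import gov.sandia.cognition.math.matrix.Vector;
import gov.sandia.cognition.math.matrix.Vectorizable;
import gov.sandia.cognition.math.matrix.mtj.Vector3;

// Same seven points as the rows of NgramAdjMatrix in the R script.
Vector[] data = new Vector[]{
    new Vector3(0.0, 0.0, 0.0),
    new Vector3(0.0, 1.0, 2.0),
    new Vector3(0.0, 2.0, 4.0),
    new Vector3(0.0, 3.0, 6.0),
    new Vector3(0.0, 4.0, 8.0),
    new Vector3(0.0, 5.0, 10.0),
    new Vector3(0.0, 6.0, 12.0)
};

System.out.println(Arrays.toString(data));

// Squared Euclidean distance as the divergence; 6 is the selfDivergence value.
AffinityPropagation<Vectorizable> instance
        = new AffinityPropagation<>(
                EuclideanDistanceSquaredMetric.INSTANCE, 6);
Collection<CentroidCluster<Vectorizable>> clusters = instance.learn(Arrays.asList(data));

// Print the exemplar (centroid) of each cluster.
clusters.forEach(cluster -> System.out.println(cluster.getCentroid() + "..."));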

The Java code above gives this output:

<0.0, 1.0,  2.0>
<0.0, 2.0,  4.0>
<0.0, 5.0, 10.0>

The output is different, and it depends heavily on the selfDivergence parameter, which is set to 6 in my code.

Is there some way to make the Java code behave the same as the R code?

You are right that the results depend heavily on how you set the selfDivergence parameter. Having looked at the Java code, it seems that selfDivergence in the Java implementation corresponds to -p in the R implementation, where p is the input preference. So, at least theoretically,

apcluster(negDistMat(r=2),NgramAdjMatrix, p=-6)

should give you the same result. However, the R implementation adds random noise to the similarities, which can make the results vary. As far as I can tell, the Java version does not add any random noise. I tried adding nonoise=TRUE to the R call, but still did not obtain the result you got from the Java version.

Note also that the default damping factor is 0.9 in the R implementation (the lam argument) and 0.5 in the Java implementation. So the two implementations are not directly comparable with their default settings. Sorry that I cannot help more, but I hope these hints about the differences are useful.

Regards, UBod
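
One further difference worth checking: the R call in the question does not pass p at all, so apcluster picks the preference itself. Assuming its documented default applies (the median of the off-diagonal similarities when both p and q are left unset; this is an assumption to confirm against the apcluster manual), the equivalent selfDivergence can be estimated in plain Java as the median squared distance between distinct points. The sketch below is only an illustration; the class name PreferenceEstimate and its methods are hypothetical and not part of the Cognitive Foundry. It needs at least two points.

```java
import java.util.Arrays;

public class PreferenceEstimate {

    // Squared Euclidean distance between two points of equal dimension.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Median of the squared distances over all ordered pairs (i, j) with i != j.
    // Assumption: this mirrors apcluster's default preference up to sign,
    // since negDistMat(r=2) yields the negative squared distance.
    static double medianSquaredDistance(double[][] points) {
        int n = points.length;
        double[] distances = new double[n * (n - 1)];
        int k = 0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (i != j) {
                    distances[k++] = squaredDistance(points[i], points[j]);
                }
            }
        }
        Arrays.sort(distances);
        int mid = distances.length / 2;
        return distances.length % 2 == 0
                ? (distances[mid - 1] + distances[mid]) / 2.0
                : distances[mid];
    }

    public static void main(String[] args) {
        // The seven points from the question.
        double[][] data = {
            {0.0, 0.0, 0.0}, {0.0, 1.0, 2.0}, {0.0, 2.0, 4.0},
            {0.0, 3.0, 6.0}, {0.0, 4.0, 8.0}, {0.0, 5.0, 10.0},
            {0.0, 6.0, 12.0}
        };
        System.out.println(medianSquaredDistance(data)); // prints 20.0
    }
}
```

For the data in the question this gives 20, which would suggest trying selfDivergence = 20 instead of 6, provided selfDivergence really equals -p as described above. Whether that reproduces the R exemplars has not been verified here, and the differences in damping factor and noise still apply.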
