Weka always producing same clusters for different data

I'm trying to use Weka to do K-Means clustering on a set of data, examining how different weights affect different attributes.

However, when I adjust the weights of each attribute, I'm not seeing any difference in the clustering.

//Initialize file readers
...
Instances dataSet = readDataFile(dataReader);
double[][] modifiers = readNormalizationFile(normReader, dataSet.numAttributes());
normalize(dataSet, modifiers);
SimpleKMeans kMeans = new SimpleKMeans();
kMeans.setPreserveInstancesOrder(true);
int[] clusters = null;
try
{
    System.out.println(kMeans.getSeed());
    if(distMet != 0)
        kMeans.setDistanceFunction(new ManhattanDistance(dataSet));
    kMeans.setNumClusters(k);
    kMeans.buildClusterer(dataSet);

    clusters = kMeans.getAssignments();
}
catch(Exception e)
{
    // The Weka calls above declare checked exceptions
    e.printStackTrace();
}
//Print clusters

The first dimension of the "modifiers" array corresponds to each attribute, and within each there are two values. The first is subtracted from the attribute value, and then the result is divided by the second value.

The normalization goes like this:

public static void normalize(Instances dataSet, double[][] modifiers)
{
    for(int i = 0; i < dataSet.numInstances(); i++)
    {
        Instance currInst = dataSet.instance(i);
        double[] values = currInst.toDoubleArray();
        for(int j = 0; j < values.length; j++)
        {
            currInst.setValue(j, (values[j] - modifiers[j][0]) / modifiers[j][1]);
        }
    }
}

My expectation is that increasing the second normalization value (the divisor) should reduce the importance of a particular attribute to the clustering and therefore change how clusters are assigned, but that isn't what I'm observing. My debugger shows that the correctly normalized values are being sent into the clusterer, but I find it hard to believe that Weka is messing up rather than me.

Have I used Weka's K-Means correctly, or have I left out something important?

The NormalizableDistance distance measures (such as Euclidean and Manhattan) have an option called dontNormalize. It defaults to false, which means the distance function normalizes the attribute values for you. That built-in normalization could undo all of the work done in your normalize function call.
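Why the built-in normalization would cancel a rescaling like yours can be sketched in a few lines. This assumes the distance function min-max normalizes each numeric attribute over the dataset, which is my understanding of NormalizableDistance. The symbols a, b, x_min and x_max are introduced here for illustration: a and b stand for modifiers[j][0] and modifiers[j][1] of one attribute, with b > 0.

```latex
x' = \frac{x - a}{b}, \qquad
x'_{\min} = \frac{x_{\min} - a}{b}, \qquad
x'_{\max} = \frac{x_{\max} - a}{b}

\frac{x' - x'_{\min}}{x'_{\max} - x'_{\min}}
  = \frac{(x - x_{\min})/b}{(x_{\max} - x_{\min})/b}
  = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
```

Both a and b drop out, so under that assumption the distances, and hence the clusters, would not depend on the modifiers at all.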

I ran tests on a random dataset, then manipulated the data of one attribute for a second trial, and the two clusterings ended up being identical. Setting dontNormalize to true led to different clusters and therefore different assignments of the instances in the dataset.
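The effect can be reproduced without Weka in a small sketch. The class below is not Weka's code: RangeNormalizationDemo, rescaleColumn and manhattan are hypothetical names, and the min-max step is a simplified stand-in for what I assume the distance function does when dontNormalize is false. It compares a Manhattan distance on a toy dataset before and after applying (x - shift) / scale to one column.

```java
public class RangeNormalizationDemo {

    /** Applies the question's transform (x - shift) / scale to one column and returns a copy. */
    static double[][] rescaleColumn(double[][] data, int col, double shift, double scale) {
        double[][] out = new double[data.length][];
        for (int i = 0; i < data.length; i++) {
            out[i] = data[i].clone();
            out[i][col] = (data[i][col] - shift) / scale;
        }
        return out;
    }

    /**
     * Manhattan distance between rows a and b. When normalize is true, each
     * column difference is divided by that column's (max - min) over the data,
     * a simplified stand-in for the assumed built-in normalization.
     */
    static double manhattan(double[][] data, int a, int b, boolean normalize) {
        double sum = 0.0;
        for (int j = 0; j < data[a].length; j++) {
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (double[] row : data) {
                min = Math.min(min, row[j]);
                max = Math.max(max, row[j]);
            }
            double width = max - min;
            double diff = Math.abs(data[a][j] - data[b][j]);
            sum += (normalize && width > 0) ? diff / width : diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Toy data, made up for illustration only
        double[][] original = { {1.0, 10.0}, {2.0, 30.0}, {4.0, 20.0} };
        // Rescale the second attribute, as the modifiers array would
        double[][] rescaled = rescaleColumn(original, 1, 5.0, 100.0);

        System.out.println("normalized,   original: " + manhattan(original, 0, 1, true));
        System.out.println("normalized,   rescaled: " + manhattan(rescaled, 0, 1, true));
        System.out.println("unnormalized, original: " + manhattan(original, 0, 1, false));
        System.out.println("unnormalized, rescaled: " + manhattan(rescaled, 0, 1, false));
    }
}
```

With normalization on, the two normalized distances agree up to floating-point error; with it off, they differ, which matches the behaviour described above. In Weka itself the corresponding switch should be calling setDontNormalize(true) on the ManhattanDistance (or EuclideanDistance) object before passing it to setDistanceFunction.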

Hope this helps!
