
How to choose the value of k in the ReliefF algorithm in MATLAB

Original post: 2016-03-13 10:19:28 2 1 algorithm/ matlab/ machine-learning

I'm using the ReliefF algorithm to rank the inputs of a classification problem. I have five inputs and about 500 observations, and I'm working in MATLAB.

I start by setting the number of nearest neighbors k to 2 and increase it all the way to 450. The resulting rankings vary wildly at first and then stabilize as k approaches 150. Below is a graph of the weight of each of the five attributes (higher weight means higher rank) against the number of nearest neighbors k.
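For readers who want to reproduce this kind of sweep, here is a minimal sketch. It is not the asker's actual script: the Fisher iris data stands in for the five-input dataset, and the grid `kValues` is an arbitrary assumption. It relies on `relieff` from the Statistics and Machine Learning Toolbox.

```matlab
% Sketch only: sweep k for ReliefF and plot each predictor's weight against k.
% ASSUMPTION: the Fisher iris data is a stand-in for the real dataset.
load fisheriris                 % provides meas (150x4) and species (150x1 cell)
X = meas;                       % n-by-p numeric predictor matrix
y = species;                    % class labels

kValues = 2:2:40;               % hypothetical grid of neighbor counts
W = zeros(numel(kValues), size(X, 2));   % one row of weights per k

for i = 1:numel(kValues)
    % Second output holds the weights in the original predictor order.
    [~, W(i, :)] = relieff(X, y, kValues(i));
end

plot(kValues, W);
xlabel('k (number of nearest neighbors)');
ylabel('ReliefF weight');
```

Each column of `W` becomes one curve, so crossings between curves correspond to changes in the ranking.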

How should I choose the value of k for the ReliefF algorithm?

1 Answer

With the k vs. weights plot you've just answered your own question. That is indeed a very smart approach.

The optimal k for your dataset is where the elbow is (circa 350).
What does that mean? It basically means that taking another neighbour into account no longer changes the weights noticeably, so it does not give a better model of the data.
You could object that choosing 350 or 400 leads to the same results, since the weights are equal. Correct. However, it is generally recommended to choose the smallest such value, because for the same results (i.e. weights) the model has lower complexity (fewer neighbours to take into account).
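The "smallest k past the elbow" rule can also be applied numerically instead of by eye. The sketch below is one possible reading of it, not a standard procedure: it treats the weights at the largest k as the converged reference and picks the smallest k from which every row stays within a tolerance of that reference. The table `W` is a toy example (not real ReliefF output), and `tol` is an assumed threshold.

```matlab
% Toy weight table, NOT real ReliefF output: rows are k values, columns are predictors.
kValues = [5 10 20 40 80];
W = [0.100 0.300;
     0.220 0.180;
     0.300 0.120;
     0.305 0.118;
     0.300 0.120];

tol = 0.01;   % ASSUMPTION: largest weight deviation still counted as "stable"

% Largest deviation of each row from the last row (the assumed converged weights).
dev = max(abs(bsxfun(@minus, W, W(end, :))), [], 2);

% Last k that is still outside the tolerance; everything after it is stable.
lastUnstable = find(dev > tol, 1, 'last');
if isempty(lastUnstable)
    lastUnstable = 0;           % every row is already within tolerance
end
kStar = kValues(lastUnstable + 1);

fprintf('Smallest stable k: %d\n', kStar);
```

With the toy table, the first two rows are outside the tolerance and the rest are inside, so `kStar` is 20. On real data the sweep should extend far enough that the last row can reasonably be treated as converged.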

Such brute-force parameter sweeps are commonly used for many algorithms in machine learning:

in K-NN to find the optimal number of neighbours
in K-Means to find the optimal number of clusters
in SVMs to find optimal tuning parameters

and so on and so forth...
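As an illustration of the same idea for K-NN, here is a hedged sketch that sweeps the number of neighbours and keeps the one with the lowest cross-validated error. The dataset (Fisher iris), the grid `kValues`, and the choice of 5 folds are all assumptions made for the example.

```matlab
% Sketch only: brute-force search for the K-NN neighbour count via cross-validation.
load fisheriris                 % provides meas (predictors) and species (labels)
rng(0);                         % fix the seed so the fold assignment is repeatable

kValues = 1:2:25;               % hypothetical grid of neighbour counts
cvLoss = zeros(size(kValues));  % cross-validated misclassification rate per k

for i = 1:numel(kValues)
    mdl = fitcknn(meas, species, 'NumNeighbors', kValues(i));
    cvLoss(i) = kfoldLoss(crossval(mdl, 'KFold', 5));
end

% min returns the first minimum, so ties resolve to the smallest k.
[~, best] = min(cvLoss);
bestK = kValues(best);

plot(kValues, cvLoss, '-o');
xlabel('Number of neighbours');
ylabel('5-fold cross-validated loss');
```

The tie-breaking toward the smallest value mirrors the advice above: among equally good settings, prefer the one with fewer neighbours.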

I ran the very same experiment as you, but on another dataset, and obtained the following plot: