
Data Mining KNN Classifier

Suppose a data analyst working for an insurance company was asked to build a predictive model for predicting whether a customer will buy a mobile-home insurance policy. S/he tried a kNN classifier with different numbers of neighbours (k = 1, 2, 3, 4, 5) and got the following F-scores measured on the training data: (1.0; 0.92; 0.90; 0.85; 0.82). Based on that, the analyst decided to deploy kNN with k = 1. Was it a good choice? How would you select an optimal number of neighbours in this case?

It is not a good idea to select a parameter of a prediction algorithm by evaluating on the whole training set: the result will be biased towards that particular training set and carries no information about generalization performance (i.e. performance on unseen cases). In particular, with k = 1 every training point is its own nearest neighbour, so a perfect training score is guaranteed by construction. You should instead apply a cross-validation technique, e.g. 10-fold cross-validation, to select the best k (i.e. the k with the largest F-value) within a range. This involves splitting your training data into 10 equal parts, retaining 9 parts for training and 1 for validation, and iterating so that each part is left out for validation once. If you use enough folds, this also gives you statistics of the F-value, so you can test whether the differences between k values are statistically significant.
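The analyst's F-score of 1.0 at k = 1 is no accident. A minimal pure-Python sketch with made-up 1-D data (the data and function names are illustrative, not from the question):

```python
def knn_predict(train_x, train_y, x, k):
    """Classify x by majority vote among its k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    votes = [train_y[i] for i in nearest]
    return max(set(votes), key=votes.count)

# Hypothetical training data: two classes on the real line.
train_x = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
train_y = [0, 0, 0, 1, 1, 1]

# "Evaluating" on the training data itself: every point finds itself
# at distance 0, so k=1 is trivially perfect.
acc_k1 = sum(knn_predict(train_x, train_y, x, 1) == y
             for x, y in zip(train_x, train_y)) / len(train_x)
print(acc_k1)  # 1.0, no matter how noisy the data is
```

The same holds for the F-score: a perfect fit of k = 1 to its own training data is guaranteed by construction and is not evidence of predictive power.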

See e.g. also: http://pic.dhe.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Falg_knn_training_crossvalidation.htm
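The 10-fold selection procedure described above can be sketched in pure Python. This is a sketch under assumptions: the 1-D Gaussian data, the noise level, and the function names are invented, and plain accuracy stands in for the F-score for brevity:

```python
import random

def knn_predict(train_x, train_y, x, k):
    """Classify x by majority vote among its k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    votes = [train_y[i] for i in nearest]
    return max(set(votes), key=votes.count)

def cv_score(X, y, k, n_folds=10):
    """Mean validation accuracy of k-NN over n_folds cross-validation folds."""
    folds = [list(range(f, len(X), n_folds)) for f in range(n_folds)]
    scores = []
    for fold in folds:
        keep = [i for i in range(len(X)) if i not in fold]
        tr_x, tr_y = [X[i] for i in keep], [y[i] for i in keep]
        hits = sum(knn_predict(tr_x, tr_y, X[i], k) == y[i] for i in fold)
        scores.append(hits / len(fold))
    return sum(scores) / len(scores)

# Hypothetical noisy data: class 0 centred at 0, class 1 at 3, overlapping.
random.seed(0)
X = [random.gauss(0, 1) for _ in range(50)] + [random.gauss(3, 1) for _ in range(50)]
y = [0] * 50 + [1] * 50

# Pick the k whose held-out score is best, not the k that fits training best.
best_k = max(range(1, 6), key=lambda k: cv_score(X, y, k))
print("best k:", best_k)
```

In practice you would use a library routine rather than hand-rolling this, e.g. scikit-learn's `GridSearchCV` over `KNeighborsClassifier` with `scoring='f1'`.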

A subtlety here, however, is that there is likely a dependency between the number of training points and the best k value: if you apply 10-fold cross-validation, you use only 9/10 of the training set for training. I am not sure whether any research has been done on this, or how to correct for it in the final training set. In any case, most software packages just use the abovementioned technique, e.g. see SPSS in the link. One solution is to use leave-one-out cross-validation (each data sample is left out once for testing); in that case you have N-1 training samples, where the original training set has N.
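A leave-one-out sketch in the same style. The toy data, including the single class-1 outlier placed at 1.7 among the class-0 points, is invented to show why held-out evaluation can rank k = 1 below a larger k:

```python
def knn_predict(train_x, train_y, x, k):
    """Classify x by majority vote among its k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    votes = [train_y[i] for i in nearest]
    return max(set(votes), key=votes.count)

def loocv_score(X, y, k):
    """Leave-one-out: predict each sample from the remaining N-1 samples."""
    hits = 0
    for j in range(len(X)):
        hits += knn_predict(X[:j] + X[j + 1:], y[:j] + y[j + 1:], X[j], k) == y[j]
    return hits / len(X)

# Toy data with one class-1 outlier sitting among the class-0 points.
X = [1.0, 1.2, 1.5, 2.0, 7.5, 8.0, 8.5, 9.0, 1.7]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

# LOOCV punishes k=1 for copying the outlier's label onto its neighbours,
# while k=3 smooths over it.
print(loocv_score(X, y, 1), loocv_score(X, y, 3))
```

Unlike 10-fold cross-validation, each leave-one-out model trains on N-1 samples, so its estimate is closest to the behaviour of the final model trained on all N, at the cost of N model fits.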
