How do I classify data with a nearest-neighbor algorithm in Python?

Question

I need to classify some data with (I hope) a nearest-neighbour algorithm. I've googled this problem and found a lot of libraries (including PyML, mlPy and Orange), but I'm unsure of where to start.

How should I go about implementing k-NN using Python?

Answer 1

Particularly given the technique (k-Nearest Neighbors) that you mentioned in your question, I would strongly recommend scikits.learn. [Note: after this answer was posted, the lead developer of this project informed me of a new homepage for the project.]

A few features that I believe distinguish this library from the others (at least the other Python ML libraries that I have used, which is most of them):

an extensive diagnostics & testing library (including plotting modules, via Matplotlib), which includes feature-selection algorithms, confusion matrix, ROC, precision-recall, etc.;
a nice selection of 'batteries-included' data sets (including handwritten digits, facial images, etc.) particularly suited for ML techniques;
extensive documentation (a nice surprise given that this project is only about two years old), including tutorials and step-by-step example code (which use the supplied data sets).
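To make the first two points concrete, here is a minimal sketch that trains k-NN on the bundled handwritten-digits data set and prints a confusion matrix. It assumes a current scikit-learn release (imported as sklearn); the 25% test split, random_state=0 and k=5 are my own illustrative choices, not anything the library prescribes.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load one of the bundled data sets: 8x8 images of handwritten digits.
digits = load_digits()

# Hold out a quarter of the samples for evaluation (illustrative split).
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# Fit a k-NN classifier (k=5 is an assumption for this sketch).
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

# Rows are true digits, columns are predicted digits.
cm = confusion_matrix(y_test, clf.predict(X_test))
print(cm)
```

Off-diagonal entries show which digits get mistaken for which; a mostly diagonal matrix means the classifier is doing well.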

Without exception (at least that I can think of at this moment), the Python ML libraries are superb. (See the PyMVPA homepage for a list of the dozen or so most popular Python ML libraries.)

In the past 12 months, for instance, I have used ffnet (for MLP), neurolab (also for MLP), PyBrain (Q-Learning), and PyMVPA (SVM), all available from the Python Package Index. These vary significantly from each other with respect to maturity, scope, and supplied infrastructure, but I found them all to be of very high quality.

Still, the best of these might be scikits.learn; for instance, I am not aware of any Python ML library other than scikits.learn that includes all three of the features I mentioned above (a few have solid example code and/or tutorials, but none that I know of integrate these with a library of research-grade data sets and diagnostic algorithms).

Second, given the technique you intend to use (k-nearest neighbor), scikits.learn is a particularly good choice. Scikits.learn includes kNN algorithms for both regression (returns a score) and classification (returns a class label), as well as detailed sample code for each.

Using the scikits.learn k-nearest neighbor module (literally) couldn't be any easier:
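The difference between the two flavours is only in how the neighbours' targets are combined. The following is a plain-Python sketch of that idea, not the library's implementation; the helper names (k_nearest, knn_classify, knn_regress) and the toy data are hypothetical and exist only for illustration.

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """Return the k (point, target) pairs closest to query (Euclidean distance)."""
    return sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]

def knn_classify(train, query, k=3):
    """Classification: majority vote over the neighbours' class labels."""
    labels = [target for _, target in k_nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(train, query, k=3):
    """Regression: mean of the neighbours' numeric targets."""
    values = [target for _, target in k_nearest(train, query, k)]
    return sum(values) / len(values)

# Toy data: (point, target) pairs, invented for this sketch.
labelled = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
            ((5.0, 5.0), "b"), ((5.2, 4.9), "b")]
scored = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]

label = knn_classify(labelled, (1.1, 1.0))   # a class label
score = knn_regress(scored, (2.1,))          # a numeric score
print(label, score)
```

The neighbour search is identical in both cases; only the final aggregation (vote versus average) changes.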

>>> # import NumPy and the relevant scikits.learn module
>>> import numpy as NP
>>> from sklearn import neighbors as kNN

>>> # load one of the sklearn-supplied data sets
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> # the call to load_iris() loaded both the data and the class labels, so
>>> # bind each to its own variable
>>> data = iris.data
>>> class_labels = iris.target

>>> # construct a classifier-builder by instantiating the kNN module's primary class
>>> # (called NeighborsClassifier in early scikits.learn releases; current
>>> # scikit-learn releases name it KNeighborsClassifier)
>>> kNN1 = kNN.KNeighborsClassifier()

>>> # now construct ('train') the classifier by passing the data and class labels
>>> # to the classifier-builder; fit() returns the fitted classifier itself
>>> kNN1.fit(data, class_labels)
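Once fitted, the classifier can label new observations. Here is a minimal self-contained sketch, assuming a current scikit-learn release where the class is named KNeighborsClassifier; the measurements in new_point are made up for illustration.

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

# Load the iris data and fit a 5-nearest-neighbour classifier on all of it.
iris = datasets.load_iris()
clf = KNeighborsClassifier(n_neighbors=5).fit(iris.data, iris.target)

# One new flower: sepal length, sepal width, petal length, petal width (cm).
# These values are invented for this sketch.
new_point = [[5.0, 3.4, 1.5, 0.2]]

predicted = clf.predict(new_point)            # array of class indices
probabilities = clf.predict_proba(new_point)  # share of neighbours per class

print(iris.target_names[predicted[0]])
print(probabilities)
```

predict returns the majority class among the k neighbours, while predict_proba reports what fraction of those neighbours belongs to each class.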

What's more, unlike nearly all other ML techniques, the crux of k-nearest neighbors is not coding a working classifier builder; rather, the difficult step in building a production-grade k-nearest neighbor classifier/regressor is the persistence layer, i.e., storage and fast retrieval of the data points from which the nearest neighbors are selected. For the kNN data storage layer, scikits.learn includes an algorithm for a ball tree (which I know almost nothing about, other than that it is apparently superior to the kd-tree, the traditional data structure for k-NN, because its performance doesn't degrade in higher-dimensional feature spaces).
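For what it's worth, both tree structures can be used directly. This is a minimal sketch assuming a current scikit-learn release, which exposes BallTree and KDTree in sklearn.neighbors; the random 20-dimensional data and leaf_size=20 are illustrative assumptions, and the sketch shows the API only, not a benchmark.

```python
import numpy as np
from sklearn.neighbors import BallTree, KDTree

# 1000 random points in a 20-dimensional feature space (made-up data).
rng = np.random.default_rng(0)
points = rng.random((1000, 20))

# Build both index structures over the same points.
ball = BallTree(points, leaf_size=20)
kd = KDTree(points, leaf_size=20)

# Ask each tree for the 3 nearest neighbours of the first point.
query = points[:1]
ball_dist, ball_idx = ball.query(query, k=3)
kd_dist, kd_idx = kd.query(query, k=3)

print(ball_idx, ball_dist)
print(kd_idx, kd_dist)
```

Both trees perform exact searches, so they return the same neighbours; they differ only in how quickly they get there. When using the classifier rather than the trees themselves, the algorithm parameter of KNeighborsClassifier (for example algorithm='ball_tree') selects the structure.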

Additionally, k-nearest neighbors requires an appropriate similarity metric (Euclidean distance is the usual choice, though not always the best one). Scikits.learn includes a stand-alone module comprising various distance metrics as well as testing algorithms for selecting the appropriate one.
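To see why the choice of metric matters, here is a plain-Python sketch in which the nearest neighbour of the same query changes with the metric. The function names and the two toy points are hypothetical and exist only for this illustration; this is not the library's distance module.

```python
import math

def euclidean(p, q):
    """Straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def chebyshev(p, q):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(p, q))

def nearest(points, query, metric):
    """Return the name of the point closest to query under the given metric."""
    return min(points, key=lambda name: metric(points[name], query))

# Two candidate neighbours and a query at the origin (made-up data).
points = {"A": (2.0, 2.0), "B": (0.0, 3.0)}
query = (0.0, 0.0)

nearest_by_metric = {m.__name__: nearest(points, query, m)
                     for m in (euclidean, manhattan, chebyshev)}
print(nearest_by_metric)
```

Point A is closer under the Euclidean and Chebyshev metrics, while point B is closer under the Manhattan metric, so a 1-nearest-neighbour classifier would answer differently depending on which one is used.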

Finally, there are a few libraries that I have not mentioned, either because they are out of scope (PyML, Bayesian); because they are not primarily 'libraries' for developers but rather applications for end users (e.g., Orange); or because they have unusual or difficult-to-install dependencies (e.g., mlpy, which requires the gsl, which in turn must be built from source), at least for my OS, which is Mac OS X.

(Note: I am not a developer/committer for scikits.learn.)

Answer 1
61, accepted, 2011-09-07 07:23:52