
What distance function is scikit-learn using for categorical features?

I'm learning a little ML. I know the basics of k-nearest neighbors (kNN), but I've always seen it used for continuous data in examples.

The tutorial I'm following now uses kNN to classify data of a mixed type (continuous features and several categorical features). I know that for continuous features they usually just use something like the Euclidean distance, but how do they deal with it when the data is mixed?

I see how a distance could be calculated for a binary variable easily, but what about a categorical one without an "order"?

Edit: I'm following this tutorial for a Kaggle problem. After cleansing the data, he has it in the form:

   Survived  Pclass  Sex  Age  Fare  Embarked  Title  IsAlone  Age*Class
0         0       3    0    1     0         0      1        0          3
1         1       1    1    2     3         1      3        0          2
2         1       3    1    1     1         0      2        1          3
3         1       1    1    2     3         0      3        0          2
4         0       3    0    2     1         0      1        1          6
5         0       3    0    1     1         2      1        1          3
6         0       1    0    3     3         0      1        1          3
7         0       3    0    0     2         0      4        0          0
8         1       3    1    1     1         0      3        0          3
9         1       2    1    0     2         1      3        0          0

(Where the first column is actually the ID)

So it's a little strange because it's a mix of binary (e.g., Sex), categorical and ordered (e.g., Age is binned into 4 or 5 age brackets), and categorical but unordered (e.g., Embarked is 0, 1, or 2 based on just which port they got on, so I don't think it has an order).
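To make the concern concrete, here is a small sketch (the feature values follow the table above; the distance calculation is only an illustration of the worry, not something from the tutorial):

import numpy as np

# Two passengers identical except for Embarked: codes 0 and 2 are just
# different ports, with no meaningful order between them.
# Columns: Pclass, Sex, Age, Fare, Embarked, Title, IsAlone, Age*Class
a = np.array([3, 0, 1, 0, 0, 1, 0, 3])
b = np.array([3, 0, 1, 0, 2, 1, 0, 3])

# Euclidean distance sees |2 - 0| = 2, so these two ports look "twice as far
# apart" as ports coded 0 and 1 would, even though the codes are arbitrary.
print(np.linalg.norm(a - b))  # 2.0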

The data is split like so:

X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test  = test_df.drop("PassengerId", axis=1).copy()
X_train.shape, Y_train.shape, X_test.shape

And then it all just gets passed to kNN like this:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)  # accuracy on the training set
acc_knn

So how is it doing the kNN stuff? We haven't given any info or directions to it.

sklearn's kNN will use the same (chosen) metric for all features (which is indicated in the API; no option to mix metrics!).
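For illustration, a minimal sketch of what that single-metric setting looks like (the metric name here is just an example; X_train and Y_train are the frames from the question):

from sklearn.neighbors import KNeighborsClassifier

# One metric for every column; there is no per-feature choice.
knn = KNeighborsClassifier(n_neighbors=3, metric='manhattan')
knn.fit(X_train, Y_train)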

You are right that this is problematic in the mixed case, but it's your job to prepare your data for this! The standard approach is to use one-hot encoding, as explained here:

Often features are not given as continuous values but categorical.

...

Such integer representation can not be used directly with scikit-learn estimators, as these expect continuous input, and would interpret the categories as being ordered, which is often not desired (ie the set of browsers was ordered arbitrarily).

One possibility to convert categorical features to features that can be used with scikit-learn estimators is to use a one-of-K or one-hot encoding, which is implemented in OneHotEncoder. This estimator transforms each categorical feature with m possible values into m binary features, with only one active.
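A minimal sketch of that encoding on the Embarked column from the question (assuming the DataFrame shown above; how you fold this into the rest of the preprocessing is up to you):

from sklearn.preprocessing import OneHotEncoder

# Turn the unordered port codes 0/1/2 into three binary columns.
enc = OneHotEncoder()
embarked_onehot = enc.fit_transform(X_train[['Embarked']])

print(embarked_onehot.toarray()[:3])
# e.g. [[1., 0., 0.],
#       [0., 1., 0.],
#       [1., 0., 0.]]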

Depending on your data, this might increase the number of features a lot! In this case you need to make a decision:

  • use dense data-structures (and still be able to use kd-trees / ball-trees internally)
  • use sparse data-structures (which will use brute-force lookups; note: "fitting on sparse input will override the setting of this parameter, using brute force"); see the sketch below
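A minimal sketch of the two routes, just to make the trade-off concrete (one-hot encoding every column here is only for illustration; in practice you would probably encode only the unordered ones):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
X_sparse = enc.fit_transform(X_train)   # scipy sparse matrix
X_dense = X_sparse.toarray()            # same data as a dense ndarray

# Option 1: dense input, tree-based neighbour search stays available.
KNeighborsClassifier(n_neighbors=3, algorithm='kd_tree').fit(X_dense, Y_train)

# Option 2: sparse input, the 'algorithm' setting is overridden and
# brute-force search is used.
KNeighborsClassifier(n_neighbors=3).fit(X_sparse, Y_train)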
