What distance function is scikit-learn using for categorical features?

I'm learning a little ML. I know the basics of k-nearest neighbors (kNN), but I've always seen it used for continuous data in examples.

The tutorial I'm following now uses kNN to classify data of a mixed type (continuous features plus several categorical features). I know that for continuous features they usually just use something like the Euclidean distance, but how do they deal with it when the features are mixed?

I see how a distance could be calculated for a binary variable easily, but what about a categorical one without an "order"?

edit: I'm following this tutorial for a Kaggle problem. After cleansing the data, the author has it in this form:

   Survived  Pclass  Sex  Age  Fare  Embarked  Title  IsAlone  Age*Class
0         0       3    0    1     0         0      1        0          3
1         1       1    1    2     3         1      3        0          2
2         1       3    1    1     1         0      2        1          3
3         1       1    1    2     3         0      3        0          2
4         0       3    0    2     1         0      1        1          6
5         0       3    0    1     1         2      1        1          3
6         0       1    0    3     3         0      1        1          3
7         0       3    0    0     2         0      4        0          0
8         1       3    1    1     1         0      3        0          3
9         1       2    1    0     2         1      3        0          0

(Where the first, unlabeled column is just the row index / ID)

So it's a little strange, because it's a mix of binary (e.g., Sex), categorical and ordered (e.g., Age is binned into 4 or 5 age brackets), and categorical but unordered (e.g., Embarked is 0, 1, or 2 depending only on which port the passenger boarded at, so I don't think it has a meaningful order).

The data is split like so:

X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test  = test_df.drop("PassengerId", axis=1).copy()
X_train.shape, Y_train.shape, X_test.shape

And then it all just gets passed to kNN like this:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
# Note: this is the accuracy on the training set, not on held-out data
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn

So how is kNN handling this? We haven't given it any information about which features are categorical or how distances between them should be measured.

sklearn's kNN will use the same (chosen) metric for all features (as indicated in the API; there is no option to mix metrics per feature!).
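
For example, here is a minimal sketch spelling out the defaults that the tutorial relies on implicitly; the comment describes how a nominal code like Embarked is treated under them:

from sklearn.neighbors import KNeighborsClassifier

# The defaults: Minkowski distance with p=2, i.e. plain Euclidean distance,
# computed over every column alike -- so Embarked=2 is treated as "farther"
# from Embarked=0 than Embarked=1 is, even though the codes are just
# arbitrary port labels.
knn = KNeighborsClassifier(n_neighbors=3, metric="minkowski", p=2)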

You are right that this is problematic in the mixed case, but it's your job to prepare your data for this! The standard approach is to use one-hot encoding, as explained here:

Often features are not given as continuous values but categorical.

...

Such an integer representation cannot be used directly with scikit-learn estimators, as these expect continuous input and would interpret the categories as being ordered, which is often not desired (i.e. the set of browsers was ordered arbitrarily).

One possibility to convert categorical features to features that can be used with scikit-learn estimators is to use a one-of-K or one-hot encoding, which is implemented in OneHotEncoder. This estimator transforms each categorical feature with m possible values into m binary features, with only one active.
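
A minimal sketch, assuming the Embarked column (coded 0/1/2 as in the question's table) is the nominal feature to expand:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Embarked coded as 0/1/2 for the three ports -- nominal, no meaningful order
embarked = np.array([[0], [2], [1], [2], [1]])

enc = OneHotEncoder()                            # output is sparse by default
print(enc.fit_transform(embarked).toarray())
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]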

Depending on your data, this might increase the number of features a lot! In this case you need to make a decision:

  • use dense data-structures (and still be able to use kd-trees / ball-trees internally) -- see the sketch after this list
  • use sparse data-structures (which forces brute-force lookups; note from the docs: fitting on sparse input will override the setting of this parameter, using brute force)
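
A rough sketch of the dense route, assuming the column names from the question and using pandas.get_dummies (which, unlike OneHotEncoder, returns an ordinary dense DataFrame):

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Expand only the unordered categorical columns into 0/1 indicator columns;
# binary and ordinal columns (Sex, Age, ...) stay as they are.
X_train_oh = pd.get_dummies(X_train, columns=["Embarked", "Title"])
X_test_oh = pd.get_dummies(X_test, columns=["Embarked", "Title"])
# Make sure train and test end up with the same dummy columns:
X_test_oh = X_test_oh.reindex(columns=X_train_oh.columns, fill_value=0)

# Dense input, so kNN can still pick a kd-tree / ball-tree internally
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_oh, Y_train)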
