I'm learning a little ML. I know the basics of k-nearest neighbors (kNN), but I've always seen it used for continuous data in examples.
The tutorial I'm following now uses kNN to classify mixed-type data (continuous features plus several categorical features). I know that for continuous features they usually just use something like Euclidean distance (or another metric), but how is it handled when the types are mixed?
I see how a distance could be calculated for a binary variable easily, but what about a categorical one without an "order"?
Edit: I'm following this tutorial for a Kaggle problem. After cleaning the data, he has it in this form:
   Survived  Pclass  Sex  Age  Fare  Embarked  Title  IsAlone  Age*Class
0         0       3    0    1     0         0      1        0          3
1         1       1    1    2     3         1      3        0          2
2         1       3    1    1     1         0      2        1          3
3         1       1    1    2     3         0      3        0          2
4         0       3    0    2     1         0      1        1          6
5         0       3    0    1     1         2      1        1          3
6         0       1    0    3     3         0      1        1          3
7         0       3    0    0     2         0      4        0          0
8         1       3    1    1     1         0      3        0          3
9         1       2    1    0     2         1      3        0          0
(The unlabeled first column is just the row ID.)
So it's a little strange because it's a mix of binary (e.g., Sex), categorical and ordered (e.g., Age is binned into 4 or 5 age brackets), and categorical but unordered (e.g., Embarked is 0, 1, or 2 based just on which port the passenger boarded at, so I don't think it has an order).
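To make the worry concrete: with plain integer codes, a distance metric already implies an order between ports that doesn't exist. A toy check, assuming Euclidean distance:

```python
import numpy as np

# Embarked codes 0, 1, 2 are just port labels, but as plain integers
# Euclidean distance treats port 0 and port 2 as "twice as far apart"
# as port 0 and port 1 -- an ordering that doesn't exist in reality.
print(np.linalg.norm(np.array([0]) - np.array([1])))  # 1.0
print(np.linalg.norm(np.array([0]) - np.array([2])))  # 2.0
```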
The data is split like so:
X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("PassengerId", axis=1).copy()
X_train.shape, Y_train.shape, X_test.shape
And then it all just gets passed to kNN like this:
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn
So how is kNN handling this? We haven't given it any information or directions about the feature types.
sklearn's kNN will use the same (chosen) metric for all features (as indicated in the API; there is no option to mix metrics!).
You are right that this is problematic in the mixed case, but it's your job to prepare your data for this! The standard approach is one-hot encoding, as explained here:
Often features are not given as continuous values but categorical.
...
Such integer representation can not be used directly with scikit-learn estimators, as these expect continuous input, and would interpret the categories as being ordered, which is often not desired (ie the set of browsers was ordered arbitrarily).
One possibility to convert categorical features to features that can be used with scikit-learn estimators is to use a one-of-K or one-hot encoding, which is implemented in OneHotEncoder. This estimator transforms each categorical feature with m possible values into m binary features, with only one active.
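A minimal sketch of what that means for the unordered Embarked column (the values here are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical mini version of the Embarked column: 0, 1, 2 are port
# labels with no meaningful order.
embarked = np.array([[0], [2], [1], [0]])

enc = OneHotEncoder()                           # returns a sparse matrix by default
onehot = enc.fit_transform(embarked).toarray()  # densify for display
print(onehot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]]
```

After encoding, the Euclidean distance between any two *different* ports is the same (sqrt(2)), so the features no longer imply 0 < 1 < 2.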
Depending on your data, this might increase the number of features a lot! In this case you need to make a decision:

- use dense data structures (and possibly dimensionality reduction), or
- use sparse data structures (scipy.sparse), keeping in mind the note in the KNeighborsClassifier docs on the algorithm parameter: "fitting on sparse input will override the setting of this parameter, using brute force."
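Putting it together for the tutorial's setup: one-hot encode only the unordered column(s) and leave binary/ordinal columns alone, then feed the result to kNN as before. A sketch with a toy stand-in for the cleaned frame (column names follow the tutorial; values are invented):

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-in for the cleaned Titanic frame.
train_df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass":   [3, 1, 3, 1, 3, 3],   # ordered   -> keep as an integer
    "Sex":      [0, 1, 1, 1, 0, 0],   # binary    -> already fine
    "Embarked": [0, 2, 1, 2, 1, 0],   # unordered -> one-hot encode
})

# pd.get_dummies one-hot encodes only the listed columns.
X = pd.get_dummies(train_df.drop("Survived", axis=1), columns=["Embarked"])
y = train_df["Survived"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.score(X, y))   # training accuracy, as in the tutorial
```

The same transformation must, of course, be applied to X_test (with the same set of dummy columns) before calling knn.predict.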