
Nearest Neighbor for partially unknown vector

Let's say we have a list of people and would like to find people similar to person X.

The feature vector has 3 items [weight, height, age] and there are 3 people in our list. Note that we don't know the height of person C.

A: [70kg, 170cm, 60y]
B: [60kg, 169cm, 50y]
C: [60kg, ?,     50y]

What would be the best way to find the people closest to person A?

My guess

Let's calculate the average value for height, and use it instead of unknown value.

So, let's say we calculated that 170cm is the average height, and we redefine person C as [60kg, ~170cm, 50y].

Now we can find the people closest to A: the order will be A, C, B.
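A minimal sketch of this approach (the data and the Euclidean-distance choice are illustrative; note that computing the mean from this set gives 169.5cm rather than a round 170cm, but the ranking comes out the same):

```python
import math

# Toy data: None marks the unknown height of person C.
people = {
    "A": [70, 170, 60],
    "B": [60, 169, 50],
    "C": [60, None, 50],
}

def impute_mean(people):
    """Replace each None with the mean of the known values in that column."""
    n = len(next(iter(people.values())))
    filled = {name: list(vec) for name, vec in people.items()}
    for j in range(n):
        known = [v[j] for v in people.values() if v[j] is not None]
        mean = sum(known) / len(known)
        for name in filled:
            if filled[name][j] is None:
                filled[name][j] = mean
    return filled

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

filled = impute_mean(people)
ranked = sorted(people, key=lambda p: distance(filled["A"], filled[p]))
print(ranked)  # A first, then C (imputed height 169.5), then B
```

This reproduces the problem described below: C's guessed height lands it slightly closer to A than B's real height does.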

Problem

Now, the problem is that we rank C, with a guessed ~170cm, ahead of B, with a known 169cm.

It kinda feels wrong. We humans are smarter than machines, and we know there's little chance that C is exactly 170cm. So it would be better to rank B, with 169cm, ahead of C.

But how can we calculate that penalty (preferably with a simple empirical algorithm)? Should we somehow penalise vectors with unknown values? And by how much (maybe by the average height difference between every pair of people in the set)?

And what would that penalisation look like in the general case, where the feature vector has dimension N, with K known items and U unknown (K + U = N)?

In this particular example, would it be better to use linear regression to fill in the missing values instead of taking the average? That way you may have more confidence in the guessed value and may not need a penalty.

But if you want a penalty, one idea is to take the ratio of non-missing features. In the example there are 3 features in total, and C has values for 2 of them, so the ratio of non-missing features for C is 2/3. Adjust the similarity score by multiplying it by that ratio. For example, if the similarity between A and C is 0.9, the adjusted similarity is 0.9 * 2/3 = 0.6. The similarity between A and B is unaffected, since B has values for all the features and its ratio is 1.

You can also weight the features when computing the ratio. For example, give (weight, height, age) the weights (0.3, 0.4, 0.3) respectively. Then missing the height feature leaves a weighted ratio of 0.3 + 0.3 = 0.6. You can see that C is penalized even more, since we consider height more important than weight and age.
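A minimal sketch of this weighted-ratio penalty (the weights and the similarity score of 0.9 are the illustrative numbers from above; uniform weights of 1/3 each would recover the plain 2/3 ratio):

```python
# Illustrative feature weights for (weight, height, age).
weights = [0.3, 0.4, 0.3]

def non_missing_ratio(vec, weights):
    """Sum of weights of the features that are present (None = missing)."""
    return sum(w for v, w in zip(vec, weights) if v is not None)

def adjusted_similarity(sim, vec, weights):
    """Scale a similarity score down by the vector's non-missing ratio."""
    return sim * non_missing_ratio(vec, weights)

c = [60, None, 50]  # missing height -> ratio 0.3 + 0.3 = 0.6
b = [60, 169, 50]   # complete      -> ratio 1.0
print(adjusted_similarity(0.9, c, weights))
print(adjusted_similarity(0.9, b, weights))
```

With this adjustment, B's unpenalized similarity now beats C's, which is the ordering the question wants.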

I would suggest, using the data points for which all attributes are known, training a learning model (linear regression or a multi-layer perceptron) to predict the unknown attribute, and then using that model to fill in the unknown attributes. The average is a special case of a linear model.
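A sketch of this idea on the question's data, fitting ordinary least squares via `np.linalg.lstsq` (my choice of solver; with only two complete rows the system is underdetermined, and lstsq returns the minimum-norm solution, which still fits both training rows exactly):

```python
import numpy as np

# Train on the complete rows A and B: predict height from (intercept, weight, age).
X_train = np.array([[1, 70, 60],    # A
                    [1, 60, 50]])   # B
y_train = np.array([170, 169])      # known heights of A and B

beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

x_c = np.array([1, 60, 50])         # C's known features (same as B's)
height_c = x_c @ beta               # predicts ~169, since C matches B exactly
print(float(height_c))
```

Here the regression guesses ~169cm for C rather than the 169.5cm column mean, because C's known features coincide with B's; this is the sense in which a learned model can give a more confident guess than the average.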

You are interested in the problem of Data Imputation.

There are several approaches to solving this problem, and I am just going to list some:

  • Mean/Mode/Median Imputation : Imputation fills in missing values with estimated ones, using relationships identified in the valid values of the data set. Mean/mode/median imputation is one of the most frequently used methods: it replaces the missing data for a given attribute with the mean or median (quantitative attribute) or mode (qualitative attribute) of all known values of that variable. This can further be classified into generalized and similar-case imputation.

  • Prediction Model : This is one of the more sophisticated methods for handling missing data. We create a predictive model to estimate values that will substitute for the missing data. We divide the data set into two parts: one with no missing values for the variable, and one with missing values. The first becomes the training set, the second becomes the test set, and the variable with missing values is treated as the target. We then build a model to predict the target from the other attributes of the training set and use it to populate the missing values in the test set.

  • KNN (k-nearest neighbor) Imputation : In this method, a missing value is imputed from the given number (k) of samples that are most similar to the incomplete sample. The similarity of two samples is determined using a distance function over the features they both have.

  • Linear Regression : A linear approach for modeling the relationship between a scalar dependent variable y and one or more explanatory (independent) variables X. Linear regression fits a predictive model to an observed data set of y and X values; if an additional value of X is then given without its accompanying y, the fitted model can be used to predict that y.
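For concreteness, here is a toy sketch of the KNN imputation approach from the list above, run on a slightly extended version of the question's data (the extra fourth row and the choice k = 2 are my assumptions):

```python
import math

# Rows are [weight, height, age]; None marks a missing value.
rows = [
    [70, 170, 60],   # A
    [60, 169, 50],   # B
    [60, None, 50],  # C, height unknown
    [80, 180, 40],   # extra illustrative person
]

def knn_impute(rows, k=2):
    """Fill each missing value with the mean of that feature over the
    k rows most similar on the features both rows know."""
    filled = [list(r) for r in rows]
    for r in filled:
        for j, v in enumerate(r):
            if v is None:
                def dist(other):
                    # Euclidean distance over shared known features, excluding j.
                    return math.sqrt(sum(
                        (a - b) ** 2
                        for i, (a, b) in enumerate(zip(r, other))
                        if i != j and a is not None and b is not None))
                donors = sorted((o for o in rows if o[j] is not None), key=dist)
                r[j] = sum(o[j] for o in donors[:k]) / k
    return filled

print(knn_impute(rows)[2])  # C's height becomes the mean of its 2 nearest donors
```

On this data, C's two nearest donors are B (169cm) and A (170cm), so C's height is imputed as 169.5cm. Libraries such as scikit-learn ship a ready-made version of this (`sklearn.impute.KNNImputer`).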
