Nearest Neighbor for partially unknown vector

Let's say we have a list of people and would like to find people similar to person X.

The feature vector has 3 items, [weight, height, age], and there are 3 persons in our list. Note that we don't know the height of person C.

A: [70kg, 170cm, 60y]
B: [60kg, 169cm, 50y]
C: [60kg, ?,     50y]

What would be the best way to find the people closest to person A?

My guess

Let's calculate the average height and use it in place of the unknown value.

So, let's say we calculated that 170cm is the average height, and we redefine person C as [60kg, ~170cm, 50y].

Now we can rank people by closeness to A, which gives A, C, B.
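Here is a minimal sketch of this approach, assuming plain NumPy, the three people from the example, and Euclidean distance as the closeness measure (the distance metric is not specified in the question):

```python
import numpy as np

# Feature vectors: [weight (kg), height (cm), age (y)]; NaN marks C's unknown height.
people = {
    "A": np.array([70.0, 170.0, 60.0]),
    "B": np.array([60.0, 169.0, 50.0]),
    "C": np.array([60.0, np.nan, 50.0]),
}

# Mean imputation: replace each NaN with the column mean of the known values.
matrix = np.vstack(list(people.values()))
col_means = np.nanmean(matrix, axis=0)
imputed = {name: np.where(np.isnan(v), col_means, v) for name, v in people.items()}

# Rank everyone by Euclidean distance to A (A itself comes first with distance 0).
query = imputed["A"]
ranking = sorted(imputed, key=lambda name: np.linalg.norm(imputed[name] - query))
print(ranking)  # ['A', 'C', 'B'] -- the imputed C edges out B, which is exactly the problem below
```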

Problem

Now, the problem is that we rank C, with a guessed ~170cm, ahead of B, with a known 169cm.

It feels wrong. We humans are smarter than machines and know there's little chance that C is exactly 170cm, so it would be better to rank B, with its known 169cm, ahead of C.

But how can we calculate that penalty (preferably with a simple empirical algorithm)? Should we somehow penalise vectors with unknown values? And by how much (maybe by calculating the average height difference between every pair of people in the set)?

And what would that penalisation look like in the general case, where the feature vector has dimension N with K known items and U unknown items (K + U = N)?

In this particular example, would it be better to use linear regression to fill in the missing values instead of taking the average? That way you may have more confidence in the guessed value and may not need a penalty.

But if you do want a penalty, one idea is to use the ratio of non-missing features. In the example, there are 3 features in total and C has values for 2 of them, so the ratio of non-missing features for C is 2/3. Adjust the similarity score by multiplying it by this ratio. For example, if the similarity between A and C is 0.9, the adjusted similarity is 0.9 * 2/3 = 0.6. The similarity between A and B is not affected, since B has values for all the features and its ratio is 1.

You can also weight the features when computing the ratio. For example, give (weight, height, age) the weights (0.3, 0.4, 0.3) respectively. Then a missing height gives a weighted ratio of 0.3 + 0.3 = 0.6, so C is penalized even more, because we consider height more important than weight and age. A sketch of both variants is shown below.
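A minimal sketch of both the unweighted and the weighted ratio penalty. The base similarity (cosine similarity computed over the features known in both vectors) is my assumption; the answer only prescribes the ratio multiplier:

```python
import numpy as np

def penalized_similarity(a, b, weights=None):
    # Rough sketch of the ratio penalty described above. The base similarity
    # (cosine, over the features known in both vectors) is an assumption;
    # the answer only specifies the ratio multiplier.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    if weights is None:
        weights = np.ones_like(a)
    known = ~(np.isnan(a) | np.isnan(b))
    ratio = weights[known].sum() / weights.sum()  # e.g. 2/3 unweighted, 0.6 weighted for C
    ak, bk = a[known], b[known]
    base = ak @ bk / (np.linalg.norm(ak) * np.linalg.norm(bk))
    return base * ratio

A = [70.0, 170.0, 60.0]
B = [60.0, 169.0, 50.0]
C = [60.0, np.nan, 50.0]

print(penalized_similarity(A, B))                                 # all features known, ratio 1
print(penalized_similarity(A, C))                                 # similarity scaled by 2/3
print(penalized_similarity(A, C, np.array([0.3, 0.4, 0.3])))      # similarity scaled by 0.6
```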

I would suggest using the data points for which we have all the attributes to train a learning model, such as linear regression or a multi-layer perceptron, for the unknown attribute, and then using this model to fill in the unknown attributes. The average is a special case of a linear model. A sketch of this idea follows.
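A minimal sketch using scikit-learn's LinearRegression to predict height from weight and age. Training on only the two complete rows from the example is purely illustrative; this answer assumes a larger data set is available, and the choice of scikit-learn is mine:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Train on the rows whose height is known, then predict the missing height.
# With only two complete rows this is purely illustrative.
X_train = np.array([[70.0, 60.0],    # A: weight, age
                    [60.0, 50.0]])   # B: weight, age
y_train = np.array([170.0, 169.0])   # known heights of A and B

model = LinearRegression().fit(X_train, y_train)
height_c = model.predict(np.array([[60.0, 50.0]]))[0]  # C's weight and age
print(height_c)  # the regression's guess for C's missing height
```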

You are interested in the problem of Data Imputation.

There are several approaches to solving this problem, and I am just going to list some (a short sketch using ready-made imputers follows the list):

  • Mean/Mode/Median Imputation: Imputation is a method for filling in missing values with estimated ones. The objective is to use relationships that can be identified in the valid values of the data set to help estimate the missing values. Mean/mode/median imputation is one of the most frequently used methods. It consists of replacing the missing data for a given attribute with the mean or median (quantitative attribute) or mode (qualitative attribute) of all known values of that variable. This can be further classified into generalized and similar-case imputation.

  • Prediction Model: A prediction model is one of the more sophisticated methods for handling missing data. Here, we create a predictive model to estimate values that will substitute for the missing data. We divide the data set into two sets: one with no missing values for the variable and one with missing values. The first set becomes the training data for the model, the second set (with missing values) is the test data, and the variable with missing values is treated as the target variable. We then build a model that predicts the target variable from the other attributes of the training set and use it to populate the missing values in the test set.

  • KNN (k-nearest neighbor) Imputation: In this method, the missing values of an attribute are imputed using a given number of records that are most similar to the record whose values are missing. The similarity of two records is determined using a distance function.

  • Linear Regression: A linear approach for modeling the relationship between a scalar dependent variable y and one or more explanatory (or independent) variables denoted X. In prediction, linear regression can be used to fit a predictive model to an observed data set of y and X values. Once such a model has been fitted, if an additional value of X is given without its accompanying value of y, the model can be used to predict y. Check this example if you want.
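For completeness, here is a minimal sketch of how the mean and KNN imputation approaches above could look with scikit-learn's off-the-shelf imputers (the library choice is mine; the answer does not name one):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# The three people from the question; np.nan marks C's unknown height.
X = np.array([[70.0, 170.0, 60.0],    # A
              [60.0, 169.0, 50.0],    # B
              [60.0, np.nan, 50.0]])  # C

# Mean imputation (the first approach above).
print(SimpleImputer(strategy="mean").fit_transform(X))

# KNN imputation (the third approach): borrow the height from the most similar complete row.
print(KNNImputer(n_neighbors=1).fit_transform(X))
```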
