
Machine learning with incomplete data

I have one million samples with about 1000 features. However, only a subset of the features is measured for each sample. I want to use machine learning to predict the result from the features, but I do not know how to handle the missing data. Since the data are missing at random positions, I cannot group the samples by which features are missing: the number of groups would be huge, and each would contain only a few samples. What is the best way to handle this kind of problem?

Methods to treat missing values

1. Deletion:

There are two types: list-wise deletion and pair-wise deletion.

  • In list-wise deletion, we delete every observation in which any variable is missing. Simplicity is the main advantage of this method, but it reduces the power of the model because it reduces the sample size.

  • In pair-wise deletion, we perform each analysis with all cases in which the variables of interest are present. The advantage of this method is that it keeps as many cases as possible available for analysis. One disadvantage is that it uses a different sample size for different variables.

  • Deletion methods are used when the data are “missing completely at random”; otherwise, non-random missing values can bias the model output.
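
As a rough illustration of the two deletion styles, here is a minimal pandas sketch. The column names and values are made up for the example, and NaN stands for a feature that was not measured.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): NaN marks a feature that was not measured.
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "b": [2.0, np.nan, 6.0, 8.0, 10.0],
    "c": [1.0, 0.0, 1.0, np.nan, 0.0],
})

# List-wise deletion: keep only the rows in which every variable is observed.
listwise = df.dropna()

# Pair-wise deletion: each pair of variables uses all rows where both are present.
# DataFrame.corr() excludes missing values pair by pair.
pairwise_corr = df.corr()

# Number of rows actually available for each pair of variables.
observed = df.notna().astype(int)
pairwise_n = observed.T @ observed

print(len(df), "rows before,", len(listwise), "rows after list-wise deletion")
print(pairwise_n)
```

With 1000 sparsely measured features, list-wise deletion would most likely leave almost no rows, which is why it is rarely an option for data like that in the question.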

2. Mean/Mode/Median Imputation:

Imputation is a method of filling in the missing values with estimated ones. The objective is to use known relationships that can be identified in the valid values of the data set to help estimate the missing values. Mean/mode/median imputation is one of the most frequently used methods. It consists of replacing the missing data for a given attribute with the mean or median (quantitative attribute) or mode (qualitative attribute) of all known values of that variable. There are two types:

  • Generalized Imputation: In this case, we calculate the mean or median of all non-missing values of the variable and then replace each missing value with it. In the example from the source article, the variable “Manpower” has a missing value, so we take the average of all non-missing values of “Manpower” (28.33) and replace the missing value with it.

  • Similar case Imputation: In this case, we calculate the average of the non-missing values separately for gender “Male” (29.75) and “Female” (25), then replace each missing value based on gender. For “Male”, we replace missing values of manpower with 29.75, and for “Female” with 25.
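
Both variants can be sketched in a few lines of pandas. The data frame below is invented for illustration (it is not the table from the source article, so the means differ from 28.33, 29.75 and 25), and the added column names are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two "manpower" values are missing.
df = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "manpower": [30.0, np.nan, 24.0, 26.0, 28.0, np.nan],
})

# Generalized imputation: one mean over all observed values (NaN is skipped).
overall_mean = df["manpower"].mean()
df["manpower_general"] = df["manpower"].fillna(overall_mean)

# Similar-case imputation: the mean is computed within each gender group.
group_mean = df.groupby("gender")["manpower"].transform("mean")
df["manpower_by_gender"] = df["manpower"].fillna(group_mean)

print(df)
```

Replacing `"mean"` with `"median"` gives median imputation for skewed quantitative attributes.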

3. Prediction Model:

A prediction model is one of the more sophisticated methods for handling missing data. Here, we create a predictive model to estimate the values that will substitute for the missing data. We divide the data set into two sets: one with no missing values for the variable and another with missing values. The first set becomes the training data for the model, the second set (with missing values) becomes the test data, and the variable with missing values is treated as the target variable. Next, we create a model that predicts the target variable from the other attributes of the training data and use it to populate the missing values in the test data. We can use regression, ANOVA, logistic regression, and various other modeling techniques to do this. This approach has two drawbacks:

  • The model-estimated values are usually more well-behaved than the true values.

  • If there are no relationships between the attribute with missing values and the other attributes in the data set, the model's estimates of the missing values will not be precise.
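
The procedure can be sketched with an ordinary least-squares fit in NumPy. Everything here is an assumption made for illustration: the data are synthetic, the target depends linearly on two fully observed features, and roughly 20% of the target values are removed at random.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic data: the target is a noisy linear function of two observed features.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
target = 2.0 * x1 - 1.0 * x2 + 3.0 + rng.normal(scale=0.1, size=n)

# Keep the true values for comparison, then hide about 20% of them.
true_values = target.copy()
missing = rng.random(n) < 0.2
target[missing] = np.nan

# "Training set": rows where the target is observed. "Test set": rows where it is missing.
X = np.column_stack([np.ones(n), x1, x2])  # intercept + features
coef, *_ = np.linalg.lstsq(X[~missing], target[~missing], rcond=None)

# Populate the missing values with the model's predictions.
imputed = target.copy()
imputed[missing] = X[missing] @ coef

print("fitted coefficients:", coef)
print("max abs error on imputed values:",
      np.max(np.abs(imputed[missing] - true_values[missing])))
```

The imputed values lie exactly on the fitted plane, without the noise of the real observations, which is the first drawback listed above in concrete form.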

4. KNN Imputation:

In this method of imputation, the missing values of an attribute are imputed using a given number (k) of instances that are most similar to the instance whose values are missing. The similarity of two instances is determined using a distance function. The method has certain advantages and disadvantages.

Advantages:

  • k-nearest neighbour can predict both qualitative and quantitative attributes

  • Creation of predictive model for each attribute with missing data is not required

  • Attributes with multiple missing values can be easily treated

  • Correlation structure of the data is taken into consideration

Disadvantages:

  • The KNN algorithm is very time-consuming on large databases, because it searches the entire data set for the most similar instances.

  • The choice of k is critical. A higher value of k would include instances that are significantly different from what we need, whereas a lower value of k would miss significant instances.
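
A hand-written sketch of KNN imputation for numeric data is shown below. The function name `knn_impute` and the distance (root mean squared difference over the features observed in both rows) are choices made for this example; libraries such as scikit-learn provide a ready-made `KNNImputer`, which should be preferred in practice.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill each NaN with the mean of that feature over the k nearest rows."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    observed = ~np.isnan(X)
    for i in range(X.shape[0]):
        missing_cols = np.where(~observed[i])[0]
        if missing_cols.size == 0:
            continue
        # Distance to every row, using only the features observed in both rows.
        shared = observed & observed[i]
        diff = np.where(shared, X - X[i], 0.0)
        n_shared = shared.sum(axis=1)
        dist = np.full(X.shape[0], np.inf)
        ok = n_shared > 0
        dist[ok] = np.sqrt((diff[ok] ** 2).sum(axis=1) / n_shared[ok])
        dist[i] = np.inf  # a row is not its own neighbour
        for j in missing_cols:
            # Only rows that have feature j observed can donate a value.
            candidates = np.where(observed[:, j] & np.isfinite(dist))[0]
            if candidates.size == 0:
                continue  # no usable neighbour: leave the value missing
            nearest = candidates[np.argsort(dist[candidates])[:k]]
            filled[i, j] = X[nearest, j].mean()
    return filled

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])
print(knn_impute(X, k=2))
```

The loop compares every incomplete row with every other row, so the cost grows roughly quadratically with the number of samples. For a million samples this naive version is impractical, which matches the first disadvantage listed above.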

Source: https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/

Your problem is a common one in data analysis and machine learning. While it is hard to say exactly how to resolve it without knowing the data, what you want to predict, or the models you are considering (eg generative or discriminative), I will try to give you some pointers.

References

First, some references: I found Benjamin Marlin's PhD thesis ( http://www.cs.ubc.ca/~bmarlin/research/phd_thesis/marlin-phd-thesis.pdf ) to be a good place to start. I haven't read the full thesis, but I have come across it a couple of times. It might give you a quick start on the matter. There is also the book "Statistical Analysis with Missing Data" by Little and Rubin, which might be useful for you. There is a vast body of literature on the topic; this review may help you get an overview: A Review of Methods for Missing Data (the review uses a research study on asthma symptoms as its example, but the approaches may still be useful to you). Besides the literature, there is also a Wikipedia page on Missing Data that might provide some basic insights.

Summary

Some simple approaches to get you started:

  • Determine the type of missing data (this may be crucial for selecting an approach as discussed in the references above):
    • Missing Completely at Random (MCAR): The probability that a feature is missing is completely independent of any observable or unobservable variable.
    • Missing at Random (MAR): The probability that a feature is missing depends only on observed variables (ie other observed variables "explain" the missing feature).
    • Missing Not at Random (MNAR): According to your description, this may not apply to you.
  • Determine the cause of missing data; this will also help you identify the type of missing data, eg the difference between MCAR and MAR, as well as appropriate approaches to missing data.
    • Is the data not available in the first place (eg a classification task with 2 classes where some features do not make sense for one of the classes)?
    • Is the data available but not recorded (eg faulty sensor, or participants in a study not filling in fields)?
    • Is the data recorded but lost during preprocessing (eg sensors recorded max/min values, NaN values or the like which were thrown away in preprocessing, or fields thrown away due to anonymization in studies)?
    • ...
  • Deal with missing data (only some simple approaches here):
    • Ignore missing data (eg ignore features); this may, of course, be difficult if no feature is present for all rows, as can happen with MCAR data.
    • Fill in missing data:
      • Use default values (eg if fields in a study are not filled in by all participants, fill them with the mean value, some default, or some value indicating that the field is missing; the information that the field is missing may also be useful for machine learning, eg for the MAR case).
      • Guess values
      • Infer the value (eg by imputation techniques that may use simple, eg k-NN, or more complex approaches)
        • Interpolation may be a special case here ...
      • Transform the data (eg dimensionality reduction, random projections etc.; this is of course more difficult with categorical data)
    • ...
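
As one concrete version of the "default value plus missing flag" idea from the list above, the sketch below fills each missing entry with the feature mean and appends one indicator column per feature, so that a downstream model can still see which values were missing. The array values are made up for the example.

```python
import numpy as np

# Hypothetical feature matrix: NaN marks a value that was not measured.
X = np.array([
    [1.0, np.nan, 3.0],
    [np.nan, 5.0, 6.0],
    [7.0, 8.0, np.nan],
    [1.0, 2.0, 3.0],
])

mask = np.isnan(X)                       # True where a value is missing
col_means = np.nanmean(X, axis=0)        # per-feature mean over observed values
X_filled = np.where(mask, col_means, X)  # default value: the feature mean

# Append the indicators so the model also sees *that* a value was missing.
X_augmented = np.hstack([X_filled, mask.astype(float)])

print("missing rate per feature:", mask.mean(axis=0))
print(X_augmented)
```

The per-feature missing rate printed here is also a cheap first step toward working out what is missing and how often, before choosing among the approaches above.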

Overall, there are many valid approaches and it depends strongly on your task/application. Still, start by determining why the data is missing and what data is missing. Then, follow some of the references and start trying out simple approaches to see what works for you.
