Data Preprocessing Python

I have a DataFrame in Python and I need to preprocess my data. Which method is best for preprocessing it, given that some variables have a very large scale and others don't? The data doesn't have a large deviation either. I tried the preprocessing.scale function and it works, but I'm not at all sure it is the best way to prepare the data for machine learning algorithms.

There are various techniques for data preprocessing; you can use the ones in sklearn.preprocessing as guidelines to follow.

http://scikit-learn.org/stable/modules/preprocessing.html
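As a rough illustration of how some of the scalers in sklearn.preprocessing behave on features with very different scales, here is a minimal sketch. The toy DataFrame and its column names are made up for the example; which scaler suits your data depends on its distribution and on the model you plan to use.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Toy data (made up for illustration): one small-scale and one large-scale column.
df = pd.DataFrame({
    "small": [0.1, 0.2, 0.3, 0.4, 0.5],
    "large": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0],
})

scalers = {
    "standard": StandardScaler(),  # zero mean, unit variance per column
    "minmax": MinMaxScaler(),      # rescales each column to [0, 1]
    "robust": RobustScaler(),      # centers on the median, scales by the IQR
}

# Apply each scaler to the same data and keep the column names.
scaled = {
    name: pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
    for name, scaler in scalers.items()
}

for name, frame in scaled.items():
    print(name)
    print(frame.describe().loc[["mean", "std", "min", "max"]].round(3))
```

After any of these transformations both columns live on a comparable scale. RobustScaler is usually the one to try when outliers would distort the mean and standard deviation.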

Preprocessing is coupled to the data you are studying, but in general you could explore:

  1. Assess missing values by computing their percentage per column
  2. Compute the variance and remove variables with near-zero variance
  3. Assess the inter-variable correlation to detect redundancy

You can compute these scores easily in pandas as follows:

import logging

import pandas as pd

data_file = "your_input_data_file.csv"
data = pd.read_csv(data_file, delimiter="|")
# variance and correlation are only defined for numeric columns
numeric_data = data.select_dtypes(include="number")

# variance per column, exported as a (feature_names, variance) table
variance = numeric_data.var()
variance = variance.to_frame("variance")
variance["feature_names"] = variance.index
variance.reset_index(drop=True, inplace=True)
# reordering columns
variance = variance[["feature_names", "variance"]]
logging.debug("exporting variance to csv file")
variance.to_csv(data_file + "_variance.csv", sep="|", index=False)

# percentage of missing values per column (fraction of null rows times 100)
missing_values_percentage = 100 * data.isnull().sum() / data.shape[0]
missing_values_percentage = missing_values_percentage.to_frame("missing_values_percentage")
missing_values_percentage["feature_names"] = missing_values_percentage.index
missing_values_percentage.reset_index(drop=True, inplace=True)
missing_values_percentage = missing_values_percentage[["feature_names", "missing_values_percentage"]]
logging.debug("exporting missing values to csv file")
missing_values_percentage.to_csv(data_file + "_missing_values.csv", sep="|", index=False)

# pairwise Pearson correlation between numeric columns
correlation = numeric_data.corr()
logging.debug("exporting correlation to csv file")
correlation.to_csv(data_file + "_correlation.csv", sep="|")

The above generates three files holding, respectively, the variance, the missing values percentage and the correlation results.
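Once those scores are computed, they can drive the actual feature removal. The following is only a sketch: the function name select_features and the three thresholds are hypothetical choices for illustration, not recommended values, and the variance threshold in particular depends on the scale of each column.

```python
import numpy as np
import pandas as pd


def select_features(data, max_missing=0.5, min_variance=1e-8, max_corr=0.95):
    """Drop sparse, near-constant and highly correlated numeric columns.

    All thresholds are illustrative assumptions; tune them to your data.
    """
    numeric = data.select_dtypes(include="number")

    # 1. Drop columns with too large a fraction of missing values.
    missing = numeric.isnull().mean()
    numeric = numeric.loc[:, missing <= max_missing]

    # 2. Drop columns whose variance is close to zero.
    variance = numeric.var()
    numeric = numeric.loc[:, variance > min_variance]

    # 3. For each highly correlated pair, drop the later column.
    corr = numeric.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    redundant = [col for col in upper.columns if (upper[col] > max_corr).any()]
    return numeric.drop(columns=redundant)


# Toy data (made up) exercising each of the three rules.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "a_copy": [2.0, 4.0, 6.0, 8.0],           # perfectly correlated with "a"
    "constant": [5.0, 5.0, 5.0, 5.0],         # zero variance
    "sparse": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "b": [4.0, 1.0, 3.0, 2.0],
})
kept = select_features(toy)
print(list(kept.columns))
```

Keeping the first column of each correlated pair is an arbitrary choice; in practice you may prefer to keep whichever feature is easier to interpret or has fewer missing values.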

Refer to this blog article for a hands-on tutorial.

  1. Always split your data into a training set and a test set, so that you can detect overfitting on data the model has not seen.

  2. If some of your features have a large scale and some don't, you should standardize the data. Make sure to fit the standardization on the training set only and then apply it to the test set, so that no information from the test set leaks into training.

  3. You also have to look for missing data and either replace or remove it. If less than 0.5% of the data in a column is missing you can use 'dropna'; otherwise you have to replace it with something (you can replace it with zero, the mean, the previous value...).

  4. You also have to check for outliers, for example with a boxplot. Outliers are points that are significantly different from the other data in the same group, and they can also affect your predictions in machine learning.

  5. It is best to check for multicollinearity as well. If some features are strongly correlated with each other, we have multicollinearity, which can cause wrong predictions from the model.

  6. Before you can use your data, some of the columns might be categorical, in which case they should be converted to numerical ones.
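Several of the steps above can be sketched together with scikit-learn. The data, column names and parameter choices below (test size, mean imputation, the 1.5 * IQR outlier rule) are assumptions made for illustration; this is one possible arrangement, not the only correct one.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: two numeric features on different scales, one categorical.
df = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 51, 38, 29, 44],
    "income": [30000, 52000, 81000, 45000, 250000, 61000, 39000, 72000],
    "city": ["A", "B", "A", "C", "B", "A", "C", "B"],
    "target": [0, 1, 1, 0, 1, 0, 0, 1],
})
X = df.drop(columns="target")
y = df["target"]

# Step 1: split first, so the test set stays unseen during preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Step 4: flag outliers in the training set with the 1.5 * IQR rule
# (the same rule a boxplot uses for its whiskers).
q1, q3 = X_train["income"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (X_train["income"] < q1 - 1.5 * iqr) | (
    X_train["income"] > q3 + 1.5 * iqr
)

# Steps 2, 3 and 6: impute and standardize the numeric columns,
# one-hot encode the categorical one.
numeric_cols = ["age", "income"]
categorical_cols = ["city"]
preprocess = ColumnTransformer(
    [
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="mean")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training set only, then apply the same transformation to both.
X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)

print(X_train_prepared.shape, X_test_prepared.shape)
print("outliers flagged in training income:", int(is_outlier.sum()))
```

Wrapping the steps in a ColumnTransformer means the imputation means, scaling statistics and category list are all learned from the training set, and the test set is only ever transformed with them.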
