Data Preprocessing Python

I have a DataFrame in Python and I need to preprocess my data. Which is the best method to preprocess it, given that some variables have a huge scale and others don't? The data doesn't have a huge deviance either. I tried the preprocessing.scale function and it works, but I'm not at all sure it is the best way to prepare the data for machine learning algorithms.

There are various techniques for data preprocessing. You can refer to the ideas in sklearn.preprocessing as potential guidelines to follow.

http://scikit-learn.org/stable/modules/preprocessing.html
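As a minimal sketch of what standardization does to columns on very different scales, the example below applies StandardScaler to a small made-up DataFrame (the column names and values are assumptions for illustration only). preprocessing.scale performs the same standardization in a single call; StandardScaler additionally keeps the fitted mean and standard deviation so they can be reused on new data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one large-scale column and one small-scale column.
df = pd.DataFrame({
    "income": [32000.0, 45000.0, 58000.0, 120000.0],
    "rating": [0.2, 0.5, 0.4, 0.9],
})

# Fit the scaler (learn per-column mean and standard deviation) and transform.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)

# After scaling, every column has mean ~0 and population standard deviation ~1.
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```

Whether standardization is the best choice depends on the model: distance-based and gradient-based methods usually benefit from it, while tree-based models are largely insensitive to feature scale.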

Preprocessing depends on the data you are studying, but in general you could explore:

  1. Assess missing values by computing their percentage per column
  2. Compute the variance and remove variables with near-zero variance
  3. Assess the inter-variable correlation to detect redundancy

You can compute these scores easily in pandas as follows:

import logging

import pandas as pd

data_file = "your_input_data_file.csv"
data = pd.read_csv(data_file, delimiter="|")
# variance and correlation are only defined for numeric columns
numeric_data = data.select_dtypes(include="number")

# variance of each feature, as a two-column table
variance = numeric_data.var()
variance = variance.to_frame("variance")
variance["feature_names"] = variance.index
variance.reset_index(drop=True, inplace=True)
# reordering columns
variance = variance[["feature_names", "variance"]]
logging.debug("exporting variance to csv file")
variance.to_csv(data_file + "_variance.csv", sep="|", index=False)

# fraction of missing values per feature (multiply by 100 for a percentage)
missing_values_percentage = data.isnull().sum() / data.shape[0]
missing_values_percentage = missing_values_percentage.to_frame("missing_values_percentage")
missing_values_percentage["feature_names"] = missing_values_percentage.index
missing_values_percentage.reset_index(drop=True, inplace=True)
missing_values_percentage = missing_values_percentage[["feature_names", "missing_values_percentage"]]
logging.debug("exporting missing values to csv file")
missing_values_percentage.to_csv(data_file + "_missing_values.csv", sep="|", index=False)

# pairwise correlation between numeric features
correlation = numeric_data.corr()
correlation.to_csv(data_file + "_correlation.csv", sep="|")

The above generates three files holding, respectively, the variance, the missing-value percentage, and the correlation results.
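The scores can also be used directly to filter features. The sketch below drops near-zero-variance columns and then one column from each highly correlated pair. The toy data and both thresholds are assumptions chosen for illustration; suitable cutoffs depend on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; in practice `data` would be the DataFrame loaded above.
data = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],   # almost a multiple of "a"
    "c": [7.0, 7.0, 7.0, 7.0, 7.0],    # constant column
    "d": [5.0, 1.0, 4.0, 2.0, 3.0],
})

VARIANCE_THRESHOLD = 1e-8      # assumed cutoff for "near zero" variance
CORRELATION_THRESHOLD = 0.95   # assumed cutoff for "redundant" features

# Step 1: drop columns whose variance is at or below the threshold.
variance = data.var()
low_variance = variance[variance <= VARIANCE_THRESHOLD].index.tolist()
filtered = data.drop(columns=low_variance)

# Step 2: look only at the upper triangle of the absolute correlation matrix,
# so each pair is seen once, and drop the later column of each correlated pair.
corr = filtered.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [col for col in upper.columns if (upper[col] > CORRELATION_THRESHOLD).any()]
reduced = filtered.drop(columns=redundant)

print("low variance:", low_variance)
print("redundant:", redundant)
print("kept:", reduced.columns.tolist())
```

Dropping the later column of a pair is an arbitrary tie-break; with domain knowledge you may prefer to keep whichever feature is easier to interpret or measure.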

Refer to this blog article for a hands-on tutorial.

  1. Always split your data into a training set and a test set so that you can detect overfitting.

  2. If some of your features have a large scale and others don't, you should standardize the data. Make sure to fit the standardization on the training set only, and then apply it to the test set, so that test data does not leak into the model and hide overfitting.

  3. You also have to look for missing data and replace or remove it. If less than 0.5% of the data in a column is missing you can use 'dropna'; otherwise you have to replace it with something (for example zero, the mean, or the previous value).

  4. You also have to check for outliers, for example with a boxplot. Outliers are points that differ significantly from the other data in the same group, and they can affect your predictions in machine learning.

  5. It is best to check for multicollinearity as well. If some features are strongly correlated with each other, you have multicollinearity, which can cause wrong predictions from your model.

  6. Some of the columns might be categorical; these should be converted to numerical before you use the data.
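Several of the steps above can be sketched together with scikit-learn. This is one possible arrangement, not the only one: the toy DataFrame, the column names, the mean imputation strategy, and the 1.5 * IQR outlier rule (the rule a boxplot uses for its whiskers) are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns on different scales, one categorical.
df = pd.DataFrame({
    "income": [32000.0, 45000.0, np.nan, 120000.0, 51000.0, 47000.0, 39000.0, 60000.0],
    "rating": [0.2, 0.5, 0.4, 0.9, np.nan, 0.3, 0.6, 0.7],
    "city": ["A", "B", "A", "C", "B", "A", "C", "B"],
    "target": [0, 1, 0, 1, 1, 0, 0, 1],
})
numeric_cols = ["income", "rating"]
categorical_cols = ["city"]

# Item 1: split first, so nothing is learned from the test rows.
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="target"), df["target"], test_size=0.25, random_state=0
)

# Items 2, 3 and 6: impute and standardize numeric columns,
# one-hot encode categorical columns.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training split only, then reuse the fitted statistics on the test split.
X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)

# Item 4: flag outliers in the training data with the 1.5 * IQR rule.
q1 = X_train[numeric_cols].quantile(0.25)
q3 = X_train[numeric_cols].quantile(0.75)
iqr = q3 - q1
below = X_train[numeric_cols].lt(q1 - 1.5 * iqr, axis="columns")
above = X_train[numeric_cols].gt(q3 + 1.5 * iqr, axis="columns")
outlier_mask = below | above

print(X_train_prepared.shape, X_test_prepared.shape)
print(outlier_mask.sum())
```

Wrapping the steps in a ColumnTransformer keeps the "fit on train, apply to test" rule in one place, since only fit_transform ever sees the training rows and transform reuses what was learned.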
