
What features to use for regression or classification?

Is there a way to determine which features are the most relevant for my machine learning model? If I have 20 features, is there a function that will decide which features I should use (or a function that will automatically remove features that are not relevant)? I plan to do this for a regression or classification model.

My desired output is a list of the most relevant features, plus a prediction.

import pandas as pd
from sklearn.linear_model import LinearRegression

dic = {'par_1': [10, 30, 11, 19, 28, 33, 23],
       'par_2': [1, 3, 1, 2, 3, 3, 2],
       'par_3': [15, 3, 16, 65, 24, 56, 13],
       'outcome': [101, 905, 182, 268, 646, 624, 465]}

df = pd.DataFrame(dic)

variables = df.iloc[:,:-1]
results = df.iloc[:,-1]

print(variables.shape)
print(results.shape)


reg = LinearRegression()
reg.fit(variables, results)

x = reg.predict([[18, 2, 21]])[0]
print(x)

The term you are looking for is feature selection: it consists of identifying which features are the most relevant for your analysis. The scikit-learn documentation has a whole section dedicated to it (the feature selection user guide, covering the sklearn.feature_selection module).
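As a rough illustration of what that module offers, the sketch below applies SelectKBest with the f_regression score to the data from the question. The choice of k=2 and of a univariate F-test are assumptions made for the example, not a recommendation for every dataset.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Same toy data as in the question.
df = pd.DataFrame({
    'par_1': [10, 30, 11, 19, 28, 33, 23],
    'par_2': [1, 3, 1, 2, 3, 3, 2],
    'par_3': [15, 3, 16, 65, 24, 56, 13],
    'outcome': [101, 905, 182, 268, 646, 624, 465],
})
X = df.drop(columns='outcome')
y = df['outcome']

# Keep the k features with the highest univariate F-score against the target.
# k=2 is an arbitrary choice for this illustration.
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)

# Per-feature scores, highest first, and the names of the kept columns.
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
selected_features = list(X.columns[selector.get_support()])

print(scores)
print(selected_features)
```

A univariate score looks at each feature in isolation, so it cannot tell that two kept features carry nearly the same information. Treat it as a first filter only.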

Another possibility is to resort to dimensionality reduction techniques, like PCA (Principal Component Analysis) or Random Projections. Each technique has its pros and cons, so much depends on the data you have and the specific application.

You can access the coef_ attribute of your reg object:

print(reg.coef_)

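It's an oversimplification to call these weights, as they have a specific meaning in linear regression: each coefficient is the expected change in the outcome per unit change of that feature with the others held fixed, so its magnitude depends on the feature's scale. But they're what you have.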
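One common way to make the coefficients more comparable is to standardize the features before fitting. The following is a minimal sketch under that assumption, reusing the question's data; the names X_scaled, reg_scaled and std_coefs are introduced here for illustration only.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'par_1': [10, 30, 11, 19, 28, 33, 23],
    'par_2': [1, 3, 1, 2, 3, 3, 2],
    'par_3': [15, 3, 16, 65, 24, 56, 13],
    'outcome': [101, 905, 182, 268, 646, 624, 465],
})
X = df.drop(columns='outcome')
y = df['outcome']

# Rescale every feature to zero mean and unit variance, so each coefficient
# is expressed per standard deviation of its feature.
X_scaled = StandardScaler().fit_transform(X)

reg_scaled = LinearRegression().fit(X_scaled, y)
std_coefs = pd.Series(reg_scaled.coef_, index=X.columns)

# Rank by magnitude; the sign only gives the direction of the effect.
print(std_coefs.abs().sort_values(ascending=False))
```

Even after scaling, coefficients of strongly correlated features can trade off against each other, so this ranking should be read with caution.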

When using a linear model, it is important to use linearly independent features. You can inspect the pairwise correlations with df.corr():

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error

np.random.seed(2)  # numpy is imported as np above

dic = {'par_1': [10, 30, 11, 19, 28, 33, 23],
       'par_2': [1, 3, 1, 2, 3, 3, 2],
       'par_3': [15, 3, 16, 65, 24, 56, 13],
       'outcome': [101, 905, 182, 268, 646, 624, 465]}

df = pd.DataFrame(dic)

print(df.corr())
out:
            par_1     par_2     par_3   outcome
par_1    1.000000  0.977935  0.191422  0.913878
par_2    0.977935  1.000000  0.193213  0.919307
par_3    0.191422  0.193213  1.000000 -0.158170
outcome  0.913878  0.919307 -0.158170  1.000000

You can see that par_1 and par_2 are strongly correlated. As @taga mentioned, you can use PCA to map your features to a lower-dimensional space where they are linearly independent:

variables = df.iloc[:,:-1]
results = df.iloc[:,-1]

pca = PCA(n_components=2)
pca_all = pca.fit_transform(variables)

print(np.corrcoef(pca_all[:, 0], pca_all[:, 1]))
out:
[[1.00000000e+00 1.87242048e-16]
 [1.87242048e-16 1.00000000e+00]]
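Note that PCA builds new axes out of all the original features rather than picking a subset of them. A short sketch of how one might inspect this on the same data is given below; the row labels pc_1 and pc_2 and the name loadings are made up for the example.

```python
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({
    'par_1': [10, 30, 11, 19, 28, 33, 23],
    'par_2': [1, 3, 1, 2, 3, 3, 2],
    'par_3': [15, 3, 16, 65, 24, 56, 13],
    'outcome': [101, 905, 182, 268, 646, 624, 465],
})
variables = df.iloc[:, :-1]

pca = PCA(n_components=2)
pca.fit(variables)

# Fraction of the total variance captured by each component.
print(pca.explained_variance_ratio_)

# Each row is one component written as weights over the original features.
loadings = pd.DataFrame(pca.components_,
                        columns=variables.columns,
                        index=['pc_1', 'pc_2'])
print(loadings)
```

Because PCA is sensitive to the scale of the inputs, the feature with the largest variance tends to dominate the first component unless the data is standardized first.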

Remember to validate your model on out-of-sample data:

X_train = variables[:4]
y_train = results[:4]
X_valid = variables[4:]
y_valid = results[4:]

pca = PCA(n_components=2)
pca.fit(X_train)

pca_train = pca.transform(X_train)
pca_valid = pca.transform(X_valid)
print(pca_train)

reg = LinearRegression()
reg.fit(pca_train, y_train)

yhat_train = reg.predict(pca_train)
yhat_valid = reg.predict(pca_valid)

# scikit-learn's convention is (y_true, y_pred)
print(mean_squared_error(y_train, yhat_train))
print(mean_squared_error(y_valid, yhat_valid))

Feature selection is not trivial: there are many sklearn modules that perform it (see the docs), and you should always try at least a couple of them and see which ones improve performance on out-of-sample data.
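A sketch of such a comparison is shown below. It uses synthetic data from make_regression (an assumption, since the toy data above is too small for cross-validation) and compares no selection, SelectKBest and RFE by cross-validated mean squared error. The candidate list and k=5 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data (assumed for illustration): 20 features, 5 of them informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

candidates = {
    'no selection': make_pipeline(LinearRegression()),
    'SelectKBest(k=5)': make_pipeline(SelectKBest(f_regression, k=5),
                                      LinearRegression()),
    'RFE(5 features)': make_pipeline(RFE(LinearRegression(),
                                         n_features_to_select=5),
                                     LinearRegression()),
}

cv_mse = {}
for name, pipeline in candidates.items():
    # The selector is refit inside every fold, so the held-out fold never
    # influences which features are kept.
    scores = cross_val_score(pipeline, X, y, cv=5,
                             scoring='neg_mean_squared_error')
    cv_mse[name] = -scores.mean()

for name, mse in cv_mse.items():
    print(f'{name}: {mse:.1f}')
```

Putting the selector inside the pipeline matters: selecting features on the full dataset first and cross-validating afterwards would leak information from the validation folds.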

Initially I faced the same problem. These are the two methods I find useful for selecting relevant features.

1. You can get the importance of each feature in your dataset from the model's feature importance attribute. feature_importances_ is a built-in attribute of fitted tree-based classifiers.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier

data = pd.read_csv("D://Blogs//train.csv")
X = data.iloc[:, 0:20]  # independent columns
y = data.iloc[:, -1]    # target column, i.e. price range

model = ExtraTreesClassifier()
model.fit(X, y)

# feature_importances_ is an attribute of fitted tree-based models
print(model.feature_importances_)

# plot the 10 most important features for easier comparison
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()
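If the goal is to drop the less important features automatically, scikit-learn's SelectFromModel can wrap a tree-based estimator. The sketch below is one possible setup on the question's regression data; ExtraTreesRegressor, the 'mean' threshold and random_state=0 are assumptions chosen for the example, and with only seven rows the result is not meaningful beyond showing the mechanics.

```python
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.feature_selection import SelectFromModel

df = pd.DataFrame({
    'par_1': [10, 30, 11, 19, 28, 33, 23],
    'par_2': [1, 3, 1, 2, 3, 3, 2],
    'par_3': [15, 3, 16, 65, 24, 56, 13],
    'outcome': [101, 905, 182, 268, 646, 624, 465],
})
X = df.drop(columns='outcome')
y = df['outcome']

# Keep only the features whose importance is at least the mean importance.
selector = SelectFromModel(
    ExtraTreesRegressor(n_estimators=100, random_state=0),
    threshold='mean',
)
selector.fit(X, y)

# Importances come from the estimator fitted inside the selector.
importances = pd.Series(selector.estimator_.feature_importances_,
                        index=X.columns)
kept = list(X.columns[selector.get_support()])
X_reduced = selector.transform(X)

print(importances.sort_values(ascending=False))
print(kept)
```

The reduced matrix X_reduced can then be passed to any downstream model in place of X.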


2. Correlation matrix with heatmap

Correlation describes how the features are related to each other and to the target variable. It gives an intuition of which features move together with the target.
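A minimal sketch of such a heatmap, drawn with plain matplotlib on the data from the question, could look like the following. The colour map and layout are arbitrary choices, and target_corr is a name introduced here.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'par_1': [10, 30, 11, 19, 28, 33, 23],
    'par_2': [1, 3, 1, 2, 3, 3, 2],
    'par_3': [15, 3, 16, 65, 24, 56, 13],
    'outcome': [101, 905, 182, 268, 646, 624, 465],
})
corr = df.corr()

# Rank the features by the strength of their linear relationship with the target.
target_corr = corr['outcome'].drop('outcome').abs().sort_values(ascending=False)
print(target_corr)

# Draw the full correlation matrix as a colour-coded grid.
fig, ax = plt.subplots()
im = ax.imshow(corr.to_numpy(), cmap='coolwarm', vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha='right')
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(im, ax=ax, label='Pearson correlation')
fig.tight_layout()
plt.show()
```

Pearson correlation only captures linear relationships, so a feature with a low value here may still be useful to a non-linear model.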


This is not my own research; it comes from a blog post on feature selection that helped clear up my doubts, and I'm sure it will clear up yours too. :)


