What features to use for regression or classification?
Is there a way to determine which features are the most relevant for my machine learning model? If I have 20 features, is there a function that will decide which features I should use (or a function that will automatically remove features that are not relevant)? I plan to do this for a regression or classification model.
My desired output is a list of the most relevant features, plus a prediction.
import pandas as pd
from sklearn.linear_model import LinearRegression
dic = {'par_1': [10, 30, 11, 19, 28, 33, 23],
'par_2': [1, 3, 1, 2, 3, 3, 2],
'par_3': [15, 3, 16, 65, 24, 56, 13],
'outcome': [101, 905, 182, 268, 646, 624, 465]}
df = pd.DataFrame(dic)
variables = df.iloc[:,:-1]
results = df.iloc[:,-1]
print(variables.shape)
print(results.shape)
reg = LinearRegression()
reg.fit(variables, results)
x = reg.predict([[18, 2, 21]])[0]
print(x)
The term you are looking for is feature selection: it consists of identifying which features are the most relevant for your analysis. The scikit-learn library has a whole section of its documentation dedicated to it.
Another possibility is to resort to dimensionality reduction techniques, like PCA (Principal Component Analysis) or Random Projections. Each technique has its pros and cons, so much depends on the data you have and the specific application.
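As one possible illustration of feature selection on the data from the question, here is a minimal sketch using SelectKBest with the f_regression score function from sklearn.feature_selection. The choice of k=2 is an assumption made for the example, not a recommendation; univariate scoring is only one of several strategies the library offers.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

# Toy data from the question
df = pd.DataFrame({'par_1': [10, 30, 11, 19, 28, 33, 23],
                   'par_2': [1, 3, 1, 2, 3, 3, 2],
                   'par_3': [15, 3, 16, 65, 24, 56, 13],
                   'outcome': [101, 905, 182, 268, 646, 624, 465]})
X = df.drop(columns='outcome')
y = df['outcome']

# Keep the k features with the highest univariate F-score (k=2 is an arbitrary choice)
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)

# Names of the features that were kept
selected = list(X.columns[selector.get_support()])
print(selected)

# Fit on the reduced feature set and predict for a new row
reg = LinearRegression().fit(X_selected, y)
new_row = pd.DataFrame([[18, 2, 21]], columns=X.columns)
prediction = reg.predict(selector.transform(new_row))[0]
print(prediction)
```

This gives both outputs the question asks for: a list of the retained feature names and a prediction made from those features only.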
You can access the coef_ attribute of your reg object:
print(reg.coef_)
It's an oversimplification to call these weights, as they have a specific meaning in linear regression. But they're what you have.
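One caveat is that raw coefficients depend on the scale of each feature, so they are hard to compare directly. A common workaround, sketched below as an assumption rather than something from the original answer, is to standardize the features with StandardScaler before fitting. Even then, the ranking is only a rough guide when features are correlated with each other.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Toy data from the question
df = pd.DataFrame({'par_1': [10, 30, 11, 19, 28, 33, 23],
                   'par_2': [1, 3, 1, 2, 3, 3, 2],
                   'par_3': [15, 3, 16, 65, 24, 56, 13],
                   'outcome': [101, 905, 182, 268, 646, 624, 465]})
X = df.drop(columns='outcome')
y = df['outcome']

# Put every feature on the same scale (zero mean, unit variance)
X_scaled = StandardScaler().fit_transform(X)

reg = LinearRegression().fit(X_scaled, y)

# Coefficients now share a unit: change in outcome per one standard deviation
coefs = pd.Series(reg.coef_, index=X.columns)
ranked = coefs.abs().sort_values(ascending=False)
print(ranked)
```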
When using a linear model, it is important to use linearly independent features. You can inspect the correlations between columns with df.corr():
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error
np.random.seed(2)  # numpy is imported as np above
dic = {'par_1': [10, 30, 11, 19, 28, 33, 23],
'par_2': [1, 3, 1, 2, 3, 3, 2],
'par_3': [15, 3, 16, 65, 24, 56, 13],
'outcome': [101, 905, 182, 268, 646, 624, 465]}
df = pd.DataFrame(dic)
print(df.corr())
out:
            par_1     par_2     par_3   outcome
par_1    1.000000  0.977935  0.191422  0.913878
par_2    0.977935  1.000000  0.193213  0.919307
par_3    0.191422  0.193213  1.000000 -0.158170
outcome  0.913878  0.919307 -0.158170  1.000000
You can see that par_1 and par_2 are strongly correlated. As @taga mentioned, you can use PCA to map your features to a lower-dimensional space where the resulting components are uncorrelated:
variables = df.iloc[:,:-1]
results = df.iloc[:,-1]
pca = PCA(n_components=2)
pca_all = pca.fit_transform(variables)
print(np.corrcoef(pca_all[:, 0], pca_all[:, 1]))
out:
[[1.00000000e+00 1.87242048e-16]
[1.87242048e-16 1.00000000e+00]]
Remember to validate your model on out-of-sample data:
X_train = variables[:4]
y_train = results[:4]
X_valid = variables[4:]
y_valid = results[4:]
pca = PCA(n_components=2)
pca.fit(X_train)
pca_train = pca.transform(X_train)
pca_valid = pca.transform(X_valid)
print(pca_train)
reg = LinearRegression()
reg.fit(pca_train, y_train)
yhat_train = reg.predict(pca_train)
yhat_valid = reg.predict(pca_valid)
print(mean_squared_error(y_train, yhat_train))  # in-sample error
print(mean_squared_error(y_valid, yhat_valid))  # out-of-sample error
Feature selection is not trivial: there are many sklearn modules that achieve it (see the docs), and you should always try at least a couple of them and see which one increases performance on out-of-sample data.
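To make that comparison concrete, here is a rough sketch that scores two selectors against a no-selection baseline using cross-validation. The synthetic dataset from make_regression, the choice of SelectKBest and RFE, and the value of 5 retained features are all assumptions made for illustration; substitute your own data and candidates.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a 20-feature dataset (only 5 features carry signal)
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Selection lives inside the pipeline so it is refit on each training fold
candidates = {
    'all features': make_pipeline(LinearRegression()),
    'SelectKBest': make_pipeline(SelectKBest(f_regression, k=5),
                                 LinearRegression()),
    'RFE': make_pipeline(RFE(LinearRegression(), n_features_to_select=5),
                         LinearRegression()),
}

cv_mse = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5,
                             scoring='neg_mean_squared_error')
    cv_mse[name] = -scores.mean()  # flip sign: lower MSE is better
    print(f'{name}: {cv_mse[name]:.2f}')
```

Putting the selector inside the pipeline matters: selecting features on the full dataset before splitting would leak information from the validation folds into the selection step.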
Well, initially I faced the same problem. These are the two methods that I find useful for selecting relevant features.
1. You can get the importance of each feature in your dataset from the model's feature_importances_ attribute. This attribute is built into tree-based classifiers.
import pandas as pd
import numpy as np
data = pd.read_csv("D://Blogs//train.csv")
X = data.iloc[:,0:20] #independent columns
y = data.iloc[:,-1] #target column i.e price range
from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt
model = ExtraTreesClassifier()
model.fit(X,y)
print(model.feature_importances_) #built-in feature_importances_ attribute of tree-based classifiers
#plot graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()
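The snippet above depends on a local CSV file. As a self-contained variant, the sketch below uses synthetic data from make_classification and goes one step further with SelectFromModel, which drops low-importance features automatically. The dataset, the n_estimators value, and the "median" threshold are assumptions chosen for the example, not part of the original answer.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for a 20-feature classification dataset
X_arr, y = make_classification(n_samples=300, n_features=20,
                               n_informative=5, random_state=0)
X = pd.DataFrame(X_arr, columns=[f'feat_{i}' for i in range(20)])

model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Importances are normalized, so they sum to 1 across all features
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.nlargest(5))

# Keep only features whose importance is at least the median importance
selector = SelectFromModel(model, threshold='median', prefit=True)
kept = list(X.columns[selector.get_support()])
X_reduced = X[kept]
print(kept)
```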
2. Correlation matrix with heatmap
Correlation describes how the features are related to each other and to the target variable. A heatmap of the correlation matrix gives a quick intuition of which features are most strongly correlated with the target.
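A heatmap like this can be drawn with plain matplotlib, as in the sketch below on the toy data from the question. The colormap, the output filename, and the use of imshow are choices made for this example; a dedicated plotting library could be used instead.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data from the question
df = pd.DataFrame({'par_1': [10, 30, 11, 19, 28, 33, 23],
                   'par_2': [1, 3, 1, 2, 3, 3, 2],
                   'par_3': [15, 3, 16, 65, 24, 56, 13],
                   'outcome': [101, 905, 182, 268, 646, 624, 465]})
corr = df.corr()  # Pearson correlation between every pair of columns

fig, ax = plt.subplots()
im = ax.imshow(corr.to_numpy(), cmap='coolwarm', vmin=-1, vmax=1)

# Label both axes with the column names
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha='right')
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)

# Write the correlation value inside each cell
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, f'{corr.iloc[i, j]:.2f}', ha='center', va='center')

fig.colorbar(im, ax=ax, label='Pearson correlation')
fig.tight_layout()
fig.savefig('correlation_heatmap.png')  # hypothetical output filename
```

Reading along the outcome row shows which features move with the target, while large off-diagonal values between features (here par_1 and par_2) point to redundancy.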
This is not my own research; it comes from a blog post on feature selection, which helped clear up my doubts, and I'm sure it will do the same for you. :)