
Find out which features are collinear in a dataset

I have built a model that predicts the price of a house from multiple features.

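import statsmodels.api as statsmdl

# `data` is assumed to be a pandas DataFrame that already holds these columns.
X = data[['NumberofRooms', 'YearBuilt', 'Type', 'NewConstruction']]
y = data["Price"]

# Fit an ordinary least squares model and inspect the result.
model = statsmdl.OLS(y, X).fit()
predictions = model.predict(X)
model.summary()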

How can I figure out which of these features are collinear?

You can use the DataFrame.corr() method, which returns the pairwise correlation of the columns; coefficients close to 1 or -1 point to collinear features.

Demo:

In [27]: df = pd.DataFrame(np.random.randint(10, size=(5,3)), columns=list('abc'))

In [28]: df['d'] = df['a'] * 10 - df['b'] / np.pi

In [29]: df['e'] = np.log(df['c'] **2)

In [30]: c = df.corr()

In [31]: c
Out[31]:
          a         b         c         d         e
a  1.000000  0.734858  0.113787  0.999837  0.067358
b  0.734858  1.000000 -0.523635  0.722485 -0.598739
c  0.113787 -0.523635  1.000000  0.129945  0.984257
d  0.999837  0.722485  0.129945  1.000000  0.084615
e  0.067358 -0.598739  0.984257  0.084615  1.000000

In [32]: c[c.abs() >= 0.7]
Out[32]:
          a         b         c         d         e
a  1.000000  0.734858       NaN  0.999837       NaN
b  0.734858  1.000000       NaN  0.722485       NaN
c       NaN       NaN  1.000000       NaN  0.984257
d  0.999837  0.722485       NaN  1.000000       NaN
e       NaN       NaN  0.984257       NaN  1.000000

In [33]: c[c.abs() >= 0.7].stack().reset_index(name='cor').query("abs(cor) < 1.0")
Out[33]:
   level_0 level_1       cor
1        a       b  0.734858
2        a       d  0.999837
3        b       a  0.734858
5        b       d  0.722485
7        c       e  0.984257
8        d       a  0.999837
9        d       b  0.722485
11       e       c  0.984257
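
The listing above reports every pair twice, once as (a, b) and once as (b, a). A minimal sketch of one way to report each pair only once, and to catch strong negative correlations as well, is shown below. The helper name correlated_pairs, the 0.7 default threshold, and the extra column f are illustrative assumptions, not part of the original answer. Keep in mind that pairwise correlation only reveals linear relationships between two columns at a time.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.7):
    """List each pair of numeric columns with |correlation| >= threshold once.

    Hypothetical helper for illustration; the name and default are assumptions.
    """
    # Non-numeric columns cannot be correlated directly, so drop them first.
    corr = df.select_dtypes(include="number").corr()
    values = corr.to_numpy()
    # The strict upper triangle skips the diagonal and the mirrored (b, a) entries.
    upper = np.triu(np.ones(values.shape, dtype=bool), k=1)
    rows, cols = np.where(upper & (np.abs(values) >= threshold))
    pairs = pd.DataFrame({
        "feature_1": corr.index[rows],
        "feature_2": corr.columns[cols],
        "cor": values[rows, cols],
    })
    # Strongest relationships first, regardless of sign.
    order = pairs["cor"].abs().sort_values(ascending=False).index
    return pairs.loc[order].reset_index(drop=True)


# Synthetic data in the spirit of the demo above (values start at 1 to avoid log(0)).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.integers(1, 10, size=(50, 3)), columns=list("abc"))
demo["d"] = demo["a"] * 10 - demo["b"] / np.pi   # almost a rescaled copy of a
demo["e"] = np.log(demo["c"] ** 2)               # monotonic function of c
demo["f"] = -2 * demo["c"]                       # perfectly negatively correlated with c

print(correlated_pairs(demo))
```

Filtering on the absolute value matters here: the pair (c, f) has a correlation of about -1, which a plain `>= 0.7` comparison would miss.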
