Find out which features are collinear in a dataset
I have built a model to predict the price of a house from multiple features.
import statsmodels.api as statsmdl
from sklearn import datasets
X = data[['NumberofRooms', 'YearBuilt', 'Type', 'NewConstruction']]
y = data["Price"]
model = statsmdl.OLS(y, X).fit()
predictions = model.predict(X)
model.summary()
How can I figure out which of these features are collinear?
You can use the DataFrame.corr() method.
Demo (with numpy imported as np and pandas imported as pd):
In [27]: df = pd.DataFrame(np.random.randint(10, size=(5,3)), columns=list('abc'))
In [28]: df['d'] = df['a'] * 10 - df['b'] / np.pi
In [29]: df['e'] = np.log(df['c'] **2)
In [30]: c = df.corr()
In [31]: c
Out[31]:
          a         b         c         d         e
a  1.000000  0.734858  0.113787  0.999837  0.067358
b  0.734858  1.000000 -0.523635  0.722485 -0.598739
c  0.113787 -0.523635  1.000000  0.129945  0.984257
d  0.999837  0.722485  0.129945  1.000000  0.084615
e  0.067358 -0.598739  0.984257  0.084615  1.000000
In [32]: c[c >= 0.7]
Out[32]:
          a         b         c         d         e
a  1.000000  0.734858       NaN  0.999837       NaN
b  0.734858  1.000000       NaN  0.722485       NaN
c       NaN       NaN  1.000000       NaN  0.984257
d  0.999837  0.722485       NaN  1.000000       NaN
e       NaN       NaN  0.984257       NaN  1.000000
In [33]: c[c >= 0.7].stack().reset_index(name='cor').query("abs(cor) < 1.0")
Out[33]:
   level_0 level_1       cor
1        a       b  0.734858
2        a       d  0.999837
3        b       a  0.734858
5        b       d  0.722485
7        c       e  0.984257
8        d       a  0.999837
9        d       b  0.722485
11       e       c  0.984257
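The listing above reports every pair twice, as (a, b) and (b, a), and a `c >= 0.7` mask would skip strongly negative correlations, which indicate collinearity just as much as positive ones. Here is a minimal sketch of one way to handle both. It is not part of the original answer: the helper name `correlated_pairs`, the column names `feature_1` / `feature_2`, and the small `demo` frame are made up for illustration.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    """Return each feature pair once if its |Pearson r| is >= threshold.

    Hypothetical helper for illustration only.
    """
    corr = df.corr()
    # Strict upper triangle: drops the diagonal (always 1.0) and keeps
    # only one of the two mirrored entries for every pair.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().reset_index()
    pairs.columns = ["feature_1", "feature_2", "cor"]
    # Compare absolute values so strong negative correlations are kept;
    # any NaN entries fail the comparison and are filtered out here too.
    strong = pairs[pairs["cor"].abs() >= threshold]
    # Strongest relationships first, regardless of sign.
    strong = strong.sort_values("cor", key=lambda s: s.abs(), ascending=False)
    return strong.reset_index(drop=True)


# Made-up data: b is roughly 2 * a, d is exactly -a, c is only loosely related.
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 3, 4, 1, 2],
    "d": [-1, -2, -3, -4, -5],
})
result = correlated_pairs(demo, threshold=0.9)
print(result)
```

With this made-up data the a/d pair has a correlation of -1, so it appears in `result` even though a plain `>= threshold` comparison would leave it out. Note that pairwise correlation only detects relationships between two features at a time; a feature that is a combination of several others may not show up this way.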