Why does corr() only give me results for int, uint or float columns and not for object columns?
Just to clarify, I use Python in a Jupyter Notebook.
I want to improve my data science skills, so I went back over a project that ended last week.
In this project, my goal was to build a logistic regression. I did my data preparation, then a feature selection, and finally, to refine my model, I ran corr() and removed the remaining features that were correlated.
But I don't think this is the optimal way to do it. I think corr() should be run before the feature selection. So I tried running corr() before the feature selection, but I ran into a problem.
Here is how I did it the first time (this was after all my data preparation):
import pandas as pd

# cat_cols and cols are assumed to be lists of column names
df1 = pd.get_dummies(df[cat_cols])
df2 = df[cols]
df_c = pd.concat([df1, df2], axis=1)
I tried a logistic regression, AUC and so on, and then did a low-variance feature selection:
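import numpy as np
from sklearn import feature_selection as fs  # fs is assumed to be this module

# T and z are my X and y
Features = np.array(T)
Labels = np.array(z)
# Keep only features whose variance exceeds 0.8 * (1 - 0.8) = 0.16
sel = fs.VarianceThreshold(threshold=(.8 * (1 - .8)))
Features_reduced = sel.fit_transform(Features)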
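Note that fit_transform returns a plain NumPy array, so the column names are lost at this step. A minimal sketch of recovering which columns survived, using get_support() on hypothetical toy data (the column names "a", "b", "c" are made up for illustration):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: "a" is almost constant, "b" is balanced, "c" is constant
toy = pd.DataFrame({
    "a": [0] * 9 + [1],
    "b": [0, 1] * 5,
    "c": [1] * 10,
})

sel = VarianceThreshold(threshold=0.8 * (1 - 0.8))
reduced = sel.fit_transform(toy)

# get_support() returns a boolean mask over the input columns
kept = toy.columns[sel.get_support()]
print(list(kept))
```

The mask can then be used to build a reduced DataFrame, for example toy[kept], so that corr() is computed on the selected features only.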
With my remaining features, I looked at the correlations to make a final selection:
import numpy as np
import matplotlib.pyplot as plt

corr = T.corr()
fig = plt.figure(figsize=(12, 12))
ax = fig.add_subplot(111)
cax = ax.matshow(corr,cmap='coolwarm', vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,len(T.columns),1)
ax.set_xticks(ticks)
plt.xticks(rotation=90)
ax.set_yticks(ticks)
ax.set_xticklabels(T.columns)
ax.set_yticklabels(T.columns)
plt.show()
I obtained something like this:
https://image.noelshack.com/fichiers/2019/14/5/1554459054-stack.png
So far so good: my variables were of type "uint", "int" or "float", so everything worked fine.
But I think it is better to look at the correlations before modelling, so that variables can be rejected early.
So I tried to run this piece of code after my data preparation but before my feature selection:
corr = T.corr()
fig = plt.figure(figsize=(12, 12))
ax = fig.add_subplot(111)
cax = ax.matshow(corr,cmap='coolwarm', vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,len(T.columns),1)
ax.set_xticks(ticks)
plt.xticks(rotation=90)
ax.set_yticks(ticks)
ax.set_xticklabels(T.columns)
ax.set_yticklabels(T.columns)
plt.show()
But some of my variables (the categorical ones) are of type 'str', not "uint" anymore, because I haven't made dummies from them. So corr() doesn't work for them; it only works for the "int" and "float" types.
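This behaviour is expected: the correlation computed by corr() is only defined for numeric data, so string columns cannot take part in it. A small sketch on hypothetical toy data (the column names are invented for illustration) that makes the restriction explicit with select_dtypes:

```python
import pandas as pd

# Hypothetical toy data with two numeric columns and one string column
toy = pd.DataFrame({
    "price": [10.0, 12.5, 9.0, 14.0],
    "visits": [1, 3, 2, 5],
    "channel": ["Front_Website", "Store", "Store", "Front_Website"],
})

# Keep only the numeric columns before computing the correlation matrix
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()
print(corr.columns.tolist())
```

Depending on the pandas version, calling corr() directly on a frame with string columns either drops them silently or raises an error unless numeric_only=True is passed, so selecting the numeric columns explicitly is the more portable choice.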
I tried converting my categorical variables to the "category" dtype, but corr() didn't work for them either.
I tried converting them to "int" or "float", but that could not work because my categorical columns contain strings such as "Front_Website".
So I turned them into dummies, but now I have far too many features in my corr(), because this happens before my feature selection.
So my question is: how can I see the correlations in my dataset without converting the categorical variables to dummies first?
I just want to see the correlations between my variables from the start, and not only for the "int" or "float" columns.
I hope my post is clear.
Thanks.
EDIT:
I tried this:
table = pd.crosstab(df['Club Member'], df['Profil Price Club'])
from scipy.stats import chi2_contingency
chi2, p, dof, expected = chi2_contingency(table.values)
print(chi2, p)
But it's very tedious to do this for every pair of categorical columns.
Is there a way to get this for all my categorical columns at once?
Converting categorical variables into dummy variables does not help here, and casting them straight to int or float will throw an error. Also, it does not make sense to compute a correlation coefficient between categorical variables.
You can use a chi-square test to measure the association between categorical variables, using this function from scipy:
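# chi2_contingency tests independence on a contingency table (chisquare is the one-way goodness-of-fit test)
from scipy.stats import chi2_contingency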
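To run the chi2_contingency test from the question's edit over every pair of categorical columns at once, one option is to loop over the column pairs and collect the results in a table. The sketch below is only an illustration: the helper name pairwise_chi2, the toy data and the extra Cramér's V column (a 0 to 1 rescaling of the chi-square statistic, not mentioned in the original post) are all assumptions.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def pairwise_chi2(df, cat_cols):
    """Chi-square test of independence for every pair of columns in cat_cols."""
    rows = []
    for a, b in combinations(cat_cols, 2):
        table = pd.crosstab(df[a], df[b])
        chi2, p, dof, _ = chi2_contingency(table.values)
        n = table.values.sum()
        k = min(table.shape) - 1
        # Cramér's V; undefined when one of the columns has a single level
        v = np.sqrt(chi2 / (n * k)) if k > 0 else np.nan
        rows.append({"var_1": a, "var_2": b, "chi2": chi2,
                     "p_value": p, "dof": dof, "cramers_v": v})
    return pd.DataFrame(rows)


# Hypothetical toy data reusing the column names from the question
toy = pd.DataFrame({
    "Club Member": ["yes", "no"] * 4,
    "Profil Price Club": ["high", "low"] * 4,
    "Channel": ["Front_Website", "Store", "Store", "Front_Website"] * 2,
})

result = pairwise_chi2(toy, list(toy.columns))
print(result)
```

On real data, cat_cols could be taken from df.select_dtypes(include="object").columns. The resulting table can be sorted by p_value or cramers_v to spot strongly associated pairs before any dummy encoding is done.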