Why does corr() give me results with only int, uint or float type and not with object type?

Just to clarify, I use Python in a Jupyter Notebook.

I want to improve my data science skills, so I picked up a project that ended last week.

In this project, my goal was to build a logistic regression. I did my data preparation, then a feature selection, and after all that, to refine my model, I ran corr() and removed the remaining features that were correlated.

But I think this is not the optimal way to do the work. I think corr() should be run before the feature selection. So I tried to run corr() before the feature selection, but I ran into a problem.

Here is how I did it the first time (this was after all my data preparation):

  • I created dummy variables from my categorical columns
df1 = pd.get_dummies(df[cat_cols])  # cat_cols: list of categorical column names (assumed)

  • I concatenated them with my quantitative columns
df2 = df[cols]  # cols: list of quantitative column names (assumed)

df_c=pd.concat([df1,df2],axis=1)

I fitted a logistic regression, computed the AUC and so on, and then did a feature selection that removes low-variance features.


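import numpy as np
import sklearn.feature_selection as fs  # assumed meaning of the `fs` alias

# T and z are my X and y
Features = np.array(T)
Labels = np.array(z)

# Drop features whose variance does not exceed 0.8 * (1 - 0.8) = 0.16
sel = fs.VarianceThreshold(threshold=(.8 * (1 - .8)))
Features_reduced = sel.fit_transform(Features)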
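One drawback of working on a NumPy array here is that the column names are lost. A minimal sketch of how the surviving column names could be recovered with `get_support()`, assuming scikit-learn's `VarianceThreshold`; the toy frame `X` and its column names are hypothetical, not the asker's data:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy frame of 0/1 dummy columns (not the asker's data)
X = pd.DataFrame({
    "mostly_zero": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],  # variance 0.09
    "balanced":    [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],  # variance 0.25
    "constant":    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  # variance 0.0
})

# Same threshold as in the question: 0.8 * (1 - 0.8) = 0.16
sel = VarianceThreshold(threshold=0.8 * (1 - 0.8))
X_reduced = sel.fit_transform(X)

# get_support() returns a boolean mask over the input columns
kept_columns = X.columns[sel.get_support()].tolist()
print(kept_columns)
```

Only the column whose variance exceeds the threshold is kept, and its name is still available for the later correlation plot.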

With my remaining features, I looked at the correlations to make a final selection:

T.corr()


corr = T.corr()
fig = plt.figure(figsize=(12, 12))
ax = fig.add_subplot(111)
cax = ax.matshow(corr,cmap='coolwarm', vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,len(T.columns),1)
ax.set_xticks(ticks)
plt.xticks(rotation=90)
ax.set_yticks(ticks)
ax.set_xticklabels(T.columns)
ax.set_yticklabels(T.columns)
plt.show()
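Reading correlated pairs off a heatmap gets hard with many columns. As a rough sketch, the same correlation matrix can be scanned programmatically; the frame `T_demo`, its column names and the 0.9 cut-off are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame (not the asker's data)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
T_demo = pd.DataFrame({
    "a": a,
    "a_copy": a * 2 + 0.01 * rng.normal(size=200),  # almost collinear with "a"
    "b": rng.normal(size=200),                      # independent noise
})

corr = T_demo.corr().abs()

# Keep only the strict upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cut-off, tune it for your data
# A column is flagged if it is highly correlated with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)
```

This flags the later column of each highly correlated pair, which is one common convention; which of the two columns to drop is ultimately a modelling decision.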

I obtained something like this:

https://image.noelshack.com/fichiers/2019/14/5/1554459054-stack.png

So far so good: my variables were of type "uint", "int" or "float", so everything worked fine.

But I think it's better to look at the correlations before modelling, so that I can reject variables early.

So I ran this piece of code after my data preparation but before my feature selection:

T.corr()


corr = T.corr()
fig = plt.figure(figsize=(12, 12))
ax = fig.add_subplot(111)
cax = ax.matshow(corr,cmap='coolwarm', vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,len(T.columns),1)
ax.set_xticks(ticks)
plt.xticks(rotation=90)
ax.set_yticks(ticks)
ax.set_xticklabels(T.columns)
ax.set_yticklabels(T.columns)
plt.show()

But some of my variables (the categorical ones) are of type 'str', not "uint" anymore, because I didn't create dummies from them. So corr() didn't work for them; it only works for the "int" and "float" types.
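For context, corr() computes a numeric correlation (Pearson by default), so string columns cannot take part in it. Depending on the pandas version, non-numeric columns are either silently skipped or cause an error. A small sketch that makes the restriction explicit with `select_dtypes`; the frame and its column names are made up:

```python
import pandas as pd

# Hypothetical mixed-type frame; column names are made up
df_demo = pd.DataFrame({
    "price": [10.0, 12.5, 9.0, 20.0],
    "visits": [1, 3, 2, 8],
    "channel": ["Front_Website", "Shop", "Shop", "Front_Website"],
})

# "channel" holds strings, so it cannot enter a numeric correlation
print(df_demo.dtypes)

# Restrict to numeric columns explicitly before calling corr()
numeric_corr = df_demo.select_dtypes(include="number").corr()
print(numeric_corr.columns.tolist())
```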

I tried converting my categorical variables to the "category" dtype, but corr() didn't work for them either.

I tried converting them to "int" or "float", but that cannot work because my categorical columns contain strings like "Front_Website" and so on.

So I converted them to dummies, but now I have far too many features in my corr() output because this is before my feature selection.

So my question is: how can I see the correlations in my dataset without converting the categorical columns to dummies first?

I just want to see the correlation between my variables from the beginning, and not only for the "int" or "float" types.

I hope my post is clear.

Thanks.

EDIT:

I tried this:

table = pd.crosstab(df['Club Member'], df['Profil Price Club'])

from scipy.stats import chi2_contingency

chi2, p, dof, expected = chi2_contingency(table.values)

print(chi2, p)

But it's very tedious to do this for all my categorical columns.

Is there a way to obtain this for all my categorical columns at once?
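One possible way to avoid repeating the crosstab by hand is to loop over every pair of categorical columns. This is only a sketch: the helper `pairwise_chi2`, the toy data and the "Channel" column are hypothetical, while `pd.crosstab` and `chi2_contingency` are used as in the snippet above.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import chi2_contingency


def pairwise_chi2(df, cat_cols):
    """Run chi2_contingency on every pair of the given categorical columns."""
    rows = []
    for col_a, col_b in combinations(cat_cols, 2):
        table = pd.crosstab(df[col_a], df[col_b])
        chi2, p, dof, _ = chi2_contingency(table.values)
        rows.append({"col_a": col_a, "col_b": col_b,
                     "chi2": chi2, "p": p, "dof": dof})
    return pd.DataFrame(rows)


# Hypothetical data: the second column mirrors the first, the third is unrelated
df_demo = pd.DataFrame({
    "Club Member": ["yes", "yes", "no", "no"] * 10,
    "Profil Price Club": ["high", "high", "low", "low"] * 10,
    "Channel": ["web", "shop"] * 20,
})

result = pairwise_chi2(df_demo, ["Club Member", "Profil Price Club", "Channel"])
print(result)
```

The result has one row per pair, so it can be sorted by `p` to find the most strongly associated columns. With many pairs, keep in mind that some small p-values will appear by chance.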

Converting categorical variables to dummy variables just to correlate them is futile, and casting string categories directly to int or float will throw an error. It also does not make much sense to compute a correlation coefficient between categorical variables.

You can use chi-square analysis to find the association between categorical variables, using this function:

from scipy.stats import chi2_contingency
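If a single correlation-like number between 0 and 1 is wanted for two categorical columns, one common choice built on the chi-square statistic is Cramér's V. A minimal sketch, assuming scipy's `chi2_contingency` (the test of independence on a contingency table); the helper `cramers_v` and the toy series are hypothetical:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(x, y):
    """Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    table = pd.crosstab(x, y)
    # correction=False: use the plain chi-square statistic, without Yates' correction
    chi2 = chi2_contingency(table.values, correction=False)[0]
    n = table.values.sum()
    k = min(table.shape) - 1  # assumes both columns have at least 2 categories
    return float(np.sqrt(chi2 / (n * k)))


# Hypothetical series: `mirror` is fully determined by `member`, `other` is unrelated
member = pd.Series(["yes", "yes", "no", "no"] * 10)
mirror = pd.Series(["high", "high", "low", "low"] * 10)
other = pd.Series(["web", "shop"] * 20)

v_strong = cramers_v(member, mirror)
v_none = cramers_v(member, other)
print(v_strong, v_none)
```

Values near 1 indicate a strong association and values near 0 indicate none. Unlike a correlation coefficient, the measure has no sign.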

