
Why does corr() give me results only for int, uint or float columns and not for object columns?

Just to clarify, I use Python in a Jupyter Notebook.

I want to improve my data science skills, so I went back to a project that ended last week.

In this project, my goal was to build a logistic regression model. I did my data preparation, ran a feature selection, and then, to refine the model, I ran corr() and dropped the remaining features that were correlated.

But I don't think this is the optimal way to work. I think corr() should be run before the feature selection. So I tried running corr() before the feature selection, but I ran into a problem.

Here is how I did it the first time (this was after all my data preparation):

  • I created dummy variables from my categorical columns
df1=pd.get_dummies(df[[cat_cols]])

  • I concatenated them with my quantitative columns
df2=df[[cols]]

df_c=pd.concat([df1,df2],axis=1)

I fitted a logistic regression, computed the AUC and so on, and then ran a feature selection that removes low-variance features


Features = np.array(T)
Labels = np.array(z)

#T and z are my X and y

sel = fs.VarianceThreshold(threshold=(.8 * (1 - .8)))
Features_reduced = sel.fit_transform(Features)
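To illustrate what this threshold does, here is a minimal sketch on made-up data. The frame `T_demo` and its column names are hypothetical stand-ins for `T`, and `VarianceThreshold` is imported directly rather than through the `fs` alias. A 0/1 column with a share p of ones has variance p * (1 - p), so the threshold `.8 * (1 - .8)` drops dummy columns that take the same value in more than 80% of the rows. Because `fit_transform` returns a plain NumPy array, the sketch uses `get_support()` to recover the names of the kept columns, which is one way to get back to a DataFrame that `corr()` can label.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data standing in for T (dummy-encoded features).
T_demo = pd.DataFrame({
    "mostly_zero": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],  # 1 in 10% of rows
    "balanced":    [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],  # 1 in 50% of rows
})

# Same threshold as in the question: drops 0/1 columns whose variance
# p * (1 - p) is at or below 0.8 * 0.2.
sel = VarianceThreshold(threshold=0.8 * (1 - 0.8))
sel.fit(T_demo)

# get_support() returns a boolean mask over the input columns.
kept_columns = T_demo.columns[sel.get_support()]
T_reduced = T_demo[kept_columns]
print(list(kept_columns))
```

Calling `T_reduced.corr()` then restricts the correlation matrix to the selected features, whereas `T.corr()` on the original frame still covers every column.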

With the remaining features, I looked at the correlations to make a final selection

T.corr()


corr = T.corr()
fig = plt.figure(figsize=(12, 12))
ax = fig.add_subplot(111)
cax = ax.matshow(corr,cmap='coolwarm', vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,len(T.columns),1)
ax.set_xticks(ticks)
plt.xticks(rotation=90)
ax.set_yticks(ticks)
ax.set_xticklabels(T.columns)
ax.set_yticklabels(T.columns)
plt.show()

I obtained something like this :

https://image.noelshack.com/fichiers/2019/14/5/1554459054-stack.png

So far so good: my variables were of type "uint", "int" or "float", so everything worked fine.

But I think it is better to look at the correlations before modelling, so that I can reject variables early.

So I tried to run this piece of code after my data preparation but before my feature selection:

T.corr()


corr = T.corr()
fig = plt.figure(figsize=(12, 12))
ax = fig.add_subplot(111)
cax = ax.matshow(corr,cmap='coolwarm', vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(0,len(T.columns),1)
ax.set_xticks(ticks)
plt.xticks(rotation=90)
ax.set_yticks(ticks)
ax.set_xticklabels(T.columns)
ax.set_yticklabels(T.columns)
plt.show()

But some of my variables (the categorical ones) are now of type "str" rather than "uint", because I have not made dummies from them. So corr() does not work for them; it only works for the "int" and "float" columns.

I tried converting my categorical variables to the "category" dtype, but corr() did not work for them either.

I tried converting them to "int" or "float", but that cannot work because my categorical columns contain strings such as "Front_Website".
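The behaviour described above follows from what corr() computes: its default method is the Pearson coefficient, which is only defined for numeric values. Depending on the pandas version, string columns are either silently left out of the result or cause an error. The sketch below, on a hypothetical frame `df_demo` with made-up column names, selects the numeric columns explicitly so that the result does not depend on the version.

```python
import pandas as pd

# Hypothetical data: two numeric columns and one string (object) column.
df_demo = pd.DataFrame({
    "price": [10.0, 12.5, 9.0, 15.0],
    "visits": [1, 3, 2, 5],
    "channel": ["Front_Website", "Store", "Front_Website", "App"],
})

# Inspect the dtypes: "channel" is not numeric.
print(df_demo.dtypes)

# Keep only the numeric columns before computing Pearson correlations.
numeric_part = df_demo.select_dtypes(include="number")
corr_numeric = numeric_part.corr()
print(corr_numeric)
```

The string column is absent from `corr_numeric` by construction, which matches the "only int and float" result seen in the question.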

So I converted them to dummies, but now I have far too many features in my corr() output, because this is before my feature selection.

So my question is: how can I see the correlations in my dataset without converting the categorical variables to dummies first?

I just want to see the correlations between my variables from the beginning, and not only for the "int" or "float" columns.

I hope my post is clear.

Thanks.

EDIT :

I tried that

table = pd.crosstab(df['Club Member'], df['Profil Price Club'])

from scipy.stats import chi2_contingency

chi2, p, dof, expected = chi2_contingency(table.values)

print(chi2, p)

But it is very tedious to do this for every pair of my categorical columns.

Is there a way to obtain this for all my categorical columns at once?

Trying to convert string-valued categorical variables directly to int or float is futile and will raise an error. More generally, the Pearson correlation that corr() computes by default is defined for numeric values, so it is not a meaningful measure between nominal categorical variables.

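You can use a chi-square test of independence to measure the association between categorical variables. For a contingency table of two categorical columns, the relevant SciPy function is chi2_contingency (chisquare is the one-way goodness-of-fit test):

from scipy.stats import chi2_contingency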
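To avoid running the test by hand for each pair, one possible approach is to loop over all pairs of categorical columns. The sketch below is only an illustration under stated assumptions: the helper `categorical_association` and the "Channel" column are hypothetical, and the data is made up. It returns the chi-square p-value for each pair together with Cramér's V, a 0 to 1 measure of association derived from the chi-square statistic, which can be plotted like a correlation matrix.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def categorical_association(df, cat_cols):
    """Return (p_values, cramers_v) DataFrames for every pair of columns."""
    p_values = pd.DataFrame(np.nan, index=cat_cols, columns=cat_cols)
    cramers_v = pd.DataFrame(np.nan, index=cat_cols, columns=cat_cols)
    for c in cat_cols:
        cramers_v.loc[c, c] = 1.0  # a column is fully associated with itself
    for a, b in combinations(cat_cols, 2):
        table = pd.crosstab(df[a], df[b])
        # correction=False disables the Yates correction on 2x2 tables,
        # so that a perfect association gives V = 1.
        chi2, p, dof, expected = chi2_contingency(table.values, correction=False)
        n = table.values.sum()
        min_dim = min(table.shape) - 1
        v = np.sqrt(chi2 / (n * min_dim)) if min_dim > 0 else np.nan
        p_values.loc[a, b] = p_values.loc[b, a] = p
        cramers_v.loc[a, b] = cramers_v.loc[b, a] = v
    return p_values, cramers_v


# Hypothetical data; "Channel" is an invented third column.
df_demo = pd.DataFrame({
    "Club Member": ["yes", "yes", "no", "no", "yes", "no", "yes", "no"],
    "Profil Price Club": ["high", "high", "low", "low", "high", "low", "high", "low"],
    "Channel": ["web", "store", "web", "store", "web", "store", "web", "store"],
})

p_values, cramers_v = categorical_association(df_demo, list(df_demo.columns))
print(cramers_v.round(2))
```

The `cramers_v` frame can be passed to `ax.matshow` in place of `corr` in the plotting code from the question. Note that the chi-square approximation is unreliable when expected cell counts are very small, as they are in this toy example.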
