Best way to see correlation between a categorical variable and a numerical variable in Python
I have a pandas DataFrame that stores each user's ID, their salary range (one of 3 possible ranges), and the profit they generated, as shown below:
user_id    salary_range        profit_amount
-------    ----------------    -------------
123        0 - 35,000          324
654        50,000 - 100,000    2083
129        50,000 - 100,000    20023
654        0 - 35,000          699
398        35,000 - 49,999     298
I would like to see whether there is any correlation between a user's salary range and the profit they generate.
Typically I would use seaborn.heatmap along with DataFrame.corr, but this only works for 2 numerical variables, and while salary is typically a numerical amount, here the range is categorical.
Personally, my approach would be to rank the ranges from 1 to 3 and then compute a correlation from there. However, I believe there are other possible ways to do this, and I would like to see whether anybody can suggest an alternative way to measure the correlation between the range and the profit.
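For reference, the ranking approach described in the question might look roughly like the sketch below. The ordering stored in `order` and the use of Spearman's rank correlation through `scipy.stats.spearmanr` are assumptions made for illustration. The question does not say which correlation coefficient it has in mind.

```python
import pandas as pd
from scipy import stats

# Data copied from the table in the question.
df = pd.DataFrame({
    "salary_range": ["0 - 35,000", "50,000 - 100,000", "50,000 - 100,000",
                     "0 - 35,000", "35,000 - 49,999"],
    "profit_amount": [324, 2083, 20023, 699, 298],
})

# Assumed ordering of the three ranges, from lowest to highest.
order = ["0 - 35,000", "35,000 - 49,999", "50,000 - 100,000"]

# Map each range to a rank from 1 to 3 via an ordered categorical.
salary_rank = pd.Categorical(df["salary_range"], categories=order, ordered=True).codes + 1

# Spearman's rho works on ranks, so it suits an ordinal variable.
rho, p_value = stats.spearmanr(salary_rank, df["profit_amount"])
print(f"Spearman rho: {rho:.2f}, p-value: {p_value:.2f}")
```

With only five rows the coefficient is very unstable, so this is best read as a demonstration of the mechanics rather than a result.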
To measure the association between a quantitative variable and a qualitative variable, you need to calculate eta (the correlation ratio).
If it helps, in R you can use the function etaSquared() on an ANOVA.
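Since the question asks about Python, here is a minimal sketch of the same quantity computed with pandas. The helper name `eta_squared` is hypothetical (it is not a pandas or SciPy function), and it uses the usual definition of eta squared as the between-group sum of squares divided by the total sum of squares.

```python
import pandas as pd

# Data copied from the table in the question.
df = pd.DataFrame({
    "salary_range": ["0 - 35,000", "50,000 - 100,000", "50,000 - 100,000",
                     "0 - 35,000", "35,000 - 49,999"],
    "profit_amount": [324, 2083, 20023, 699, 298],
})

def eta_squared(data, category_col, value_col):
    """Hypothetical helper: share of the variance in value_col explained by category_col."""
    grand_mean = data[value_col].mean()
    groups = data.groupby(category_col)[value_col]
    # Between-group sum of squares: group size times squared distance of group mean from grand mean.
    ss_between = (groups.count() * (groups.mean() - grand_mean) ** 2).sum()
    # Total sum of squares around the grand mean.
    ss_total = ((data[value_col] - grand_mean) ** 2).sum()
    return ss_between / ss_total

eta2 = eta_squared(df, "salary_range", "profit_amount")
print(f"eta squared: {eta2:.3f}")
```

Eta squared lies between 0 and 1, and eta itself is its square root. The helper does not guard against a constant value column, where the total sum of squares is zero.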
I believe the correct way to measure the association between salary_range and profit_amount would be a one-way ANOVA.
import pandas as pd
import numpy as np

# Sample data from the question
data = {"user_id": [123, 654, 129, 654, 398],
        "salary_range": ["0 - 35,000", "50,000 - 100,000", "50,000 - 100,000", "0 - 35,000", "35,000 - 49,999"],
        "profit_amount": [324, 2083, 20023, 699, 298]}
df = pd.DataFrame(data)
df

from scipy import stats

# One-way ANOVA: compare mean profit across the three salary ranges
F, p = stats.f_oneway(df[df.salary_range == "0 - 35,000"].profit_amount,
                      df[df.salary_range == "35,000 - 49,999"].profit_amount,
                      df[df.salary_range == "50,000 - 100,000"].profit_amount)
print("Statistics Values: ", np.round(F, 2), "\n", "P _Value :", np.round(p, 2))
Output:
Statistics Values: 0.84
P _Value : 0.54
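If the F score is close to 0, there is no evidence of an association between the categorical column and the continuous column. Here the F score is small and the p-value of 0.54 is well above the usual 0.05 threshold, so we cannot conclude that salary range and profit are correlated in this sample.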
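If the category labels are not known in advance, the same test can be run without hard-coding them. The sketch below is one possible variation, assuming the same column names as above: it builds one array per salary range with `groupby` and passes them all to `stats.f_oneway`.

```python
import pandas as pd
from scipy import stats

# Same sample data as above.
df = pd.DataFrame({
    "user_id": [123, 654, 129, 654, 398],
    "salary_range": ["0 - 35,000", "50,000 - 100,000", "50,000 - 100,000",
                     "0 - 35,000", "35,000 - 49,999"],
    "profit_amount": [324, 2083, 20023, 699, 298],
})

# One array of profits per salary range, whatever labels are present.
groups = [g["profit_amount"].to_numpy() for _, g in df.groupby("salary_range")]

# f_oneway accepts any number of groups as positional arguments.
F, p = stats.f_oneway(*groups)
print(f"F = {F:.2f}, p = {p:.2f}")
```

This gives the same F and p as the explicit version, since the groups are identical. It only changes how they are selected.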