Best way to see the correlation between a categorical variable and a numerical variable in Python

Question

I have a pandas DataFrame which stores user IDs, their salary range (one of 3 possible ranges), and the profit they generated, as below:

  user_id     salary_range     profit_amount  
 --------- ------------------ --------------- 
      123   0 - 35,000                   324  
      654   50,000 - 100,000            2083  
      129   50,000 - 100,000           20023  
      654   0 - 35,000                   699  
      398   35,000 - 49,999              298

I would like to see if there is any correlation between a user's salary range and the profit they generate.

Typically I would use seaborn.heatmap along with DataFrame.corr, but this only works for 2 numerical variables, and while salary is typically a numerical amount, here the range is categorical.

Personally, my method of solving this would be to rank the ranges from 1 to 3 and then generate a correlation from there. However, I believe there are other possible ways to do this, and would like to see if anybody can suggest an alternative correlation method between the range and profit.
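The ranking idea described above can be sketched roughly as follows. This is only an illustration, assuming the three ranges are ordered from lowest to highest and that a rank-based (Spearman) correlation is acceptable; the range_rank mapping and the salary_rank column are names introduced here, not part of the original data.

```python
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "user_id": [123, 654, 129, 654, 398],
    "salary_range": ["0 - 35,000", "50,000 - 100,000", "50,000 - 100,000",
                     "0 - 35,000", "35,000 - 49,999"],
    "profit_amount": [324, 2083, 20023, 699, 298],
})

# Hypothetical ordinal encoding: lowest range -> 1, highest range -> 3.
range_rank = {"0 - 35,000": 1, "35,000 - 49,999": 2, "50,000 - 100,000": 3}
df["salary_rank"] = df["salary_range"].map(range_rank)

# Spearman works on ranks, so it only assumes the ranges are ordered,
# not that they are evenly spaced.
rho, p_value = stats.spearmanr(df["salary_rank"], df["profit_amount"])
print("Spearman rho:", round(rho, 3), "p-value:", round(p_value, 3))
```

Spearman is used here rather than Pearson because the codes 1 to 3 carry order but no meaningful distance between ranges.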

Answer 1

To measure the link between a quantitative variable and a qualitative variable, you can calculate eta (the correlation ratio).

If it helps, in R you can use the etaSquared() function on an ANOVA model.
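For Python, a similar quantity can be computed by hand. The following is a minimal sketch, assuming eta squared is taken as the between-group sum of squares divided by the total sum of squares; eta_squared is a hypothetical helper written for this example, not a pandas or SciPy function, and the data is the sample from the question.

```python
import pandas as pd

def eta_squared(values: pd.Series, groups: pd.Series) -> float:
    """Share of the variance in `values` explained by `groups` (0 to 1)."""
    grand_mean = values.mean()
    # Total sum of squares around the overall mean.
    ss_total = ((values - grand_mean) ** 2).sum()
    # Replace each value by its group mean, then measure spread of those means.
    group_means = values.groupby(groups).transform("mean")
    ss_between = ((group_means - grand_mean) ** 2).sum()
    return ss_between / ss_total

df = pd.DataFrame({
    "salary_range": ["0 - 35,000", "50,000 - 100,000", "50,000 - 100,000",
                     "0 - 35,000", "35,000 - 49,999"],
    "profit_amount": [324, 2083, 20023, 699, 298],
})

eta2 = eta_squared(df["profit_amount"], df["salary_range"])
print("eta squared:", round(eta2, 3))
```

A value near 0 means the salary range explains almost none of the variation in profit, and a value near 1 means it explains almost all of it. With only five rows the estimate is very noisy.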

Answer 2

I believe the correct way to get the association between salary_range and profit_amount would be a one-way ANOVA.

import pandas as pd
import numpy as np
from scipy import stats

# Sample data from the question
data = {"user_id":[123,654,129,654,398],
    "salary_range":["0 - 35,000","50,000 - 100,000","50,000 - 100,000","0 - 35,000","35,000 - 49,999"],
    "profit_amount":[324,2083,20023,699,298]}

df = pd.DataFrame(data)

# One-way ANOVA: pass the profit values of each salary range as a separate sample
F, p = stats.f_oneway(df[df.salary_range=="0 - 35,000"].profit_amount,
                  df[df.salary_range=="35,000 - 49,999"].profit_amount,
                  df[df.salary_range=="50,000 - 100,000"].profit_amount)
print("Statistics Values: ",np.round(F,2), "\n","P _Value        :",np.round(p,2))

Output:

Statistics Values:  0.84                                    
P _Value        : 0.54

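A small F statistic together with a large p-value means there is no evidence that the mean of the continuous column differs between the categories. Here the p-value of 0.54 is well above the usual 0.05 threshold, so for this sample we cannot conclude that salary_range and profit_amount are associated.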
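The same test can be written without hard-coding each range. This is a sketch under the assumption that every salary range should form one group and that 0.05 is an acceptable significance level; alpha and samples are names chosen for this example.

```python
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "user_id": [123, 654, 129, 654, 398],
    "salary_range": ["0 - 35,000", "50,000 - 100,000", "50,000 - 100,000",
                     "0 - 35,000", "35,000 - 49,999"],
    "profit_amount": [324, 2083, 20023, 699, 298],
})

# One array of profit values per salary range, whatever ranges are present.
samples = [group["profit_amount"].to_numpy()
           for _, group in df.groupby("salary_range")]

F, p = stats.f_oneway(*samples)

alpha = 0.05  # assumed significance level, adjust as needed
if p < alpha:
    print("Mean profit differs between salary ranges (p = %.2f)" % p)
else:
    print("No evidence that mean profit differs between salary ranges (p = %.2f)" % p)
```

Grouping with groupby keeps the code unchanged if more salary ranges are added later.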
