Best way to see correlation between a categorical variable and numerical variable in Python

I have a pandas dataframe which stores user ids, their salary range (one of 3 possible ranges), and the profit they generated, as below:

  user_id     salary_range     profit_amount  
 --------- ------------------ --------------- 
      123   0 - 35,000                   324  
      654   50,000 - 100,000            2083  
      129   50,000 - 100,000           20023  
      654   0 - 35,000                   699  
      398   35,000 - 49,999              298  

I would like to see if there is any correlation between a user's salary range and the profit they generate.

Typically I would use seaborn.heatmap along with df.corr(), but this only works for 2 numerical variables, and while salary is typically a numerical amount, here the range is categorical.

Personally, my method of solving this would be to rank the ranges from 1 to 3, and then generate a correlation from there. However, I believe there are other possible ways to do this, and would like to see if anybody can suggest an alternative correlation method between the range and profit.
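The ranking idea described in the question could be sketched as follows. This is only an illustration: the `rank_map` dictionary and the `salary_rank` column are names invented here, and Spearman's rank correlation is one reasonable choice for ordered categories, not the only one.

```python
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "salary_range": ["0 - 35,000", "50,000 - 100,000", "50,000 - 100,000",
                     "0 - 35,000", "35,000 - 49,999"],
    "profit_amount": [324, 2083, 20023, 699, 298],
})

# Hypothetical mapping: order the ranges from lowest (1) to highest (3)
rank_map = {"0 - 35,000": 1, "35,000 - 49,999": 2, "50,000 - 100,000": 3}
df["salary_rank"] = df["salary_range"].map(rank_map)

# Spearman correlates ranks, so it only assumes the ranges are ordered,
# not that they are evenly spaced
rho, p_value = stats.spearmanr(df["salary_rank"], df["profit_amount"])
print("Spearman rho:", round(rho, 2), "p-value:", round(p_value, 2))
```

With only five rows the p-value carries little weight; the sketch is meant to show the mechanics rather than a conclusion.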

To measure the association between a quantitative variable and a qualitative variable, you can calculate eta (the correlation ratio).

If it helps, in R you can use the etaSquared() function on an ANOVA model.
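For Python, eta squared can be computed by hand as the between-group sum of squares divided by the total sum of squares. The helper below is a minimal sketch; `eta_squared` is a name made up for this example and is not a pandas or scipy function.

```python
import pandas as pd

df = pd.DataFrame({
    "salary_range": ["0 - 35,000", "50,000 - 100,000", "50,000 - 100,000",
                     "0 - 35,000", "35,000 - 49,999"],
    "profit_amount": [324, 2083, 20023, 699, 298],
})

def eta_squared(data, category_col, value_col):
    """Share of the variance in value_col explained by category_col (hypothetical helper)."""
    grand_mean = data[value_col].mean()
    groups = data.groupby(category_col)[value_col]
    # Between-group sum of squares: group size times the squared distance
    # of each group mean from the grand mean
    ss_between = (groups.count() * (groups.mean() - grand_mean) ** 2).sum()
    # Total sum of squares around the grand mean
    ss_total = ((data[value_col] - grand_mean) ** 2).sum()
    return ss_between / ss_total

eta_sq = eta_squared(df, "salary_range", "profit_amount")
print("eta squared:", round(eta_sq, 3))
```

The result lies between 0 (the category explains none of the variance) and 1 (it explains all of it); taking the square root gives eta itself.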

I believe the correct way to get the association between salary_range and profit_amount would be a one-way ANOVA.

import pandas as pd
import numpy as np

data = {"user_id":[123,654,129,654,398],
    "salary_range":["0 - 35,000","50,000 - 100,000","50,000 - 100,000","0 - 35,000","35,000 - 49,999"],
    "profit_amount":[324,2083,20023,699,298]}

df = pd.DataFrame(data)
df

from scipy import stats

# One-way ANOVA: compare profit_amount across the three salary ranges
F, p = stats.f_oneway(df[df.salary_range=="0 - 35,000"].profit_amount,
                  df[df.salary_range=="35,000 - 49,999"].profit_amount,
                  df[df.salary_range=="50,000 - 100,000"].profit_amount)
# F is the test statistic, p is the p-value of the test
print("Statistics Values: ",np.round(F,2), "\n","P _Value        :",np.round(p,2))

Output:

Statistics Values:  0.84
P _Value        : 0.54

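If the F statistic is small and the p-value is large, there is no evidence of an association between the categorical column and the continuous column. Here F = 0.84 and p = 0.54, well above the usual 0.05 threshold, so this sample shows no significant association between salary_range and profit_amount.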
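If the range labels should not be hard-coded, the same test can be written with groupby. This is a sketch under the assumption that every salary range present in the data should become one group; the `samples` variable is introduced here for illustration.

```python
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "salary_range": ["0 - 35,000", "50,000 - 100,000", "50,000 - 100,000",
                     "0 - 35,000", "35,000 - 49,999"],
    "profit_amount": [324, 2083, 20023, 699, 298],
})

# Collect one array of profits per salary range, whatever labels are present
samples = [group.to_numpy()
           for _, group in df.groupby("salary_range")["profit_amount"]]

# Same one-way ANOVA as above, with the groups unpacked as arguments
F, p = stats.f_oneway(*samples)
print("F:", round(F, 2), "p-value:", round(p, 2))
```

This gives the same F statistic and p-value as the explicit version, and keeps working if more ranges are added later.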
