Chi-square test for multiple features in Pandas

Question

I have a sample data frame like this:

import pandas as pd

m_list = ['male','male','female','female']
whiskey_list = ['alcohol','no_alcohol','alcohol','no_alcohol']
f1 = [273,62,60,7]
f2 = [276,61,57,8]
l = [m_list,whiskey_list,f1,f2]
test_df = pd.DataFrame(l).T
test_df.columns = ['gender','drink_category','f1','f2']


    gender  drink_category  f1  f2
0   male    alcohol         273 276
1   male    no_alcohol      62  61
2   female  alcohol         60  57
3   female  no_alcohol      7   8

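I want to use a chi-square test to check whether there is any relationship between two categorical variables, gender and drink_category. To do this, I want to build a contingency table for each feature in f1, f2, ..., fn and then compute a p-value for each feature.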

The example here has only two features, f1 and f2, but in general I have many.

When I process f1, my contingency table looks like this:

gender   alcohol   no_alcohol
male      273        62
female    60         7

Then I would compute the p-value for f1.

When I process f2, my contingency table looks like this:

gender   alcohol   no_alcohol
male      276        61
female    57         8

How can I compute this using the pandas and scipy libraries?

In the end, I want a data frame that holds the p-value for each feature f1 through fn.

Answer 1

We can use chi2_contingency from scipy.stats to get the p-value of a contingency table built with the pandas pivot function.

import pandas as pd
from scipy.stats import chi2_contingency

test_df = pd.DataFrame({'gender': ['male','male','female','female'],
                        'drink_category': ['alcohol','no_alcohol','alcohol','no_alcohol'],
                        'f1': [273,62,60,7],
                        'f2': [276,61,57,8]})

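# Collect one p-value per feature; an explicit dtype avoids the empty-Series dtype warning.
p = pd.Series(dtype=float)
for feature in [c for c in test_df.columns if c.startswith('f')]:
    # pivot takes keyword-only arguments in pandas 2.0 and later.
    table = test_df.pivot(index='gender', columns='drink_category', values=feature)
    _, p[feature], _, _ = chi2_contingency(table)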

print(p)

Output:

f1    0.155699
f2    0.339842
dtype: float64
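
Since the question asks for a data frame as the final result, the same loop can collect one row per feature instead of filling a Series. The following is a minimal sketch, not part of the original answer; the names feature_cols, rows, and result_df are illustrative, and it assumes every feature column starts with 'f', as in the example.

```python
import pandas as pd
from scipy.stats import chi2_contingency

test_df = pd.DataFrame({
    'gender': ['male', 'male', 'female', 'female'],
    'drink_category': ['alcohol', 'no_alcohol', 'alcohol', 'no_alcohol'],
    'f1': [273, 62, 60, 7],
    'f2': [276, 61, 57, 8],
})

# Assumption: feature columns are exactly those whose name starts with 'f'.
feature_cols = [c for c in test_df.columns if c.startswith('f')]

rows = []
for feature in feature_cols:
    # Contingency table with gender as rows and drink_category as columns.
    table = test_df.pivot(index='gender', columns='drink_category', values=feature)
    chi2, p_value, dof, _ = chi2_contingency(table)
    rows.append({'feature': feature, 'chi2': chi2, 'dof': dof, 'p_value': p_value})

# One row per feature, indexed by feature name.
result_df = pd.DataFrame(rows).set_index('feature')
print(result_df)
```

The p_value column should reproduce the values shown above. Note that chi2_contingency applies Yates' continuity correction by default when the table has one degree of freedom, which is the case for these 2x2 tables; passing correction=False gives the uncorrected test.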

Answer 1 (accepted), 2019-12-03 20:17:10