Python, Pandas, and the chi-squared test of independence

Question

I am quite new to Python as well as to statistics. I'm trying to apply the chi-squared test to determine whether previous success affects a person's level of change (percentage-wise this does seem to be the case, but I wanted to see whether my results were statistically significant).

My question is: did I do this correctly? My results say the p-value is 0.0, which means that there is a significant relationship between my variables (which is what I want, of course, but 0 seems a little too perfect for a p-value, so I'm wondering whether I made a coding mistake).

Here's what I did:

import numpy as np
import pandas as pd
import scipy.stats as stats

d = {'Previously Successful' : pd.Series([129.3, 182.7, 312], index=['Yes - changed strategy', 'No', 'col_totals']),
 'Previously Unsuccessful' : pd.Series([260.17, 711.83, 972], index=['Yes - changed strategy', 'No', 'col_totals']),
 'row_totals' : pd.Series([(129.3+260.17), (182.7+711.83), (312+972)], index=['Yes - changed strategy', 'No', 'col_totals'])}

total_summarized = pd.DataFrame(d)

observed = total_summarized.iloc[0:2, 0:2]

Output: Observed

expected = np.outer(total_summarized["row_totals"][0:2],
                    total_summarized.loc["col_totals"][0:2])/1000

expected = pd.DataFrame(expected)

expected.columns = ["Previously Successful","Previously Unsuccessful"]
expected.index = ["Yes - changed strategy","No"]

chi_squared_stat = (((observed-expected)**2)/expected).sum().sum()

print(chi_squared_stat)

crit = stats.chi2.ppf(q = 0.95, # Find the critical value for 95% confidence
                      df = 8)

print("Critical value")
print(crit)

p_value = 1 - stats.chi2.cdf(x=chi_squared_stat,  # Find the p-value
                         df=8)
print("P value")
print(p_value)

stats.chi2_contingency(observed= observed)

Output: Statistics

Answer 1

A few corrections:

Your expected array is not correct. You must divide by observed.sum().sum(), which is 1284, not 1000.
For a 2x2 contingency table such as this, the number of degrees of freedom is 1, not 8.
Your calculation of chi_squared_stat does not include a continuity correction. (But it isn't necessarily wrong not to use it; that's a judgment call for the statistician.)
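To make the three corrections concrete, here is a minimal sketch of the manual calculation with them applied. It uses plain NumPy arrays rather than the question's DataFrame, and the variable names (chi2_plain, chi2_yates and so on) are my own. The Yates formula is written in its textbook form, which assumes every |observed - expected| is at least 0.5, as is the case for this table.

```python
import numpy as np
from scipy import stats

# Observed 2x2 counts from the question (without the totals row/column).
observed = np.array([[129.3, 260.17],
                     [182.7, 711.83]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
n = observed.sum()  # grand total, about 1284 (not 1000)

# Correction 1: expected counts are (row total * column total) / grand total.
expected = np.outer(row_totals, col_totals) / n

# Correction 2: degrees of freedom for an r x c table is (r - 1) * (c - 1).
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Statistic without a continuity correction.
chi2_plain = ((observed - expected) ** 2 / expected).sum()

# Correction 3 (optional): Yates' continuity correction shrinks each
# |O - E| by 0.5 before squaring. Assumes every |O - E| >= 0.5.
chi2_yates = ((np.abs(observed - expected) - 0.5) ** 2 / expected).sum()

# Upper-tail p-values from the chi-squared distribution.
p_plain = stats.chi2.sf(chi2_plain, dof)
p_yates = stats.chi2.sf(chi2_yates, dof)

print(dof, chi2_plain, p_plain)
print(dof, chi2_yates, p_yates)
```

Both statistics should agree with what chi2_contingency reports for the same table, with and without correction=False.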

All the calculations that you perform (expected matrix, statistic, degrees of freedom, p-value) are computed by chi2_contingency:

In [65]: observed
Out[65]: 
                        Previously Successful  Previously Unsuccessful
Yes - changed strategy                  129.3                   260.17
No                                      182.7                   711.83

In [66]: from scipy.stats import chi2_contingency

In [67]: chi2, p, dof, expected = chi2_contingency(observed)

In [68]: chi2
Out[68]: 23.383138325890453

In [69]: p
Out[69]: 1.3273696199438626e-06

In [70]: dof
Out[70]: 1

In [71]: expected
Out[71]: 
array([[  94.63757009,  294.83242991],
       [ 217.36242991,  677.16757009]])

By default, chi2_contingency uses a continuity correction when the contingency table is 2x2. If you prefer not to use the correction, you can disable it with the argument correction=False:

In [73]: chi2, p, dof, expected = chi2_contingency(observed, correction=False)

In [74]: chi2
Out[74]: 24.072616672232893

In [75]: p
Out[75]: 9.2770200776879643e-07
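A side note on the 0.0 in the question: besides the two mistakes above, computing the p-value as 1 - cdf can lose a very small tail probability to floating-point rounding. The sketch below reproduces the question's statistic on purpose, mistakes included, and compares 1 - stats.chi2.cdf with the survival function stats.chi2.sf. The names wrong_expected, p_naive and p_sf are mine, and this is an illustration of the rounding effect rather than a recommended analysis.

```python
import numpy as np
from scipy import stats

# Observed 2x2 counts from the question.
observed = np.array([[129.3, 260.17],
                     [182.7, 711.83]])

# Deliberately repeat the question's calculation, including dividing by
# 1000 instead of the grand total, to see where the 0.0 comes from.
wrong_expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / 1000
stat = ((observed - wrong_expected) ** 2 / wrong_expected).sum()

# With df=8 (also from the question), the tail probability is tiny.
p_naive = 1 - stats.chi2.cdf(stat, df=8)  # subtracting from 1 loses the tiny tail
p_sf = stats.chi2.sf(stat, df=8)          # survival function computes the tail directly

print(stat, p_naive, p_sf)
```

So even when a p-value really is extremely small, sf is the safer way to get it than 1 - cdf.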

Answer 2

Degrees of freedom = (rows - 1) x (columns - 1). For a 2x2 table this is (2 - 1) x (2 - 1) = 1.
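As a quick check of that formula, here is a small sketch with a helper that I've called contingency_dof (a hypothetical name, not part of SciPy). It is compared against the degrees of freedom that chi2_contingency returns for a 3x4 table of placeholder counts, which are made up purely for the check.

```python
import numpy as np
from scipy.stats import chi2_contingency

def contingency_dof(table):
    """Degrees of freedom for an r x c contingency table: (r - 1) * (c - 1)."""
    n_rows, n_cols = np.asarray(table).shape
    return (n_rows - 1) * (n_cols - 1)

# Hypothetical 3x4 table of placeholder counts, used only to check the formula.
placeholder = np.arange(1, 13).reshape(3, 4)

# chi2_contingency returns (statistic, p-value, dof, expected).
_, _, dof_scipy, _ = chi2_contingency(placeholder)

print(contingency_dof(placeholder), dof_scipy)                  # 6 6
print(contingency_dof([[129.3, 260.17], [182.7, 711.83]]))      # 1
```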

Answer 1: 12, accepted, 2017-05-14 13:04:25

Answer 2: -1, 2020-03-16 09:11:35