Chi-square test p-value from the resampling method vs scipy.stats.chi2_contingency
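This question refers to Chapter 3 of O'Reilly's "Practical Statistics for Data Scientists, 2nd Edition", the section on the chi-square test.
The book gives an example use case for the chi-square test: a website tries three different headlines, each shown to 1,000 visitors. The results show the number of clicks for each headline.
The observed data are as follows: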
Headline A B C
Click 14 8 12
No-click 986 992 988
The expected values are calculated as follows:
Headline A B C
Click 11.33 11.33 11.33
No-click 988.67 988.67 988.67
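As a quick check of the expected table, here is a minimal sketch that averages each row of the observed counts across the three headlines (34 / 3 ≈ 11.33 clicks per headline). The variable names are illustrative and not taken from the book.

```python
import numpy as np

# Observed counts: rows are Click / No-click, columns are headlines A, B, C.
observed = np.array([[14, 8, 12],
                     [986, 992, 988]])

# Under the null hypothesis all headlines share one click rate, so the
# expected count in each cell is the mean of its row.
row_means = observed.mean(axis=1, keepdims=True)
expected = np.tile(row_means, (1, observed.shape[1]))
print(expected.round(2))
```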
The Pearson residuals, (observed − expected) / √expected, are then:
Headline A B C
Click 0.792 -0.990 0.198
No-click -0.085 0.106 -0.021
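The chi-square statistic is the sum of the squared Pearson residuals: $\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$, where $O_{ij}$ and $E_{ij}$ are the observed and expected counts. Here it is 1.666.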
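The residuals and the statistic can be reproduced with a few lines of NumPy. This is only a sketch using the counts from the tables above, with variable names of my own choosing.

```python
import numpy as np

observed = np.array([[14, 8, 12],
                     [986, 992, 988]], dtype=float)
# Row means as a column vector; broadcasting repeats them across headlines.
expected = observed.mean(axis=1, keepdims=True)

# Pearson residual: (observed - expected) / sqrt(expected)
residuals = (observed - expected) / np.sqrt(expected)

# Chi-square statistic: sum of the squared residuals
chi2_stat = (residuals ** 2).sum()

print(residuals.round(3))
print(f'chi2: {chi2_stat:.4f}')
```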
So far so good. Now for the resampling part:
1. Assume a box containing 34 ones (clicks) and 2,966 zeros (no clicks).
2. Shuffle, take three samples of 1,000, and count the ones (clicks) in each.
3. Find the squared differences between the shuffled counts and the expected counts, then sum them.
4. Repeat steps 2 and 3 a few thousand times.
5. The p-value is the proportion of resamples in which the resampled sum of squared deviations exceeds the observed one.
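The five steps can be sketched in NumPy as follows. This is one reading of step 2 (shuffle the box once and split it into three groups of 1,000), which is not the same as the three independent `random.sample` calls in the book's code, so the two need not give identical p-values. The seed, the helper name `chi2_stat`, and the tie tolerance are my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

box = np.array([1] * 34 + [0] * 2966)             # step 1
expected = np.array([[34 / 3], [1000 - 34 / 3]])  # expected clicks / no-clicks


def chi2_stat(clicks):
    """Sum of squared Pearson residuals for three click counts out of 1000 each."""
    clicks = np.asarray(clicks)
    counts = np.vstack([clicks, 1000 - clicks])
    return ((counts - expected) ** 2 / expected).sum()


observed_stat = chi2_stat([14, 8, 12])

n_resamples = 2000
perm_stats = np.empty(n_resamples)
for i in range(n_resamples):                        # step 4
    shuffled = rng.permutation(box)                 # step 2: shuffle ...
    clicks = shuffled.reshape(3, 1000).sum(axis=1)  # ... and split into three samples
    perm_stats[i] = chi2_stat(clicks)               # step 3

# Step 5: the small tolerance stops exact ties from counting as
# "exceeding" the observed value because of floating-point rounding.
p_value = np.mean(perm_stats > observed_stat + 1e-9)
print(f'Observed chi2: {observed_stat:.4f}')
print(f'Resampled p-value: {p_value:.4f}')
```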
The Python resampling test code provided by the book is as follows (it can be downloaded from https://github.com/gedeck/practical-statistics-for-data-scientists/tree/master/python/code):
## Practical Statistics for Data Scientists (Python)
## Chapter 3. Statistial Experiments and Significance Testing
# > (c) 2019 Peter C. Bruce, Andrew Bruce, Peter Gedeck
# Import required Python packages.
from pathlib import Path
import random
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats import power
import matplotlib.pylab as plt
DATA = Path('.').resolve().parents[1] / 'data'
# Define paths to data sets. If you don't keep your data in the same directory as the code, adapt the path names.
CLICK_RATE_CSV = DATA / 'click_rates.csv'
...
## Chi-Square Test
### Chi-Square Test: A Resampling Approach
# Table 3-4
click_rate = pd.read_csv(CLICK_RATE_CSV)
clicks = click_rate.pivot(index='Click', columns='Headline', values='Rate')
print(clicks)
# Table 3-5
row_average = clicks.mean(axis=1)
pd.DataFrame({
    'Headline A': row_average,
    'Headline B': row_average,
    'Headline C': row_average,
})
# Resampling approach
box = [1] * 34
box.extend([0] * 2966)
random.shuffle(box)
def chi2(observed, expected):
    pearson_residuals = []
    for row, expect in zip(observed, expected):
        pearson_residuals.append([(observe - expect) ** 2 / expect
                                  for observe in row])
    # return sum of squares
    return np.sum(pearson_residuals)
expected_clicks = 34 / 3
expected_noclicks = 1000 - expected_clicks
expected = [34 / 3, 1000 - 34 / 3]
chi2observed = chi2(clicks.values, expected)
def perm_fun(box):
    sample_clicks = [sum(random.sample(box, 1000)),
                     sum(random.sample(box, 1000)),
                     sum(random.sample(box, 1000))]
    sample_noclicks = [1000 - n for n in sample_clicks]
    return chi2([sample_clicks, sample_noclicks], expected)
perm_chi2 = [perm_fun(box) for _ in range(2000)]
resampled_p_value = sum(perm_chi2 > chi2observed) / len(perm_chi2)
print(f'Observed chi2: {chi2observed:.4f}')
print(f'Resampled p-value: {resampled_p_value:.4f}')
chisq, pvalue, df, expected = stats.chi2_contingency(clicks)
print(f'Observed chi2: {chi2observed:.4f}')
print(f'p-value: {pvalue:.4f}')
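Now, I ran perm_fun(box) 2,000 times and obtained a resampled p-value of 0.4775. However, when I ran perm_fun(box) 10,000 times and 100,000 times, I obtained a resampled p-value of 0.84 both times. It seems to me that the p-value should be around 0.84. Why does stats.chi2_contingency report such a small number?
The results I got from running it 2,000 times are: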
Observed chi2: 1.6659
Resampled p-value: 0.8300
Observed chi2: 1.6659
p-value: 0.4348
If I run it 10,000 times, the results are:
Observed chi2: 1.6659
Resampled p-value: 0.8386
Observed chi2: 1.6659
p-value: 0.4348
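For reference, the 0.4348 printed above is what the chi-square distribution with (2 − 1) × (3 − 1) = 2 degrees of freedom gives for the observed statistic. A minimal sketch, assuming the observed table from the question, checks this with `scipy.stats.chi2.sf`:

```python
from scipy import stats

observed = [[14, 8, 12],
            [986, 992, 988]]

# chi2_contingency returns the statistic, the p-value, the degrees of
# freedom and the expected frequencies, in that order.
chisq, pvalue, dof, expected_freq = stats.chi2_contingency(observed)

# The same p-value from the chi-square survival function.
p_from_sf = stats.chi2.sf(chisq, dof)

print(f'chi2: {chisq:.4f}, dof: {dof}')
print(f'p-value: {pvalue:.4f}, via chi2.sf: {p_from_sf:.4f}')
```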
Software versions:
pandas.__version__: 0.25.1
numpy.__version__: 1.16.5
scipy.__version__: 1.3.1
statsmodels.__version__: 0.10.1
sys.version_info: 3.7.4
I ran your code with 2,000, 10,000, and 100,000 iterations, and all three runs came out close to 0.47. However, I did hit an error on this line, which I had to fix:
resampled_p_value = sum(perm_chi2 > chi2observed) / len(perm_chi2)
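Here perm_chi2 is a list and chi2observed is a float, so I wonder how this code ran for you (perhaps whatever you did to make it run is the source of the discrepancy). In any case, changing it to the intended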
resampled_p_value = sum([1*(x > chi2observed) for x in perm_chi2]) / len(perm_chi2)
let me run it and get close to 0.47.
When you change the number of iterations, make sure you change only the 2000 and none of the other numbers.
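The list-comprehension fix above can also be written by converting the list to a NumPy array before comparing, so that the comparison is explicitly element-wise. The values in `perm_chi2` below are hypothetical placeholders, not real resampling output.

```python
import numpy as np

perm_chi2 = [0.5, 2.1, 1.7, 0.9]  # hypothetical resampled statistics
chi2observed = 1.6659

# Element-wise comparison gives a boolean array; its mean is the
# fraction of resampled statistics that exceed the observed one.
resampled_p_value = np.mean(np.asarray(perm_chi2) > chi2observed)
print(f'Resampled p-value: {resampled_p_value:.4f}')
```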