Chi-square test P-value from resampled method vs scipy.stats.chi2_contingency

This question refers to Chapter 3 of O'Reilly's "Practical Statistics for Data Scientists, 2nd Edition", section "Chi-Square Test".

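The book gives an example of a chi-square test in which a website has three different headlines, each shown to 1,000 visitors. The results show the number of clicks for each headline.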

The observed data are as follows:

Headline   A    B    C
Click      14   8    12
No-click   986  992  988

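The expected values are calculated as follows (34 clicks and 2,966 no-clicks, each divided evenly across the three headlines):

Headline   A        B        C
Click      11.33    11.33    11.33
No-click   988.67   988.67   988.67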

The Pearson residual is defined as: $R = \frac{\text{Observed} - \text{Expected}}{\sqrt{\text{Expected}}}$

The table of Pearson residuals is now:

Headline   A        B        C
Click      0.792    -0.990   0.198
No-click   -0.085   0.106   -0.021

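The chi-square statistic is the sum of the squared Pearson residuals over all cells of the table: $\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} R_{ij}^2$, where $r$ and $c$ are the numbers of rows and columns. Here it is 1.666.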
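The arithmetic above can be checked with a few lines of NumPy. This is only a sketch: the variable names are illustrative, and taking the row mean as the expected count works here only because every headline has the same number of visitors (1,000).

```python
import numpy as np

# Observed counts from the table above (rows: Click, No-click; columns: A, B, C).
observed = np.array([[14, 8, 12],
                     [986, 992, 988]])

# Expected counts under the null hypothesis: each row total spread evenly
# over the three headlines (valid because all column totals are equal).
expected = observed.mean(axis=1, keepdims=True) * np.ones(observed.shape)

# Pearson residuals and the chi-square statistic (sum of squared residuals).
pearson_residuals = (observed - expected) / np.sqrt(expected)
chi2_statistic = (pearson_residuals ** 2).sum()

print(np.round(expected, 2))
print(np.round(pearson_residuals, 3))
print(f'chi2 = {chi2_statistic:.4f}')
```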

So far so good. Now the resampling part:

1. Assume a box of 34 ones and 2966 zeros.
2. Shuffle, take three samples of 1000, and count how many ones (clicks) each contains.
3. Find the squared differences between the shuffled counts and the expected counts, then sum them.
4. Repeat steps 2 and 3 a few thousand times.
5. The P-value is how often the resampled sum of squared deviations exceeds the observed one.
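The five steps can also be sketched compactly with NumPy. This is an illustrative variant, not the book's code: it shuffles the whole box once per iteration and splits it into three groups of 1000, it scales each squared deviation by the expected count (as a chi-square statistic does), and the seed and helper names are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Step 1: a box of 34 ones (clicks) and 2966 zeros (no-clicks).
box = np.concatenate([np.ones(34, dtype=int), np.zeros(2966, dtype=int)])

# Expected clicks and no-clicks per headline, as a column for broadcasting.
expected = np.array([[34 / 3], [1000 - 34 / 3]])

def chi2_stat(clicks_per_headline):
    """Sum of squared deviations scaled by the expected counts."""
    counts = np.vstack([clicks_per_headline, 1000 - clicks_per_headline])
    return ((counts - expected) ** 2 / expected).sum()

observed_stat = chi2_stat(np.array([14, 8, 12]))

def resample_once():
    # Step 2: shuffle the box and split it into three groups of 1000.
    shuffled_clicks = rng.permutation(box).reshape(3, 1000).sum(axis=1)
    # Step 3: compare the shuffled counts with the expected counts.
    return chi2_stat(shuffled_clicks)

# Step 4: repeat a few thousand times.
n_resamples = 2000
resampled = np.array([resample_once() for _ in range(n_resamples)])

# Step 5: fraction of resampled statistics that exceed the observed one.
resampled_p_value = np.mean(resampled > observed_stat)
print(f'Observed chi2: {observed_stat:.4f}')
print(f'Resampled p-value: {resampled_p_value:.4f}')
```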

The Python resampling code provided with the book is as follows (it can be downloaded from https://github.com/gedeck/practical-statistics-for-data-scientists/tree/master/python/code):

## Practical Statistics for Data Scientists (Python)
## Chapter 3. Statistical Experiments and Significance Testing
# > (c) 2019 Peter C. Bruce, Andrew Bruce, Peter Gedeck

# Import required Python packages.

from pathlib import Path
import random

import pandas as pd
import numpy as np

from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats import power

import matplotlib.pylab as plt

DATA = Path('.').resolve().parents[1] / 'data'

# Define paths to data sets. If you don't keep your data in the same directory as the code, adapt the path names.

CLICK_RATE_CSV = DATA / 'click_rates.csv'

...

## Chi-Square Test
### Chi-Square Test: A Resampling Approach

# Table 3-4
click_rate = pd.read_csv(CLICK_RATE_CSV)
clicks = click_rate.pivot(index='Click', columns='Headline', values='Rate')
print(clicks)

# Table 3-5
row_average = clicks.mean(axis=1)
pd.DataFrame({
    'Headline A': row_average,
    'Headline B': row_average,
    'Headline C': row_average,
})

# Resampling approach
box = [1] * 34
box.extend([0] * 2966)
random.shuffle(box)

def chi2(observed, expected):
    pearson_residuals = []
    for row, expect in zip(observed, expected):
        pearson_residuals.append([(observe - expect) ** 2 / expect
                                  for observe in row])
    # return sum of squares
    return np.sum(pearson_residuals)

expected_clicks = 34 / 3
expected_noclicks = 1000 - expected_clicks
expected = [34 / 3, 1000 - 34 / 3]
chi2observed = chi2(clicks.values, expected)

def perm_fun(box):
    sample_clicks = [sum(random.sample(box, 1000)),
                     sum(random.sample(box, 1000)),
                     sum(random.sample(box, 1000))]
    sample_noclicks = [1000 - n for n in sample_clicks]
    return chi2([sample_clicks, sample_noclicks], expected)

perm_chi2 = [perm_fun(box) for _ in range(2000)]

resampled_p_value = sum(perm_chi2 > chi2observed) / len(perm_chi2)

print(f'Observed chi2: {chi2observed:.4f}')
print(f'Resampled p-value: {resampled_p_value:.4f}')

chisq, pvalue, df, expected = stats.chi2_contingency(clicks)
print(f'Observed chi2: {chi2observed:.4f}')
print(f'p-value: {pvalue:.4f}')

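Now, I ran perm_fun(box) 2,000 times and obtained a resampled P-value of 0.4775. However, when I ran perm_fun(box) 10,000 times and 100,000 times, I obtained a resampled P-value of 0.84 both times. It seems to me that the P-value should be around 0.84. Why does stats.chi2_contingency show such a small number?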

The result I get from running it 2,000 times is:

Observed chi2: 1.6659
Resampled p-value: 0.8300
Observed chi2: 1.6659
p-value: 0.4348

If I run it 10,000 times, the result is:

Observed chi2: 1.6659
Resampled p-value: 0.8386
Observed chi2: 1.6659
p-value: 0.4348

Software versions:

pandas.__version__:         0.25.1
numpy.__version__:          1.16.5
scipy.__version__:          1.3.1
statsmodels.__version__:    0.10.1
sys.version_info:           3.7.4

I ran your code with 2000, 10000, and 100000 loops, and all three runs came out close to 0.47. However, I did get an error on this line that I had to fix:

resampled_p_value = sum(perm_chi2 > chi2observed) / len(perm_chi2)

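Here perm_chi2 is a list and chi2observed is a float, so I wonder how this code ran for you (maybe whatever you did to fix it was the source of the error). Anyway, changing it to the intended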

resampled_p_value = sum([1*(x > chi2observed) for x in perm_chi2]) / len(perm_chi2)

allowed me to run it and get close to 0.47.
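A minimal sketch of the comparison issue, using made-up stand-in values rather than real resampling output. It assumes chi2observed is a plain Python float; if it is a NumPy scalar instead, the comparison may be handled elementwise by NumPy, so converting the list explicitly avoids depending on that.

```python
import numpy as np

# Hypothetical stand-in values, not results from an actual run.
perm_chi2 = [0.5, 1.2, 2.3, 3.1]
chi2observed = 1.6659  # plain Python float

# Comparing a plain list with a plain float is not defined in Python 3.
try:
    sum(perm_chi2 > chi2observed)
    list_comparison_failed = False
except TypeError:
    list_comparison_failed = True

# The list-comprehension fix described in this answer.
p_list = sum([1 * (x > chi2observed) for x in perm_chi2]) / len(perm_chi2)

# An equivalent elementwise comparison after converting to a NumPy array.
p_array = np.mean(np.array(perm_chi2) > chi2observed)

print(list_comparison_failed, p_list, p_array)
```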

Make sure that when you change the number of iterations, you change only the 2000 and none of the other numbers.
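As a cross-check on the number the resampled estimate is being compared against, the p-value from stats.chi2_contingency is the upper-tail probability of a chi-square distribution with (2 - 1) * (3 - 1) = 2 degrees of freedom. The sketch below rebuilds the observed table by hand instead of reading click_rates.csv, which is an assumption made to keep it self-contained.

```python
import numpy as np
from scipy import stats

# Observed counts (rows: Click, No-click; columns: Headline A, B, C).
observed = np.array([[14, 8, 12],
                     [986, 992, 988]])

chisq, pvalue, dof, expected = stats.chi2_contingency(observed)

# Upper tail of the chi-square distribution at the observed statistic.
tail = stats.chi2.sf(chisq, dof)

print(f'chi2 = {chisq:.4f}, dof = {dof}')
print(f'chi2_contingency p-value: {pvalue:.4f}')
print(f'chi2.sf tail probability: {tail:.4f}')
```

Both printed probabilities should agree with the 0.4348 reported in the question, which is in the same range as the resampled values near 0.47.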
