
P-value in scipy.stats does not reflect the reality

I am not sure whether this is a question for Stack Overflow or for Math Stack Exchange.

I have data about the cost of crashes of cars A, and the data about the cost of crashes of cars B.

There were 2760 crashes of type B, with a total cost of 4 255 390. The average cost of a crash of cars B was 1541.808.

Then, there were 15 992 crashes of type A, with a total cost of 19 890 980. The average cost of a crash of cars A was 1243.808.

The sample mean of the crash cost for cars A is clearly lower than the one for cars B, and I want to check whether this difference is significant using a t-test. The null hypothesis is "The means are equal". The alpha is 5%.

However, when I run the following in Python:

ttest_ind(table[B], table2[A], alternative="less", equal_var=False)

I get the result below. The p-value would indicate that the mean crash cost of cars B is NOT less than the mean crash cost of cars A, which does not make sense to me:

Ttest_indResult(statistic=3.417269886834147, pvalue=0.9996071028578007)

If, however, I run this (without the alternative argument):

ttest_ind(table[B], table2[A], equal_var=False)

I get:

Ttest_indResult(statistic=3.417269886834147, pvalue=0.0007857942843984687)

Why does the first call, which uses "alternative", produce such a high p-value? Is there something I misunderstand about p-values?

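You have your sample order inverted. With alternative="less", the first argument must be the sample you expect to have the smaller mean (here, cars A). Use instead:

ttest_ind(table2[A], table[B], alternative="less", equal_var=False)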

From the docs, under the alternative argument:

'less': the mean of the distribution underlying the first sample is less than the mean of the distribution underlying the second sample.
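
To see the effect of argument order in isolation, here is a minimal sketch on synthetic data. The arrays cost_a and cost_b, their sizes, and the gamma distributions are assumptions made up for illustration; they are not the asker's data. It also shows why the two numbers in the question fit together: swapping the samples only flips the sign of the statistic, so the two one-sided p-values add up to 1, and the two-sided p-value is twice the smaller of them.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic stand-ins for the two cost columns (hypothetical values).
rng = np.random.default_rng(0)
cost_a = rng.gamma(shape=2.0, scale=600.0, size=3000)  # lower mean, about 1200
cost_b = rng.gamma(shape=2.0, scale=775.0, size=3000)  # higher mean, about 1550

# H1: mean(first sample) < mean(second sample).
# Correct order for "A is cheaper than B": A first, B second.
res_ab = ttest_ind(cost_a, cost_b, alternative="less", equal_var=False)

# Inverted order tests "B is cheaper than A", which the data contradict.
res_ba = ttest_ind(cost_b, cost_a, alternative="less", equal_var=False)

# Default two-sided test: order affects only the sign of the statistic.
res_two = ttest_ind(cost_b, cost_a, equal_var=False)

print("A < B :", res_ab.statistic, res_ab.pvalue)
print("B < A :", res_ba.statistic, res_ba.pvalue)
print("A != B:", res_two.statistic, res_two.pvalue)
```

The same relation holds for the figures in the question: 1 - 0.9996071028578007 is about 0.000392897, which is half of the two-sided p-value 0.0007857942843984687. So the one-sided test in the intended direction would give a p-value of about 0.00039.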
