Can I use hypothesis testing on train and test data?

Question

I was wondering if I could use hypothesis testing on the training and testing data after splitting my dataset.

My objective is to check whether both data samples are well balanced and similarly distributed, and so will provide a good environment for the ML model to be applied.

If so, should I expect the H0 (null hypothesis) to be accepted, ie I hope the testing data is a "microcosm" of the training data?

Or

Should I expect the H1 (alternative hypothesis) to be accepted, ie for the sake of checking the "foundations" of my ML environment, should I expect to find differences between both samples?

Assuming my dataset has more than 1000 data points, and that they follow a Gaussian distribution and are independent, would a Z-test be a good strategy?

Answer 1

Yes, you can run a hypothesis test to essentially "validate" that both test and train data come from "the same distribution". To do so, you could implement a hypothesis test that sets:

H_0: Train and test data come from the same distribution
H_1: Train and test data do not come from the same distribution

To do so, you don't necessarily need to make assumptions about the shape of the data (eg that it comes from a Gaussian distribution); just pick a test appropriate for the type of data you're dealing with (categorical, numeric continuous, numeric discrete, etc). For example, you could apply the Kolmogorov–Smirnov test or the Kruskal–Wallis test (both are implemented in scipy.stats, eg scipy.stats.kstest, or scipy.stats.ks_2samp for the two-sample case, and scipy.stats.kruskal). I wouldn't recommend the Z-test (or the t-test, for that matter), as it is usually used to check whether the means of two samples are the same, not whether the samples come from the same distribution.
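As a rough illustration of the tests mentioned above, here is a minimal sketch using scipy.stats.ks_2samp and scipy.stats.kruskal on one numeric feature. The arrays are synthetic stand-ins (an assumption, not the asker's data), and the helper name compare is hypothetical. The second test sample has the same mean as the training sample but a larger spread, which is the kind of difference a test on means is not designed to detect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for one numeric column of the train and test splits.
train = rng.normal(loc=0.0, scale=1.0, size=2000)
test_same = rng.normal(loc=0.0, scale=1.0, size=500)   # same distribution as train
test_wider = rng.normal(loc=0.0, scale=3.0, size=500)  # same mean, larger spread


def compare(train_values, test_values):
    """Return the p-values of the two-sample KS test and the Kruskal-Wallis test."""
    ks = stats.ks_2samp(train_values, test_values)
    kw = stats.kruskal(train_values, test_values)
    return ks.pvalue, kw.pvalue


ks_p_same, kw_p_same = compare(train, test_same)
ks_p_wider, kw_p_wider = compare(train, test_wider)

print(f"same distribution: KS p={ks_p_same:.3g}, Kruskal-Wallis p={kw_p_same:.3g}")
print(f"wider test sample: KS p={ks_p_wider:.3g}, Kruskal-Wallis p={kw_p_wider:.3g}")
```

A small p-value is evidence against H_0 (same distribution); a large one only means the test failed to find a difference. With large samples, even tiny and practically irrelevant differences can produce small p-values, so the test statistic is worth looking at as well.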

It should be noted that although you mention test and train data as if you're comparing them on a single dimension, if you have multiple features/columns, each column of the train data should be compared separately with the corresponding column of the test data. As a real-life example, a subset of students selected "presumably randomly" from a school could have the same heights (or come from "the same distribution of heights") as the rest of the students, but they could have completely different grades.
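The per-column comparison could be sketched as follows, mirroring the students example. The column names, the synthetic values, and the helper ks_per_feature are all hypothetical; the two-sample KS test is used here simply as one reasonable choice for continuous features.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical splits: heights follow the same distribution in both,
# grades do not (the "test" students have higher, less spread-out grades).
train = {
    "height_cm": rng.normal(170, 10, 1500),
    "grade": rng.normal(70, 10, 1500),
}
test = {
    "height_cm": rng.normal(170, 10, 400),
    "grade": rng.normal(85, 5, 400),
}


def ks_per_feature(train_cols, test_cols):
    """Run a two-sample KS test on each feature shared by both splits."""
    results = {}
    for name in train_cols:
        res = stats.ks_2samp(train_cols[name], test_cols[name])
        results[name] = (res.statistic, res.pvalue)
    return results


results = ks_per_feature(train, test)
for name, (statistic, pvalue) in results.items():
    print(f"{name}: KS statistic={statistic:.3f}, p-value={pvalue:.3g}")
```

Here the grade column should be flagged clearly while the height column should not, which is exactly the situation a single overall comparison would hide.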

Finally, note that in formal hypothesis testing language you cannot "accept" a null hypothesis, only "fail to reject" it (see the related discussion on Cross Validated).

Answer 1: accepted, 2020-11-03 17:50:33
