
How to control for sampling bias?

I have a theoretical question regarding confidence intervals for simple regression. Say we have a set of data (x_i, y_i). Theoretically, if we took a biased sample (i.e., we tried to determine the effect of age on BMI but only sampled professional athletes), so that all of our data points were in the tail of the true normal distribution, all our inference tools would be biased and invalid, correct? Is there a way to control for these types of issues (obviously not a mistake as obvious as my example above)? Thank you!

Short answer: Not really.

Long answer: In theory, you should randomly sample your observations from the "population". However, when your population consists of humans (as opposed to, e.g., bacteria), this might be hard to achieve. Humans have their own free will and might decline to participate in your study.

For that reason, social scientists and epidemiologists (and maybe others that I'm not aware of) like to use "representative" samples. Briefly, when collecting their sample they also collect a bunch of "control variables", which are not used for answering their actual question, but for checking that the sample is "representative". Depending on the question and your field of work, these variables might be age, sex, level of education, income, ZIP code, etc. Needless to say, the distribution of these values in the population needs to be known, e.g., through a census. You can then compare the population values with the sample values and, if they match, conclude that the sample is "representative of the population" and proceed with the actual analysis.
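One way to make the comparison above concrete is a chi-square goodness-of-fit check of the sample counts against the known population shares. This is a minimal sketch, not a prescribed procedure: the age groups, population shares, and sample counts are made-up numbers for illustration, and the function name `chi_square_gof` is hypothetical. The cutoff 5.991 is the standard chi-square critical value for 2 degrees of freedom at the 5% level.

```python
# All numbers below are hypothetical, chosen only to illustrate the check.
population_share = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}  # e.g. from a census
sample_counts = {"18-34": 62, "35-54": 71, "55+": 67}           # observed in the sample
skewed_counts = {"18-34": 120, "35-54": 50, "55+": 30}          # a clearly skewed sample


def chi_square_gof(observed, expected_share):
    """Return the chi-square goodness-of-fit statistic.

    observed: dict mapping group -> count in the sample
    expected_share: dict mapping group -> share of that group in the population
    """
    n = sum(observed.values())
    stat = 0.0
    for group, share in expected_share.items():
        expected = n * share  # count we would expect if the sample matched the population
        stat += (observed[group] - expected) ** 2 / expected
    return stat


# Critical value for 3 groups (2 degrees of freedom) at the 5% significance level.
CRITICAL_DF2_ALPHA05 = 5.991

stat = chi_square_gof(sample_counts, population_share)
skewed_stat = chi_square_gof(skewed_counts, population_share)

# A statistic below the cutoff means no detectable mismatch on this one variable.
print(f"sample statistic: {stat:.3f}, mismatch detected: {stat > CRITICAL_DF2_ALPHA05}")
print(f"skewed statistic: {skewed_stat:.3f}, mismatch detected: {skewed_stat > CRITICAL_DF2_ALPHA05}")
```

Note that passing such a check only says the sample is not detectably off on the variable that was tested; it says nothing about variables that were never compared.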

However, which variables are suitable for establishing whether the sample is representative is hard to define. Assume you want to test your new sunscreen. You collect a sample which is representative by age, sex, and income, and conclude that your sunscreen works perfectly. Alas, you have forgotten to compare the complexion of your sample with the population. If your sample consisted exclusively of dark-skinned people, it is not representative for the purpose of your study.
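The sunscreen example can be put into numbers with a small sketch. Everything here is an assumption made for illustration: the protection rates per complexion, the population shares, and the helper name `expected_rate` are hypothetical, not data from any study. The point is only that a sample drawn entirely from one complexion group estimates that group's rate, not the population's.

```python
# Hypothetical fraction of users who avoid sunburn with the sunscreen, by complexion.
protection_rate = {"light": 0.60, "dark": 0.95}

# Hypothetical complexion shares in the population and in the collected sample.
population_share = {"light": 0.70, "dark": 0.30}
sample_share = {"light": 0.0, "dark": 1.0}  # sample contains only dark-skinned people


def expected_rate(group_share):
    """Average protection rate for a group mix, weighting each complexion by its share."""
    return sum(group_share[g] * protection_rate[g] for g in protection_rate)


population_estimate = expected_rate(population_share)  # what the whole population would see
sample_estimate = expected_rate(sample_share)          # what the unrepresentative sample suggests

print(f"population-wide protection rate: {population_estimate:.3f}")
print(f"rate suggested by the sample:    {sample_estimate:.3f}")
```

The sample could match the population perfectly on age, sex, and income and still produce this gap, because the variable that drives the outcome was never compared.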

There is, to my knowledge, no universal recipe for choosing the control variables. You might check for complexion, but it might be that people with a certain genetic variation have an allergic reaction to a compound in your sunscreen. If that variation is not uncommon in the population, but you have none in your sample, the sample is again not representative.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.
