
R: Adjusting an explanatory variable's distribution to a known non-normal distribution

I have data for a sample of the US population. The sample dataset has N = 10,000 records. Each record has a quantitative explanatory variable E, a price that affects the probability R that a person returns a purchased item. The sample and the population need to have similar distributions of E for statistical models linking E to R to be valid.

There is a significant discrepancy between the frequency distributions of E in the US population and in the sample (see the summary below). In particular, a normal distribution does not seem to describe the population distribution well.

Value of E    Population distribution of E    Sample distribution of E
0-10          56.57%                          92.95%
10.01-20       6.90%                           1.19%
20.01-30       8.29%                           1.38%
30.01-40       5.87%                           0.85%
40.01-50       8.18%                           0.32%
50.01-60       4.63%                           0.48%
60.01-70       1.34%                           0.32%
70.01-80       1.50%                           0.08%
80.01-90       0.29%                           0.49%
90.01-100      3.72%                           1.12%
100.01-110     2.10%                           0.69%
110.01-120     0.24%                           0.00%
120.01+        0.35%                           0.13%

What are good things to do in R to make the sample's distribution of E closer to the population's, ideally matching it? I have tried filtering out sample rows with low E values, to no avail. I am also not sure which transformations to use, since most of the common transformations try to fit data to a normal distribution, which does not seem applicable here.

My own view is that transformations of E (possibly including weightings) are permissible, deletion of rows is borderline acceptable, and creation of new rows is forbidden. I would appreciate any input on which operations are usually considered permissible in contexts similar to mine.
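One common reading of "weightings" here is post-stratification: each row gets a weight equal to the population share of its E bin divided by the sample share of that bin. The base R sketch below is only an illustration under that assumption. The shares are copied from the table above, and `breaks`, `bin_weight` and `poststrat_weight` are hypothetical names, not part of any package. The 110.01-120 bin is empty in the sample, so no weight can recover it and it is left as NA.

```r
# Bin edges matching the table: [0,10], (10,20], ..., (110,120], (120,Inf]
breaks <- c(0, seq(10, 120, by = 10), Inf)

# Shares in percent, copied from the table in the question
pop_pct  <- c(56.57, 6.90, 8.29, 5.87, 8.18, 4.63, 1.34,
              1.50, 0.29, 3.72, 2.10, 0.24, 0.35)
samp_pct <- c(92.95, 1.19, 1.38, 0.85, 0.32, 0.48, 0.32,
              0.08, 0.49, 1.12, 0.69, 0.00, 0.13)

# Post-stratification weight per bin: population share / sample share.
# A bin with no sample rows cannot be reweighted, so it gets NA.
bin_weight <- ifelse(samp_pct > 0, pop_pct / samp_pct, NA_real_)

# Hypothetical helper: look up the weight of the bin each value of E falls in
poststrat_weight <- function(E) {
  bin <- cut(E, breaks = breaks, include.lowest = TRUE,
             right = TRUE, labels = FALSE)
  bin_weight[bin]
}

# Example call on a few made-up prices (not values from the real sample)
w <- poststrat_weight(c(5, 10, 10.01, 45, 500))
print(w)
```

The resulting vector could then be passed as the `weights` argument of a model-fitting function such as `glm()`. Bins with very few sample rows (for example 40.01-50 or 70.01-80) receive large weights, so a handful of rows would carry much of the weighted fit.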

The best way to do this would be to use prediction intervals. It is clear that most of your sample has very low values of E. This means that you can be relatively confident about the predicted value of R for low values of E. However, as you move farther away from the range of your data (i.e., toward very high values of E), you are much less confident about your predictions for R.
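As a rough illustration of this point, the sketch below fits a logistic regression to simulated data and compares interval widths at low and high E. Everything in it is an assumption made for the example: the simulated prices, the coefficients -2 and 0.02, and the choice of logistic regression for R. Since R is a probability, the intervals shown are approximate 95% confidence intervals for the fitted probability, built on the link scale, rather than prediction intervals in the strict sense.

```r
set.seed(42)
n <- 10000

# Simulated prices: about 93% fall in 0-10, the rest are spread over 10-130
E <- ifelse(runif(n) < 0.93, runif(n, 0, 10), runif(n, 10, 130))

# Simulated return indicator with an assumed logistic relationship to E
R <- rbinom(n, size = 1, prob = plogis(-2 + 0.02 * E))

fit <- glm(R ~ E, family = binomial)

# Predict on the link (log-odds) scale with standard errors
new_E <- data.frame(E = c(5, 50, 110))
pred  <- predict(fit, newdata = new_E, type = "link", se.fit = TRUE)

# Approximate 95% interval on the link scale, mapped back to probabilities
lower <- plogis(pred$fit - 1.96 * pred$se.fit)
upper <- plogis(pred$fit + 1.96 * pred$se.fit)

intervals <- data.frame(E = new_E$E,
                        estimate = plogis(pred$fit),
                        lower = lower,
                        upper = upper,
                        width = upper - lower)
print(intervals)
```

In this simulation the interval at E = 110 is wider than the one at E = 5, because few rows have high E.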
