
How can I transform a random distribution to (predefined) integer values in R?

R has many functions that generate simulated data from a specific distribution. For example:

rnorm(N, 0, 1)
runif(N, 0, 1)

These give me a set of random values that are real numbers. For some reason, however, I would like to get a result restricted to a set of integers, for example the integers from 1 to 10, something like c(1:10).

Is there a simple function that can transform, for example, a sample of real values from a normal distribution into a (pseudo)normal distribution over a specified range of integer values?

EDIT: In social sciences, the observed variables are most often questionnaire scores. The results of these questionnaires are scored in integer numbers. The subject cannot score 1.5 points, only 1 or 2 points. Nevertheless, a normal distribution of the results can be obtained. I am looking for a function that generates such a distribution within integer results.

Other background: Standard Ten Scale converts a range of normalized results to an integer range. I am looking for a similar function for any distribution and any range of "stens".

To bin any real-valued variable, including samples from a continuous distribution, you can use cut, followed by a cast of the generated factor variable to an integer variable.

If you wish to convert to a Standard Ten score then the breaks in the cut function will be based on the Z scores, which in the case of a standard Normal are the sample values.

# Draw the sample once so that both steps below use the same data:
x <- rnorm(1000)
sten_breaks <- c(-Inf, -2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, Inf)
# Generate the binned variable:
as.integer(cut(x, breaks=sten_breaks))
# Distribution of the binned variable:
table(cut(x, breaks=sten_breaks))
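When the raw scores are not already standard Normal, the same breaks can be applied after converting to Z scores. The following is a minimal sketch of that idea; the helper name `to_sten` and its arguments `mu` and `sigma` are hypothetical, and by default it standardises with the sample mean and standard deviation, which is an assumption rather than part of the Standard Ten definition.

```r
# Hypothetical helper: map raw scores to integer stens 1..10 via Z scores.
# mu and sigma default to the sample estimates; pass norm values if you have them.
to_sten <- function(x, mu = mean(x), sigma = sd(x)) {
  z <- (x - mu) / sigma
  as.integer(cut(z, breaks = c(-Inf, seq(-2, 2, by = 0.5), Inf)))
}

set.seed(1)                              # arbitrary seed, for reproducibility only
raw <- rnorm(1000, mean = 50, sd = 10)   # e.g. questionnaire-like raw scores
sten <- to_sten(raw)
table(sten)                              # counts per sten
```

Because cut uses right-closed intervals by default, a Z score of exactly 0 lands in sten 5, not 6.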

For a uniform RV from 0 to 1, the corresponding code to generate a 10 level discrete RV and examine its distribution may be:

u <- runif(1000)  # draw once so the table describes the same sample
as.integer(cut(u, breaks=c(-Inf, 1:9*0.1, Inf)))
table(cut(u, breaks=c(-Inf, 1:9*0.1, Inf)))

In general, you need to decide on the breaks (the boundaries of the bins). That is a conceptual question. You could base them on the properties of the distribution you are sampling from (as in the case of the Standard Ten scale). Or you could base them on the distribution of the sample itself, in which case the quantile function may be useful.
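As an illustration of the sample-based option, here is a minimal sketch that uses the empirical quantiles of the sample as breaks. The variable names and the choice of an exponential sample are assumptions made for the example. Note that equal-count bins make the resulting integer variable roughly uniform, so this approach does not preserve the shape of the original distribution.

```r
# Sketch: bin a sample into k integer levels using its own quantiles as breaks.
set.seed(1)                      # arbitrary seed, for reproducibility only
x <- rexp(1000)                  # any continuous sample; skewed here on purpose
k <- 10                          # number of integer levels wanted
brks <- quantile(x, probs = seq(0, 1, length.out = k + 1))
# include.lowest = TRUE keeps the sample minimum inside the first bin
x_int <- as.integer(cut(x, breaks = brks, include.lowest = TRUE))
table(x_int)                     # roughly equal counts per level
```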

For completeness, note that a binned continuous random variable is a discrete categorical random variable whose levels are the bins, each occurring with the probability mass of its bin. In the trivial case, if you bin a continuous uniform into 10 equal-sized bins, the generated discrete variable is a categorical with 10 events of equal probability. In the case of the standard Normal and the Standard Ten scale, the probability of each bin can be computed from the cdf. E.g., the probability of (-Inf, -2] is pnorm(-2) - pnorm(-Inf), and so on for the remaining bins. These values define the Standard Ten score distribution as a categorical distribution with those event probabilities. See package extraDistr for functions to sample from a categorical.
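The bin probabilities described above can be computed in one step with diff and pnorm, and base R's sample is enough to draw from the resulting categorical distribution. This is a minimal sketch with assumed variable names, not the only way to do it.

```r
# Sketch: sample sten scores directly from their categorical distribution.
breaks <- c(-Inf, seq(-2, 2, by = 0.5), Inf)   # the Standard Ten breaks used earlier
p <- diff(pnorm(breaks))                       # probability of each of the 10 bins
round(p, 3)                                    # symmetric, largest in the middle

set.seed(1)                                    # arbitrary seed, for reproducibility only
sten <- sample(1:10, size = 1000, replace = TRUE, prob = p)
table(factor(sten, levels = 1:10))             # counts per sten, empty levels kept
```

Sampling this way skips the continuous draw entirely, which is convenient when only the integer scores are needed.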

A binomial distribution takes only a fixed, finite number of discrete values and approximates a normal distribution:

y <- table(rbinom(500, 10, prob = .5))
x <- as.integer(names(y))  # observed values, as numbers rather than strings
y <- as.integer(y)         # their counts
plot(x = x, y = y, type = "h")
points(x, y, pch = 15)


After trying many different options, I decided that the solution to my problem is simply to rescale the random variable to a different range and round it. For this, I created another post, about the transformation, and used the transform function from that post. This lets me roughly preserve the distribution of a given variable and its properties while adjusting it to a different minimum and maximum. It also lets me use any random distribution as input.

# this is the scaling function by Allan Cameron (see the other post);
# it maps x linearly onto the range 1..y
linscale_to_int <- function(y, x) (x - min(x)) * (y - 1) / diff(range(x)) + 1

# you can try any of these distributions
# x.rand <- rnorm(500,0,1)
# x.rand <- runif(50, 0, 1)
x.rand <- rnorm(100)
# rescale the variable to the range 1..20
y.rand <- linscale_to_int(20,x.rand)
# and then round it to integers
y.round <- round(y.rand)
# we may check its distribution with a plot
x.pl <- as.integer(names(table(y.round)))
y.pl <- as.integer(table(y.round))
plot(x = x.pl, y = y.pl, type = "h")
# or check it with a normality test
shapiro.test(y.round)

Note: not every run of this algorithm gives a fully satisfactory result. With a small random sample, rounding can produce a variable whose distribution is not very close to normal. It works well enough for me anyway. Alternatively, one can repeat the random draw in a loop and keep the best result (the one with the largest shapiro.test(y.round)$p.value).
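The loop idea mentioned in the note could look like the sketch below. The number of tries, the sample size, and the variable names are assumptions chosen for illustration. Keep in mind that once a sample has been selected for its p-value, that p-value no longer has its usual interpretation as a test result.

```r
# Sketch: repeat the draw, rescale and round, and keep the candidate
# with the largest Shapiro-Wilk p-value.
linscale_to_int <- function(y, x) (x - min(x)) * (y - 1) / diff(range(x)) + 1

set.seed(1)          # arbitrary seed, for reproducibility only
n_tries <- 50        # assumed number of repetitions
best_p <- -Inf
best <- NULL
for (i in seq_len(n_tries)) {
  cand <- round(linscale_to_int(20, rnorm(100)))
  p <- shapiro.test(cand)$p.value
  if (p > best_p) {
    best_p <- p
    best <- cand
  }
}
range(best)          # the rescaling pins the minimum to 1 and the maximum to 20
best_p               # largest p-value seen across the tries
```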

Thanks to everyone for solutions provided!
