
How can I transform a random distribution to (predefined) integer values in R?

In R there are many functions with which I can generate simulated data according to a specific distribution. For example:

rnorm(N, 0, 1)
runif(N, 0, 1)

These give me a set of random values that are real numbers. However, I would like to get a result based on a set of integers, for example the integers from 1 to 10, i.e. 1:10.

Is there a simple function that can transform, for example, a normally distributed sample of real values into a (pseudo)normal distribution over a given range of integer values?

EDIT: In the social sciences, the observed variables are most often questionnaire scores. These questionnaires are scored in integers: a subject cannot score 1.5 points, only 1 or 2 points. Nevertheless, the results can be approximately normally distributed. I am looking for a function that generates such a distribution over integer results.

Other background: the Standard Ten (sten) scale converts a range of standardized results to an integer range. I am looking for a similar function for any distribution and any range of "stens".

To bin any real-valued variable, including samples from a continuous distribution, you can use cut and then cast the resulting factor to an integer.

If you wish to convert to a Standard Ten score, the breaks passed to cut are based on the Z scores, which for a standard Normal are the sample values themselves.

# Draw the sample once so both steps below use the same data:
z <- rnorm(1000)
sten_breaks <- c(-Inf, -2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, Inf)
# Generate the binned variable (integers 1 to 10):
sten <- as.integer(cut(z, breaks = sten_breaks))
# Distribution of the binned variable:
table(sten)
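The same idea extends to a normal sample that is not standard, by converting to Z scores first. The following is a minimal sketch, not part of the original answer: `to_sten`, `mu` and `sigma` are hypothetical names, and by default the mean and standard deviation are estimated from the sample itself.

```r
# Hypothetical helper: map a numeric sample to sten scores 1..10.
# mu and sigma default to the sample estimates (an assumption of this sketch).
to_sten <- function(x, mu = mean(x), sigma = sd(x)) {
  z <- (x - mu) / sigma                        # standardise to Z scores
  breaks <- c(-Inf, seq(-2, 2, by = 0.5), Inf) # 11 cut points -> 10 bins
  as.integer(cut(z, breaks = breaks))
}

set.seed(1)
raw <- rnorm(1000, mean = 50, sd = 10)  # e.g. raw questionnaire totals
sten <- to_sten(raw)
table(sten)
```

If the population mean and standard deviation are known, they can be passed as `mu` and `sigma` instead of being estimated.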

For a uniform random variable on 0 to 1, the corresponding code to generate a 10-level discrete variable and examine its distribution may be:

u <- runif(1000)
# Ten equal-width bins on [0, 1]:
level <- as.integer(cut(u, breaks = c(-Inf, 1:9 * 0.1, Inf)))
table(level)

In general, you need to decide on the breaks (the boundaries of the bins). That is a conceptual question. You could use the properties of the distribution you are sampling from (as in the Standard Ten case), or you could use the distribution of the samples. If you wish to use the distribution of the samples, the quantile function may be useful.
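As an illustration of the sample-based option, here is a rough sketch that takes the breaks from the sample's own quantiles. The helper name `bin_by_quantiles` and the parameter `k` are made up for this example. Note that equally spaced quantiles give roughly equal counts per level, so the binned variable is close to uniform rather than normal.

```r
# Hypothetical helper: bin x into k integer levels using its own quantiles.
bin_by_quantiles <- function(x, k = 10) {
  # k + 1 cut points at probabilities 0, 1/k, ..., 1
  breaks <- quantile(x, probs = seq(0, 1, length.out = k + 1), names = FALSE)
  # include.lowest = TRUE keeps the sample minimum inside the first bin
  as.integer(cut(x, breaks = breaks, include.lowest = TRUE))
}

set.seed(1)
x <- rexp(1000)                 # any continuous sample works here
score <- bin_by_quantiles(x, k = 10)
table(score)                    # roughly equal counts per level
```

This assumes the sample has no heavy ties; with many repeated values the quantiles can coincide and cut will refuse duplicated breaks.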

For completeness, note that a binned continuous random variable is a discrete categorical random variable, where the probability of each level is the probability of the corresponding bin. In the trivial case, if you bin a continuous uniform into 10 equal-sized bins, the generated discrete variable is a categorical with 10 events of equal probability. For the standard Normal and the Standard Ten scale, the probability of each bin can be computed from the CDF. E.g., the probability of (-Inf, -2] is pnorm(-2) - pnorm(-Inf), and so on for the other bins. These values define the Standard Ten score distribution as a categorical distribution. See package extraDistr for functions to sample from a categorical.
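The bin probabilities described above can be computed in one step and used to sample sten scores directly. This is a small sketch using only base R's `sample` rather than an add-on package; the variable names are illustrative.

```r
# Bin boundaries of the Standard Ten scale on the Z-score axis
breaks <- c(-Inf, -2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, Inf)

# Probability of each bin: differences of the normal CDF at the boundaries
# (pnorm(-Inf) is 0 and pnorm(Inf) is 1, so the ten values sum to 1)
p <- diff(pnorm(breaks))

set.seed(1)
# Draw sten scores directly from the categorical distribution
sten <- sample(1:10, size = 1000, replace = TRUE, prob = p)
table(sten)
```

Sampling this way skips the continuous draw entirely, which may be convenient when only the integer scores are needed.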

A binomial distribution takes a fixed, finite set of integer values and approximates a normal distribution:

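# Counts of each observed value of a Binomial(10, 0.5) sample
counts <- table(rbinom(500, 10, prob = .5))
x <- as.integer(names(counts))  # observed integer values
y <- as.integer(counts)         # their frequencies
plot(x = x, y = y, type = "h")
points(x, y, pch = 15)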
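To land on a specific integer range such as 1 to 10, the binomial draw can be shifted. The sketch below assumes a symmetric shape (prob = 0.5); `lo` and `hi` are illustrative names for the desired bounds.

```r
lo <- 1    # smallest allowed score
hi <- 10   # largest allowed score

set.seed(1)
# rbinom() returns integers in 0..(hi - lo); adding lo shifts them to lo..hi
score <- lo + rbinom(500, size = hi - lo, prob = 0.5)
table(score)
```

Changing `prob` skews the distribution toward one end of the range while keeping the same set of possible values.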


After trying many different options, I decided that the solution to my problem is simply to rescale the random variable to a different range and round it. For this, I created another post about the transformation and used a transform function from it. This roughly maintains the distribution of a given variable and its properties while adjusting it to a different minimum and maximum. It also lets me use a sample from any distribution as input.

# Scaling function by Allan Cameron (from the other post):
# linearly maps x onto the range 1..y
linscale_to_int <- function(y, x) (x - min(x)) * (y - 1) / diff(range(x)) + 1

# you can try any of these distributions
# x.rand <- rnorm(500, 0, 1)
# x.rand <- runif(50, 0, 1)
x.rand <- rnorm(100)
# rescale the variable to the range 1..20
y.rand <- linscale_to_int(20, x.rand)
# and then round it to integers
y.round <- round(y.rand)
# check its distribution with a plot
counts <- table(y.round)
x.pl <- as.integer(names(counts))
y.pl <- as.integer(counts)
plot(x = x.pl, y = y.pl, type = "h")
# or check it with a normality test
shapiro.test(y.round)

Note: not every run of this algorithm gives a fully satisfactory result. With a small random sample, rounding can always produce a variable whose distribution is not very close to normal. But it works for me anyway. Alternatively, one can repeat the random draw in a loop and keep the best result (the one with the largest shapiro.test(y.round)$p.value).
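The loop mentioned in the note could look roughly like this. It is only a sketch: `best_rounded_sample`, `upper` and `tries` are made-up names, the input is assumed to be rnorm, and picking the largest p-value merely selects the draw that the Shapiro-Wilk test finds least distinguishable from normal.

```r
# Scaling function from the code above: maps x linearly onto 1..y
linscale_to_int <- function(y, x) (x - min(x)) * (y - 1) / diff(range(x)) + 1

# Hypothetical helper: repeat the draw `tries` times and keep the rounded
# sample with the largest Shapiro-Wilk p-value.
best_rounded_sample <- function(n, upper, tries = 50) {
  best <- NULL
  best_p <- -Inf
  for (i in seq_len(tries)) {
    cand <- round(linscale_to_int(upper, rnorm(n)))
    p <- shapiro.test(cand)$p.value
    if (p > best_p) {
      best <- cand
      best_p <- p
    }
  }
  list(values = best, p.value = best_p)
}

set.seed(1)
res <- best_rounded_sample(n = 100, upper = 20)
table(res$values)
res$p.value
```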

Thanks to everyone for the solutions provided!
