Normalize data in R data.frame column
Suppose I have the following data:
a <- data.frame(var1=letters,var2=runif(26))
Suppose I want to scale every value in var2 such that the sum of the var2 column is equal to 1 (basically turn the var2 column into a probability distribution).
I have tried the following:
a$var2 <- lapply(a$var2,function(x) (x-min(a$var2))/(max(a$var2)-min(a$var2)))
This not only gives an overall sum greater than 1 but also turns the var2 column into a list, on which I can't do operations like sum.
Is there any valid way of turning this column into a probability distribution?
Suppose you have a vector x with non-negative values (not all zero) and no NA. You can normalize it by
x / sum(x)
which is a proper probability mass function.
The transform you used:
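Applied to the data frame from the question, this might look like the sketch below. The set.seed() call and the stopifnot() guard are my additions for reproducibility and safety, not part of the original question.

```r
# Rebuild the example data; set.seed() is only here for reproducibility.
set.seed(1)
a <- data.frame(var1 = letters, var2 = runif(26))

# Guard against inputs for which x / sum(x) is not a valid distribution.
stopifnot(!anyNA(a$var2), all(a$var2 >= 0), sum(a$var2) > 0)

# Divide every value by the column total (vectorized, no loop needed).
a$var2 <- a$var2 / sum(a$var2)

# The column now sums to 1, up to floating-point error.
isTRUE(all.equal(sum(a$var2), 1))
```

Comparing with all.equal() rather than == avoids a spurious FALSE caused by floating-point rounding.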
(x - min(x)) / (max(x) - min(x))
only rescales x onto [0, 1], but does not ensure "summation to 1".
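A small hand-checkable example makes the difference concrete. The vector x below is made up for illustration; it is not the data from the question.

```r
# Toy input chosen so the arithmetic is easy to verify by hand.
x <- c(2, 4, 6, 10)

# Min-max scaling: the smallest value maps to 0 and the largest to 1.
minmax <- (x - min(x)) / (max(x) - min(x))
minmax        # 0.00 0.25 0.50 1.00
sum(minmax)   # 1.75, so this is not a probability distribution

# Dividing by the total is what forces the sum to 1.
prob <- x / sum(x)
isTRUE(all.equal(sum(prob), 1))   # TRUE
```

Min-max scaling always produces at least one 0 and one 1, so its sum is at least 1, and exceeds 1 whenever any value lies strictly between the minimum and maximum.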
Regarding your code
There is no need to use lapply here:
lapply(a$var2, function(x) (x-min(a$var2)) / (max(a$var2) - min(a$var2)))
Just use a vectorized operation:
a$var2 <- with(a, (var2 - min(var2)) / (max(var2) - min(var2)))
As you said, lapply gives you a list, and that is what "l" in "lapply" refers to. You can use unlist to collapse that list into a vector; or, you can use sapply, where "s" implies "simplification (when possible)".
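To see the three options side by side, here is a minimal sketch on a made-up vector. The helper name scale01 is hypothetical and only wraps the min-max formula from the question.

```r
x <- c(2, 4, 6, 10)

# Hypothetical helper: the per-element function passed to lapply in the question.
scale01 <- function(v) (v - min(x)) / (max(x) - min(x))

# lapply always returns a list, here with one length-one numeric per element.
as_list <- lapply(x, scale01)
is.list(as_list)                  # TRUE
# sum(as_list) would stop with an error, because sum() does not accept a list.

# Either of these gives back a plain numeric vector.
from_unlist <- unlist(as_list)
from_sapply <- sapply(x, scale01)

# The vectorized form gives the same values without any apply call.
vectorized <- (x - min(x)) / (max(x) - min(x))
isTRUE(all.equal(from_sapply, vectorized))   # TRUE
```

The vectorized form is still the one to prefer: the apply versions recompute min(x) and max(x) for every element and are slower on long vectors.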