Normalizing data in a column of an R data.frame

Question

Suppose I have the following data:

a <- data.frame(var1=letters,var2=runif(26))

Suppose I want to scale every value in var2 so that the var2 column sums to 1 (basically turning the var2 column into a probability distribution).

I have tried the following:

a$var2 <- lapply(a$var2,function(x) (x-min(a$var2))/(max(a$var2)-min(a$var2)))

This not only gives an overall sum greater than 1, but also turns the var2 column into a list, on which I can't do operations like sum.

Is there any valid way of turning this column into a probability distribution?

Answer 1

Suppose you have a vector x with non-negative values and no NA. You can normalize it by

x / sum(x)

which is a proper probability mass function (provided sum(x) is positive).
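
As a minimal sketch, this is how the normalization might be applied to the data frame from the question. The set.seed call and the stopifnot guard are additions for illustration, not part of the original answer; the guard just checks the preconditions stated above.

```r
set.seed(1)  # assumption: fixed seed only so the example is reproducible
a <- data.frame(var1 = letters, var2 = runif(26))

# Check the preconditions: no NA, no negative values, positive total
stopifnot(!anyNA(a$var2), all(a$var2 >= 0), sum(a$var2) > 0)

# Vectorized division: every element is divided by the column total
a$var2 <- a$var2 / sum(a$var2)

is.numeric(a$var2)                  # the column stays a numeric vector
isTRUE(all.equal(sum(a$var2), 1))   # the column now sums to 1 (up to rounding)
```

Comparing with all.equal rather than == avoids spurious failures from floating-point rounding.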

The transform you used:

(x - min(x)) / (max(x) - min(x))

only rescales x onto [0, 1], but does not ensure that the values sum to 1.
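
A small worked example may make the difference concrete. The vector below is hypothetical, chosen only so the arithmetic is easy to follow by hand.

```r
x <- c(1, 2, 3, 4)  # hypothetical values, not from the question

# Min-max rescaling: maps the smallest value to 0 and the largest to 1
minmax <- (x - min(x)) / (max(x) - min(x))
range(minmax)   # 0 and 1
sum(minmax)     # 2, so this is not a probability distribution

# Dividing by the total: the values sum to 1
p <- x / sum(x)
p               # 0.1 0.2 0.3 0.4
sum(p)          # 1
```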

Regarding your code

There is no need to use lapply here:

lapply(a$var2, function(x) (x-min(a$var2)) / (max(a$var2) - min(a$var2)))

Just use a vectorized operation:

a$var2 <- with(a, (var2 - min(var2)) / (max(var2) - min(var2)))

As you said, lapply gives you a list, and that is what the "l" in "lapply" refers to. You can use unlist to collapse that list into a vector; or you can use sapply, where the "s" implies "simplification (when possible)".
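
The sketch below compares the options just described on a throwaway vector. The names v and rescale are hypothetical helpers introduced for illustration, and the seed is fixed only for reproducibility.

```r
set.seed(1)  # assumption: fixed seed for reproducibility
v <- runif(26)

# Hypothetical helper: min-max rescales one value relative to the whole of v
rescale <- function(x) (x - min(v)) / (max(v) - min(v))

as_list    <- lapply(v, rescale)   # a list of 26 length-one numerics
flattened  <- unlist(as_list)      # collapsed back into a numeric vector
simplified <- sapply(v, rescale)   # sapply simplifies to a vector directly
vectorized <- (v - min(v)) / (max(v) - min(v))  # no apply function needed

is.list(as_list)                              # TRUE
is.numeric(simplified)                        # TRUE
isTRUE(all.equal(flattened, vectorized))      # TRUE
isTRUE(all.equal(simplified, vectorized))     # TRUE
```

All three routes give the same numbers; the vectorized form is simply the shortest and avoids calling the function once per element.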

Answer 1: 3, accepted, 2016-09-05 02:50:05
