Generate a group column based on a column's data
I am new to R, and I am trying to add a group column based on the data in another column.
Example of the data.frame:
1 11.3178501 4 9 11.618880
2 10.3969713 20 8 11.047486
8 9.5067421 14 7 10.079806
6 6.6135932 6 6 7.002669
4 5.4157174 2 5 5.566232
17 3.8860793 5 4 4.235564
16 3.8183699 15 3 4.406416
11 1.2574765 18 2 1.885113
15 0.7084411 7 1 1.130990
The first column is the row index assigned by R; I sorted the data, so the order is different. What I am trying to do is add a column that defines the bracket each row belongs to, based on the value in the last column. So if the last column's value is between
0-5 => 1, 5-10 => 2
etc., then we add a new column at the end: group -> 1,2,3...
16 3.8183699 15 3 4.406416 1
11 1.2574765 18 2 1.885113 2
15 0.7084411 7 1 1.130990 2
I tried the following:
dataFrame$column4 < 5
but this gave me a boolean value, so I thought I would multiply it by 1, and then I got the following:
0 0 0 0 0 1 1 1 1
I am not sure if I am on the right track.
Even given your comment, I would still suggest cut. It is in base R and is usually not considered a fancy function.
df <- structure(list(V1 = c(1L, 2L, 8L, 6L, 4L, 17L, 16L, 11L, 15L),
V2 = c(11.3178501, 10.3969713, 9.5067421, 6.6135932, 5.4157174,
3.8860793, 3.8183699, 1.2574765, 0.7084411), V3 = c(4L, 20L,
14L, 6L, 2L, 5L, 15L, 18L, 7L), V4 = c(9L, 8L, 7L, 6L, 5L,
4L, 3L, 2L, 1L), V5 = c(11.61888, 11.047486, 10.079806, 7.002669,
5.566232, 4.235564, 4.406416, 1.885113, 1.13099)), .Names = c("V1",
"V2", "V3", "V4", "V5"), class = "data.frame", row.names = c(NA,
-9L))
df$groups <- cut(df$V5, seq(0,15, by=5))
> df
V1 V2 V3 V4 V5 groups
1 1 11.3178501 4 9 11.618880 (10,15]
2 2 10.3969713 20 8 11.047486 (10,15]
3 8 9.5067421 14 7 10.079806 (10,15]
4 6 6.6135932 6 6 7.002669 (5,10]
5 4 5.4157174 2 5 5.566232 (5,10]
6 17 3.8860793 5 4 4.235564 (0,5]
7 16 3.8183699 15 3 4.406416 (0,5]
8 11 1.2574765 18 2 1.885113 (0,5]
9 15 0.7084411 7 1 1.130990 (0,5]
>
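One thing worth checking before relying on these bins is what happens at the edges. The sketch below uses a few made-up values (not from the question's data) to show that, with the default arguments, cut builds right-closed intervals, so a value exactly equal to the lowest break, or anything above the highest break, comes back as NA. Passing include.lowest = TRUE closes the first interval on the left.

```r
# Illustrative values only: two sit exactly on a break, one is out of range
x <- c(0, 5, 15, 16)
breaks <- seq(0, 15, by = 5)

# Default: intervals are (0,5], (5,10], (10,15]; 0 and 16 fall outside
default_bins <- cut(x, breaks)
print(default_bins)

# include.lowest = TRUE turns the first interval into [0,5], so 0 is kept
closed_bins <- cut(x, breaks, include.lowest = TRUE)
print(closed_bins)
```

If your data can contain zeros or values above 15, you would probably want include.lowest = TRUE and a wider set of breaks.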
Finally, if integers are what you want, you can coerce the groups factor to integers using as.integer.
df$groups <- as.integer(df$groups)
> as.integer(df$groups)
[1] 3 3 3 2 2 1 1 1 1
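As a possible shortcut, cut also accepts labels = FALSE, in which case it returns the integer bin codes directly rather than a factor. A minimal sketch, using the V5 values from the question copied into a standalone vector (the name v5 is just for illustration):

```r
# V5 column from the question, as a plain numeric vector
v5 <- c(11.61888, 11.047486, 10.079806, 7.002669, 5.566232,
        4.235564, 4.406416, 1.885113, 1.13099)

# labels = FALSE makes cut() return integer codes instead of a factor
group_codes <- cut(v5, seq(0, 15, by = 5), labels = FALSE)
print(group_codes)
```

This should give the same codes as the two-step version, without the intermediate factor.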
Justin's answer is great; yet if you want to implement a dumber cut on your own, you can do it this way. First, define a vector with your thresholds, like thre<-c(0,5,10,15), then do an outer comparison of your values against those thresholds with the greater-than operator, and sum the rows of the resulting matrix, like this:
rowSums(outer(values,thre,'>'))
And voila, all values in (0,5] are now 1, those in (5,10] are 2, etc.
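To see why this works, it may help to look at the intermediate matrix. The sketch below takes four of the V5 values from the question (the variable name above is only for illustration): each row corresponds to one value, each column to one threshold, and a cell is TRUE when the value exceeds that threshold. Counting the TRUEs in a row therefore gives the number of thresholds the value has passed, which is its group.

```r
values <- c(11.61888, 7.002669, 4.406416, 1.13099)
thre <- c(0, 5, 10, 15)

# 4 x 4 logical matrix: above[i, j] is TRUE when values[i] > thre[j]
above <- outer(values, thre, ">")
print(above)

# TRUE counts as 1, so each row sum is the number of thresholds exceeded
groups <- rowSums(above)
print(groups)
```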
Wrapped in a function, it could look like this:
ultraDumbCut<-function(v,thre) rowSums(outer(v,thre,'>'))
Made a bit more intelligent, it could look like this:
dumbCut<-function(v,jump=5,thre=seq(0,max(v),by=jump)) rowSums(outer(v,thre,'>'))
so that dumbCut(1:7) is 1 1 1 1 1 2 2, dumbCut(1:7,3) is 1 1 1 2 2 2 3 and dumbCut(1:7,thre=c(0,2,3,5)) is 1 1 2 3 3 4 4.
The next step is to convert the output to a factor (using bare numbers for categories in R is simply masochism) and generate meaningful level names, which basically replicates the actual cut.
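A rough sketch of that last step, under the assumption that the thresholds are sorted in increasing order. The function name dumbCutFactor is made up for this example, and the labels are built with plain paste0, so they will only match cut's own labels for simple thresholds like these.

```r
# Hypothetical helper: the rowSums/outer trick plus factor labels.
# Assumes thre is sorted in increasing order.
dumbCutFactor <- function(v, thre) {
  codes <- rowSums(outer(v, thre, ">"))
  # Labels such as "(0,5]" from each pair of consecutive thresholds
  labs <- paste0("(", head(thre, -1), ",", tail(thre, -1), "]")
  # Codes 0 (at or below the first threshold) and length(thre)
  # (above the last one) have no level and therefore become NA
  factor(codes, levels = seq_along(labs), labels = labs)
}

res <- dumbCutFactor(c(11.61888, 7.002669, 4.406416, 20), c(0, 5, 10, 15))
print(res)
```

Out-of-range values end up as NA here, which mirrors what cut does with its default arguments.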