R data.table: how to "label" consecutive values in a column?

Question

I have the following data.table (though it's fine if you treat it as a data.frame):

library(data.table)

dt <- data.table(
  first_column  = c("item1", "item2", "item3", "item4", "item5", "item6", "item7"),
  second_column = c("cat1", "cat1", "cat1", "cat2", "cat2", "cat2", "cat2"),
  third_column  = c(50, 10, 18, 3092, 731, 189, 1991)
)

> dt
   first_column second_column third_column
1:        item1          cat1           50
2:        item2          cat1           10
3:        item3          cat1           18
4:        item4          cat2         3092
5:        item5          cat2          731
6:        item6          cat2          189
7:        item7          cat2         1991

I would like to:

(1) create a column which is 1 if the value is <= 1000

(2) then number these unique groupings of 1s

The resulting data.table would look like this:

> dt

  first_column second_column  third_column  labels
0        item1          cat1            50     1
1        item2          cat1            10     1
2        item3          cat1            18     1
3        item4          cat2          3092     0
4        item5          cat2           731     2
5        item6          cat2           189     2
6        item7          cat2          1991     0

This would create a column of all zeros and ones:

dt[, new := as.integer(third_column <= 1000)]

How would I then label these "groupings" of 1s?

Answer 1

We group by 'second_column', specify the logical condition ( third_column <= 1000 ) in 'i', and assign ( := ) .GRP to 'labels'. In the next step we replace the remaining NA values with 0.

dt[third_column <= 1000, labels := .GRP, by = second_column][is.na(labels), labels := 0L][]
#     first_column second_column third_column labels
#1:        item1          cat1           50      1
#2:        item2          cat1           10      1
#3:        item3          cat1           18      1
#4:        item4          cat2         3092      0
#5:        item5          cat2          731      2
#6:        item6          cat2          189      2
#7:        item7          cat2         1991      0
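
To see why the second step is needed, the chained call can be split into two statements. This is only a sketch on a fresh copy of the sample data; the names 'dt2' and 'na_rows' are introduced here for illustration. Rows that fail the condition in 'i' are never assigned, so they stay NA until they are filled with 0.

```r
library(data.table)

# Fresh copy of the question's sample data (dt2 is an illustrative name)
dt2 <- data.table(
  first_column  = paste0("item", 1:7),
  second_column = c("cat1", "cat1", "cat1", "cat2", "cat2", "cat2", "cat2"),
  third_column  = c(50, 10, 18, 3092, 731, 189, 1991)
)

# Step 1: only rows passing the filter in `i` receive a group number
dt2[third_column <= 1000, labels := .GRP, by = second_column]

# Rows skipped by the filter are NA at this point (rows 4 and 7 here)
na_rows <- which(is.na(dt2$labels))

# Step 2: fill the untouched rows with 0
dt2[is.na(labels), labels := 0L]
```
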

A second, more compact option takes the cumulative sum of the logical vector ( !duplicated(second_column) ), which increases by one each time a new category first appears, and multiplies it by another logical vector ( third_column <= 1000 ), which zeroes out the rows above the threshold.

dt[, labels := cumsum(!duplicated(second_column))*(third_column <= 1000)]
dt
#    first_column second_column third_column labels
#1:        item1          cat1           50      1
#2:        item2          cat1           10      1
#3:        item3          cat1           18      1
#4:        item4          cat2         3092      0
#5:        item5          cat2          731      2
#6:        item6          cat2          189      2
#7:        item7          cat2         1991      0
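
Both options above take the group number from 'second_column', so two separate runs of small values inside the same category would share a label. If the labels should instead follow runs of consecutive rows, one possible sketch uses rleid() from data.table. The data and the names 'dt3', 'ok' and 'run' below are made up for illustration and are not part of the original question.

```r
library(data.table)

# Hypothetical data: cat2 holds two separate runs of values <= 1000
dt3 <- data.table(
  second_column = c("cat1", "cat1", "cat2", "cat2", "cat2", "cat2"),
  third_column  = c(50, 10, 3092, 731, 1991, 189)
)

dt3[, ok := third_column <= 1000]   # TRUE where the value qualifies
dt3[, run := rleid(ok)]             # run id changes whenever `ok` flips

# Start every row at 0, then number only the TRUE runs in order of appearance
dt3[, labels := 0L]
dt3[(ok), labels := .GRP, by = run]

dt3[, c("ok", "run") := NULL]       # drop the helper columns
# labels: 1 1 0 2 0 3
```

Passing 'second_column' to rleid() as well, as in rleid(second_column, ok), would additionally start a new run whenever the category changes.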

Solution 1
3, accepted, 2017-04-24 20:07:26
