
R data.table: How to “label” consecutive values in a column?

I have the following data.table (a data.frame would work just as well):

library(data.table)

dt <- data.table(
  first_column  = c("item1", "item2", "item3", "item4", "item5", "item6", "item7"),
  second_column = c("cat1", "cat1", "cat1", "cat2", "cat2", "cat2", "cat2"),
  third_column  = c(50, 10, 18, 3092, 731, 189, 1991)
)

> dt
   first_column second_column third_column
1:        item1          cat1           50
2:        item2          cat1           10
3:        item3          cat1           18
4:        item4          cat2         3092
5:        item5          cat2          731
6:        item6          cat2          189
7:        item7          cat2         1991

I would like to:

(1) create a column which is 1 if the value is <= 1000, and 0 otherwise

(2) then number these groupings of 1s, so that each grouping gets its own label

The resulting data.table would look like this:

> dt

  first_column second_column  third_column  labels
0        item1          cat1            50     1
1        item2          cat1            10     1
2        item3          cat1            18     1
3        item4          cat2          3092     0
4        item5          cat2           731     2
5        item6          cat2           189     2
6        item7          cat2          1991     0

This would create a column of zeros and ones:

dt$new <- as.integer(dt$third_column <= 1000)

How would I then label these "groupings" of 1s?

Group by second_column, put the logical condition ( third_column <= 1000 ) in i, and assign ( := ) .GRP to labels. Rows that fail the condition are left as NA, so a second chained step replaces those NA values with 0:

dt[third_column<=1000, labels := .GRP , second_column][is.na(labels), labels :=0][]
#     first_column second_column third_column labels
#1:        item1          cat1           50      1
#2:        item2          cat1           10      1
#3:        item3          cat1           18      1
#4:        item4          cat2         3092      0
#5:        item5          cat2          731      2
#6:        item6          cat2          189      2
#7:        item7          cat2         1991      0
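To make the two chained steps easier to follow, here is a minimal sketch that runs them separately on the question's data. The variable name step1 is introduced only for illustration, and 0L is used instead of 0 to keep the column integer.

```r
library(data.table)

dt <- data.table(
  first_column  = paste0("item", 1:7),
  second_column = c("cat1", "cat1", "cat1", "cat2", "cat2", "cat2", "cat2"),
  third_column  = c(50, 10, 18, 3092, 731, 189, 1991)
)

# Step 1: only rows with third_column <= 1000 are grouped and assigned.
# .GRP is the running index of each second_column group among those rows.
dt[third_column <= 1000, labels := .GRP, by = second_column]

# Snapshot for illustration: rows that failed the condition are still NA here.
step1 <- dt$labels

# Step 2: fill the untouched rows with 0 (0L keeps the column integer).
dt[is.na(labels), labels := 0L]

print(step1)
print(dt)
```

Because the grouping happens after the subset in i, .GRP only counts categories that have at least one qualifying row. On data where some category has no value <= 1000, the labels would therefore be numbered without a gap for that category.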

A second, more compact option takes the cumulative sum of the logical vector !duplicated(second_column), which numbers the categories in order of first appearance (assuming each category's rows are contiguous, as they are here), and multiplies it by the logical vector third_column <= 1000, which zeroes out the rows above the threshold:

dt[, labels := cumsum(!duplicated(second_column))*(third_column <= 1000)]
dt
#    first_column second_column third_column labels
#1:        item1          cat1           50      1
#2:        item2          cat1           10      1
#3:        item3          cat1           18      1
#4:        item4          cat2         3092      0
#5:        item5          cat2          731      2
#6:        item6          cat2          189      2
#7:        item7          cat2         1991      0
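Both approaches above key the label on second_column, which matches the expected output because each category here contains a single run of values <= 1000. If the goal is instead to number runs of consecutive qualifying rows regardless of category, one possible sketch is shown here. It is not part of the original answer; the table dt2, its extra row item8, and the helper columns ok and run_start are hypothetical and exist only to illustrate the idea.

```r
library(data.table)

# Hypothetical data: the question's rows plus item8, so that cat2
# contains two separate runs of values <= 1000.
dt2 <- data.table(
  first_column  = paste0("item", 1:8),
  second_column = c("cat1", "cat1", "cat1", "cat2", "cat2", "cat2", "cat2", "cat2"),
  third_column  = c(50, 10, 18, 3092, 731, 189, 1991, 40)
)

# TRUE where the row qualifies.
dt2[, ok := third_column <= 1000]

# A run starts where a row qualifies and the previous row does not.
# The first row has no predecessor, hence fill = FALSE.
dt2[, run_start := ok & !shift(ok, fill = FALSE)]

# Counting the starts numbers the runs; multiplying by ok zeroes the rest.
dt2[, labels := cumsum(run_start) * ok]

# Drop the helper columns.
dt2[, c("ok", "run_start") := NULL]

print(dt2)
```

On this hypothetical table the last row starts a third run, so it gets its own label, whereas labelling by second_column would give it the same label as item5 and item6. Which behaviour is wanted depends on whether "grouping" means category or consecutive run.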
