How to normalize a multiple-values column in a data.table in R
I have a data.table as below:
order  products  value
1000   A|B       10
2000   B|C       20
3000   A|C       30
4000   B|C|D      5
5000   C|D       15
I need to split the products column and normalize the table so that it looks like this:
order  prod.seq  prod.name  value
1000   1         A          10
1000   2         B          10
2000   1         B          20
2000   2         C          20
3000   1         A          30
3000   2         C          30
4000   1         B           5
4000   2         C           5
4000   3         D           5
5000   1         C          15
5000   2         D          15
I guess I could do it with a custom for loop, but I'd like to know a more idiomatic way to do it, for example with apply or ddply. Any suggestions?
First, convert the column to character (in case it is stored as a factor):
DT[,products:=as.character(products)]
Then you can split the string within each group:
DT[,{
x = strsplit(products,"\\|")[[1]]
list( prod.seq = seq_along(x), prod_name = x )
}, by=.(order,value)]
which gives
    order value prod.seq prod_name
 1:  1000    10        1         A
 2:  1000    10        2         B
 3:  2000    20        1         B
 4:  2000    20        2         C
 5:  3000    30        1         A
 6:  3000    30        2         C
 7:  4000     5        1         B
 8:  4000     5        2         C
 9:  4000     5        3         D
10:  5000    15        1         C
11:  5000    15        2         D
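If the exact column names and order from the question are wanted (prod.name rather than prod_name, with value last), something like the following sketch should work. The construction of DT is retyped from the question's sample, and the column types are an assumption; fixed = TRUE is used so that "|" is treated literally and needs no escaping.

```r
library(data.table)

# Sample data retyped from the question (numeric/character types are assumed)
DT <- data.table(order    = c(1000, 2000, 3000, 4000, 5000),
                 products = c("A|B", "B|C", "A|C", "B|C|D", "C|D"),
                 value    = c(10, 20, 30, 5, 15))

# One group per (order, value) pair; split that row's product string
res <- DT[, {
  x <- strsplit(products, "|", fixed = TRUE)[[1]]
  list(prod.seq = seq_along(x), prod.name = x)
}, by = .(order, value)]

# Reorder the columns by reference to match the layout asked for
setcolorder(res, c("order", "prod.seq", "prod.name", "value"))
res
```

Note that this relies on each (order, value) pair appearing in only one row, as in the sample; with repeated pairs only the first product string of each group would be split.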
Here is another option:
library(splitstackshape)
out = cSplit(dat, "products", "|", direction = "long")
out[, prod.seq := seq_len(.N), by = value]
#> out
#     order products value prod.seq
#  1:  1000        A    10        1
#  2:  1000        B    10        2
#  3:  2000        B    20        1
#  4:  2000        C    20        2
#  5:  3000        A    30        1
#  6:  3000        C    30        2
#  7:  4000        B     5        1
#  8:  4000        C     5        2
#  9:  4000        D     5        3
# 10:  5000        C    15        1
# 11:  5000        D    15        2
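One caveat: grouping by value works for the sample only because every order has a different value. If two orders shared a value, the counter would keep running across both of them. A minimal sketch that groups by order instead, using made-up rows where orders 1000 and 2500 both have value 10 (these rows are hypothetical, not from the question):

```r
library(data.table)
library(splitstackshape)

# Hypothetical data: two different orders share value = 10
dat <- data.table(order    = c(1000, 2000, 2500),
                  products = c("A|B", "B|C", "C|D"),
                  value    = c(10, 20, 10))

out <- cSplit(dat, "products", "|", direction = "long")

# Number the products within each order, not within each value
out[, prod.seq := seq_len(.N), by = order]
out
```

This assumes order uniquely identifies a row of the original table.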
After the cSplit step, using ddply:
library(plyr)
ddply(out, .(value), mutate, prod.seq = seq_len(length(order)))
Using dplyr:
library(dplyr)
out %>% group_by(value) %>% mutate(prod.seq = row_number())
Using lapply:
rbindlist(lapply(split(out, out$value),
function(x){x$prod.seq = seq_len(length(x$order));x}))
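For completeness, the same reshaping can be sketched in base R without any packages. The name df is hypothetical and the rows are retyped from the question; the idea is to split every product string once, then repeat each order and value as many times as that row has products.

```r
# Sample data retyped from the question (df is a hypothetical name)
df <- data.frame(order    = c(1000, 2000, 3000, 4000, 5000),
                 products = c("A|B", "B|C", "A|C", "B|C|D", "C|D"),
                 value    = c(10, 20, 30, 5, 15),
                 stringsAsFactors = FALSE)

parts <- strsplit(df$products, "|", fixed = TRUE)  # list with one character vector per row
n     <- lengths(parts)                            # number of products in each row

res <- data.frame(order     = rep(df$order, n),
                  prod.seq  = sequence(n),         # 1..n[1], 1..n[2], ...
                  prod.name = unlist(parts),
                  value     = rep(df$value, n),
                  stringsAsFactors = FALSE)
res
```

Because rep() and sequence() are driven by the same counts, the rows stay aligned without any grouping step, so duplicate order or value entries are not a problem here.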