How to normalize a multi-value column in a data.table in R

Question

I have a data.table like this:

order   products   value
1000    A|B        10
2000    B|C        20
3000    A|C        30
4000    B|C|D      5
5000    C|D        15
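
To reproduce the example, the table can be rebuilt roughly as follows. This is only a sketch: the name `DT` and the choice to store `products` as a plain character column are assumptions, and the original data may have been loaded differently (for instance with `products` as a factor).

```r
library(data.table)

# Sample data from the question; the object name DT is an assumption
DT <- data.table(
  order    = c(1000, 2000, 3000, 4000, 5000),
  products = c("A|B", "B|C", "A|C", "B|C|D", "C|D"),
  value    = c(10, 20, 30, 5, 15)
)
DT
```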

I need to split the products column and normalize the table so that it looks like this:

order   prod.seq   prod.name   value
1000    1          A           10
1000    2          B           10
2000    1          B           20
2000    2          C           20
3000    1          A           30
3000    2          C           30
4000    1          B           5
4000    2          C           5
4000    3          D           5
5000    1          C           15
5000    2          D           15

I guess I could do this with a custom for loop, but I'd like to know a more idiomatic way, for example with apply or ddply. Any suggestions?

Answer 1

First, convert products to character (strsplit() needs a character vector, so this matters if the column is stored as a factor):

DT[,products:=as.character(products)]
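
To see why the conversion can matter, here is a small base-R check. It is a sketch with a made-up factor `f` standing in for a `products` column that was read in as a factor; if the column is already character, the conversion is simply a no-op.

```r
# f is a hypothetical stand-in for a products column stored as a factor
f <- factor(c("A|B", "B|C|D"))

# strsplit() only accepts character input, so calling it on a factor raises an error
res <- tryCatch(strsplit(f, "|", fixed = TRUE), error = function(e) "error")

# After as.character() the split works
parts <- strsplit(as.character(f), "|", fixed = TRUE)
lengths(parts)
```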

Then you can split the string:

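DT[,{
  # split this row's products string on the literal "|"
  x = strsplit(products,"\\|")[[1]]
  # one output row per product, numbered within the order
  list( prod.seq = seq_along(x), prod_name = x )
}, by=.(order,value)]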

which gives

    order value prod.seq prod_name
 1:  1000    10        1         A
 2:  1000    10        2         B
 3:  2000    20        1         B
 4:  2000    20        2         C
 5:  3000    30        1         A
 6:  3000    30        2         C
 7:  4000     5        1         B
 8:  4000     5        2         C
 9:  4000     5        3         D
10:  5000    15        1         C
11:  5000    15        2         D
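
For a result that matches the column names and order asked for in the question, the same approach can be sketched end to end as below. The sample data is rebuilt here so the block runs on its own; `res` is just an illustrative name, and the `[[1]]` assumes each (order, value) pair occurs on exactly one row.

```r
library(data.table)

# Sample data from the question
DT <- data.table(
  order    = c(1000, 2000, 3000, 4000, 5000),
  products = c("A|B", "B|C", "A|C", "B|C|D", "C|D"),
  value    = c(10, 20, 30, 5, 15)
)

# Split each products string; grouping by (order, value) carries both columns along
res <- DT[, {
  x <- strsplit(products, "|", fixed = TRUE)[[1]]
  list(prod.seq = seq_along(x), prod.name = x)
}, by = .(order, value)]

# Reorder the columns to the layout requested in the question
setcolorder(res, c("order", "prod.seq", "prod.name", "value"))
res
```

Using `fixed = TRUE` treats `|` as a literal character, which avoids the regex escaping in `"\\|"`.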

Answer 2

Here is another option:

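library(splitstackshape)
# one row per product; cSplit returns a data.table
out = cSplit(dat, "products", "|", direction = "long")
# number the products within each order (value is not a safe key, since two orders can share it)
out[, prod.seq := seq_len(.N), by = order]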

#> out
#    order products value prod.seq
# 1:  1000        A    10        1
# 2:  1000        B    10        2
# 3:  2000        B    20        1
# 4:  2000        C    20        2
# 5:  3000        A    30        1
# 6:  3000        C    30        2
# 7:  4000        B     5        1
# 8:  4000        C     5        2
# 9:  4000        D     5        3
#10:  5000        C    15        1
#11:  5000        D    15        2
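
One caveat on the grouping key: numbering within `value` only gives the right sequence as long as no two orders share the same value, whereas numbering within `order` does not depend on that. A minimal sketch with hypothetical long-format data in which orders 1000 and 2000 both have value 10:

```r
library(data.table)

# Hypothetical long-format data: two different orders share value 10
out <- data.table(
  order    = c(1000, 1000, 2000, 2000),
  products = c("A", "B", "B", "C"),
  value    = c(10, 10, 10, 10)
)

# Numbering within value runs straight across both orders
by_value <- copy(out)[, prod.seq := seq_len(.N), by = .(value)]

# Numbering within order restarts for each order
by_order <- copy(out)[, prod.seq := seq_len(.N), by = .(order)]

by_value$prod.seq  # 1 2 3 4
by_order$prod.seq  # 1 2 1 2
```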

After the cSplit step, the same numbering can be added with ddply:

library(plyr)
ddply(out, .(order), mutate, prod.seq = seq_along(products))

Using dplyr:

library(dplyr)
out %>% group_by(order) %>% mutate(prod.seq = row_number())

Using lapply:

rbindlist(lapply(split(out, out$order),
                 function(x){x$prod.seq = seq_len(nrow(x)); x}))
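
For comparison, the whole reshaping can also be sketched in base R without any grouping step, by repeating each row once per product and letting `sequence()` generate the per-order counters. This is an illustrative alternative rather than part of the answer above; `dat` is rebuilt as a plain data.frame, and `parts`, `n` and `res` are names chosen for the example.

```r
# Sample data as a plain data.frame
dat <- data.frame(
  order    = c(1000, 2000, 3000, 4000, 5000),
  products = c("A|B", "B|C", "A|C", "B|C|D", "C|D"),
  value    = c(10, 20, 30, 5, 15),
  stringsAsFactors = FALSE
)

parts <- strsplit(dat$products, "|", fixed = TRUE)  # one character vector per row
n     <- lengths(parts)                              # number of products per order

res <- data.frame(
  order     = rep(dat$order, times = n),  # repeat each order once per product
  prod.seq  = sequence(n),                # counter restarting at 1 for each order
  prod.name = unlist(parts),
  value     = rep(dat$value, times = n),
  stringsAsFactors = FALSE
)
res
```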


Answer 1: 5, accepted, 2015-07-15 18:24:13

Answer 2: 3, 2015-07-15 18:50:02
