
Create columns in R data.table based on sub-strings of existing column

I'm trying to use the following R data.table to create multiple columns out of the "Ref" field:

library(data.table)
(dt= data.table(Ref = c("R", "STOP", "STOP_TS", "P", "M", "STOP_P_R"),
               Qty= c(2,4,6,8,10,12)))

The new columns should be based on single refs only (e.g. "STOP" and "TS") as opposed to combined refs (e.g. "STOP_TS"). Single refs are identified by splitting on the "_" separator. For each row, the column of every single ref contained in "Ref" should take the value of the "Qty" field; all other new columns should be zero. The desired output should look like this:

#Desired Output
(desired = data.table(
  Ref  = c("R", "STOP", "STOP_TS", "P", "M", "STOP_P_R"),
  Qty  = c(2, 4, 6, 8, 10, 12),
  R    = c(2, 0, 0, 0, 0, 12),
  STOP = c(0, 4, 6, 0, 0, 12),
  TS   = c(0, 0, 6, 0, 0, 0),
  P    = c(0, 0, 0, 8, 0, 12),
  M    = c(0, 0, 0, 0, 10, 0)))

The problem with my approach is that the regex wrongly matches "P" inside "STOP", because the pattern does not require a match on a complete 'word' (a whole ref between "_" separators).

library(foreach)
library(data.table)
ref<-unlist(unique(dt$Ref)) #extract unique combined ref
ref2<-strsplit(ref, "_")    #split ref by using "_"
ref3<-unique(unlist(ref2))  #extract unique single ref (columns to create)

dt2<-foreach(i=1:length(ref3), .combine='cbind')%do%{
  eval(parse(text=paste0("tmp<-ifelse( grepl(ref3[i], dt$Ref), dt$Qty,0)")))
  data.table(tmp)
}
names(dt2)<-ref3
(dt3=cbind(dt,dt2))

As a way to check, the sum of column "P" should be 20 (8 for Ref="P" and 12 for Ref="STOP_P_R").
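The check above can be sketched as follows. This is only an illustration of the mismatch, not part of the original attempt: the variable names `loose` and `strict` and the anchored pattern `(^|_)P(_|$)` are assumptions introduced here. Note that `\\b` would not help, since "_" counts as a word character in regex.

```r
library(data.table)

dt <- data.table(Ref = c("R", "STOP", "STOP_TS", "P", "M", "STOP_P_R"),
                 Qty = c(2, 4, 6, 8, 10, 12))

# Unanchored pattern: "P" is also found inside "STOP" and "STOP_TS"
loose <- ifelse(grepl("P", dt$Ref), dt$Qty, 0)
sum(loose)   # 30

# Anchored pattern: "P" must be delimited by the string start/end or by "_"
strict <- ifelse(grepl("(^|_)P(_|$)", dt$Ref), dt$Qty, 0)
sum(strict)  # 20
```

Building such a pattern for every ref works for plain alphanumeric refs; refs containing regex metacharacters would need escaping first.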

I'd appreciate any comments or suggestions on this.


One option is to split the column with separate_rows, reshape the result to wide format with pivot_wider, and bind it to the original dataset with bind_cols:

library(dplyr)
library(tidyr)
dt %>% 
   mutate(rn = row_number()) %>% 
   separate_rows(Ref) %>% 
   pivot_wider(names_from = Ref, values_from = Qty, 
       values_fill = list(Qty = 0)) %>%
   select(-rn) %>%
   bind_cols(dt, .)
#        Ref Qty  R STOP TS  P  M
#1:        R   2  2    0  0  0  0
#2:     STOP   4  0    4  0  0  0
#3:  STOP_TS   6  0    6  6  0  0
#4:        P   8  0    0  0  8  0
#5:        M  10  0    0  0  0 10
#6: STOP_P_R  12 12   12  0 12  0

Or split to long format with cSplit from splitstackshape and reshape with dcast from data.table:

library(splitstackshape)
library(data.table)
# copy() keeps `:=` from adding the helper column rn to dt itself
cbind(dt, dcast(cSplit(copy(dt)[, rn := seq_len(.N)], 'Ref', '_', "long"), 
      rn ~ Ref, value.var = 'Qty', fill = 0)[, rn := NULL])
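The same split-to-long-then-dcast idea can also be sketched with data.table alone, using base strsplit instead of cSplit. This is a minimal sketch, not the answer's own code: the names `parts`, `long`, `wide` and `res` are introduced here for illustration, and it assumes no ref is repeated within a single row.

```r
library(data.table)

dt <- data.table(Ref = c("R", "STOP", "STOP_TS", "P", "M", "STOP_P_R"),
                 Qty = c(2, 4, 6, 8, 10, 12))

# Split each combined ref into its single refs
parts <- strsplit(dt$Ref, "_", fixed = TRUE)

# Long format: one row per (original row, single ref) pair
long <- data.table(
  rn  = rep(seq_along(parts), lengths(parts)),
  Ref = factor(unlist(parts), levels = unique(unlist(parts))),
  Qty = rep(dt$Qty, lengths(parts))
)

# Wide format: one column per single ref, 0 where the ref is absent
wide <- dcast(long, rn ~ Ref, value.var = "Qty", fill = 0)

# Drop the helper id and attach the new columns to the original table
res <- cbind(dt, wide[, !"rn"])
res
```

Because the split produces exact tokens, "P" is never matched inside "STOP", and dt itself is left unmodified.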

We can use cSplit_e from splitstackshape to split on "_" and get one binary indicator column per single ref. We can then replace every 1 with the corresponding Qty value. Note that the new columns are prefixed with "Ref_" and sorted alphabetically.

data <- data.frame(splitstackshape::cSplit_e(dt, "Ref", sep = "_", 
                   type = "character", fill = 0))
cols <- grep('Ref_', names(data))
mat <- which(data[cols] == 1, arr.ind = TRUE)
data[cols][mat] <- data$Qty[mat[, 1]]
data

#       Ref Qty Ref_M Ref_P Ref_R Ref_STOP Ref_TS
#1        R   2     0     0     2        0      0
#2     STOP   4     0     0     0        4      0
#3  STOP_TS   6     0     0     0        6      6
#4        P   8     0     8     0        0      0
#5        M  10    10     0     0        0      0
#6 STOP_P_R  12     0    12    12       12      0
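The indicator idea can also be sketched without splitstackshape, staying close to the loop in the question but testing whole tokens with %in% instead of grepl. This is a hedged sketch: `parts`, `refs`, `ref_name` and `hit` are names introduced here for illustration, and the loop adds columns to dt by reference.

```r
library(data.table)

dt <- data.table(Ref = c("R", "STOP", "STOP_TS", "P", "M", "STOP_P_R"),
                 Qty = c(2, 4, 6, 8, 10, 12))

parts <- strsplit(dt$Ref, "_", fixed = TRUE)  # single refs per row
refs  <- unique(unlist(parts))                # columns to create

for (ref_name in refs) {
  # TRUE where the row contains ref_name as a complete token
  hit <- vapply(parts, function(p) ref_name %in% p, logical(1))
  dt[, (ref_name) := ifelse(hit, Qty, 0)]
}
dt
```

The columns come out unprefixed and in order of first appearance (R, STOP, TS, P, M), which matches the layout asked for in the question.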
