Add new columns and insert values in columns based on value in another column

I have an R data frame data1 as below:

prodID   storeID   Term    Exit
1        1001      5       0
1        1002      4       1
1        1003      3       1
1        1004      5       0
2        1001      4       1
2        1002      3       1
2        1003      5       0
3        1001      4       1
3        1002      3       1
3        1003      5       0
4        1001      4       1
4        1002      3       1
5        1001      5       0
5        1002      4       1
5        1003      3       1

This is of course a highly simplified version of my real data, which runs to around 3 million rows. I have to do the following:

  1. Based on the maximum value in the Term column, insert that many columns into data1, filled with NA. Column names should be Week1, Week2, Week3, etc.
  2. For each row, fill the new columns with 0, 1 or NA using these rules: 1) If Term is 5, insert 0 in Week1 through Week4 and 1 in Week5. 2) If Term is 4, insert 0 in Week1, Week2 and Week3, 1 in Week4, and keep NA in Week5. And so on.

The final output should look like:

prodID   storeID   Term    Exit  Week1   Week2   Week3   Week4   Week5
1        1001      5       0     0       0       0       0       1
1        1002      4       1     0       0       0       1       NA
1        1003      3       1     0       0       1       NA      NA
1        1004      5       0     0       0       0       0       1
2        1001      4       1     0       0       0       1       NA
2        1002      3       1     0       0       1       NA      NA
2        1003      5       0     0       0       0       0       1
3        1001      4       1     0       0       0       1       NA
3        1002      3       1     0       0       1       NA      NA
3        1003      5       0     0       0       0       0       1
4        1001      4       1     0       0       0       1       NA
4        1002      3       1     0       0       1       NA      NA
5        1001      5       0     0       0       0       0       1
5        1002      4       1     0       0       0       1       NA
5        1003      3       1     0       0       1       NA      NA

This is what I tried:

variant <- c("Week1","Week2","Week3","Week4","Week5")

data1[variant] <- NA

for (i in 1:length(data1$prodID)){
  data1$Week1 <- ifelse(data1$Term==1,1,0)
  data1$Week2 <- ifelse(data1$Term==2,1,0)
  data1$Week3 <- ifelse(data1$Term==3,1,0)
  data1$Week4 <- ifelse(data1$Term==4,1,0)
  data1$Week5 <- ifelse(data1$Term==5,1,0)
}

This doesn't help me populate NA in the required cells. I would like to retain the NA values because I am going to do a wide-to-long transformation on the data frame later on. I also know the above approach is not feasible on my huge dataset. Any suggestions are most welcome.
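For reference, the attempt above produces 0 instead of NA because ifelse(data1$Term == k, 1, 0) has only two outcomes, and the loop over rows repeats the same whole-column assignment on every iteration. A minimal sketch of a three-way comparison, looping over the week columns rather than the rows, might look like this (the three-row data1 here is a made-up sample, not the real data):

```r
# Hypothetical three-row sample in the same shape as data1
data1 <- data.frame(prodID  = c(1, 1, 1),
                    storeID = c(1001, 1002, 1003),
                    Term    = c(5, 4, 3),
                    Exit    = c(0, 1, 1))

n_weeks <- max(data1$Term)

# One vectorised pass per week column; no loop over rows is needed
for (k in seq_len(n_weeks)) {
  data1[[paste0("Week", k)]] <- ifelse(data1$Term > k, 0L,            # before the last week
                                ifelse(data1$Term == k, 1L,           # the last week
                                       NA_integer_))                  # after the last week
}
data1
```

The loop runs max(Term) times regardless of the number of rows, so it should scale better than iterating over 3 million rows, although this has not been timed here.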

Here is one idea. We can create the content you need as a comma-separated string and then split it into columns.

library(dplyr)
library(data.table)
library(splitstackshape)

dat2 <- dat %>%
  mutate(Week = case_when(
    Term == 5       ~"0,0,0,0,1",
    Term == 4       ~"0,0,0,1,NA",
    Term == 3       ~"0,0,1,NA,NA",
    Term == 2       ~"0,1,NA,NA,NA",
    Term == 1       ~"1,NA,NA,NA,NA"
  )) %>%
  cSplit(splitCols = "Week")
dat2
#     prodID storeID Term Exit Week_1 Week_2 Week_3 Week_4 Week_5
#  1:      1    1001    5    0      0      0      0      0      1
#  2:      1    1002    4    1      0      0      0      1     NA
#  3:      1    1003    3    1      0      0      1     NA     NA
#  4:      1    1004    5    0      0      0      0      0      1
#  5:      2    1001    4    1      0      0      0      1     NA
#  6:      2    1002    3    1      0      0      1     NA     NA
#  7:      2    1003    5    0      0      0      0      0      1
#  8:      3    1001    4    1      0      0      0      1     NA
#  9:      3    1002    3    1      0      0      1     NA     NA
# 10:      3    1003    5    0      0      0      0      0      1
# 11:      4    1001    4    1      0      0      0      1     NA
# 12:      4    1002    3    1      0      0      1     NA     NA
# 13:      5    1001    5    0      0      0      0      0      1
# 14:      5    1002    4    1      0      0      0      1     NA
# 15:      5    1003    3    1      0      0      1     NA     NA
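If the maximum Term is larger than 5, typing each pattern string by hand becomes tedious. As a rough sketch, the strings could be generated instead; week_string is a hypothetical helper name introduced here, not part of any package:

```r
# Build the comma-separated pattern for one Term value:
# (term - 1) zeros, a single 1, then NA up to n_weeks
week_string <- function(term, n_weeks) {
  paste(c(rep(0, term - 1), 1, rep(NA, n_weeks - term)), collapse = ",")
}

# One pattern per possible Term value, assuming Term runs from 1 to 5
patterns <- vapply(1:5, week_string, character(1), n_weeks = 5)
patterns

# Look up the pattern for each row by indexing with Term
term <- c(5, 4, 3)   # hypothetical Term values
patterns[term]
```

The lookup patterns[dat$Term] could then stand in for the case_when() call, assuming Term only contains whole numbers between 1 and the maximum.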

Or use this tidyverse method. I like this one better than my previous one because it does not require manually typing the week values.

library(dplyr)
library(tidyr)
library(purrr)

dat2 <- dat %>%
  mutate(Week = map2(1, Term, `:`)) %>%
  unnest(Week) %>%
  group_by(prodID, Term) %>%
  mutate(Week_Value = as.integer(Week == max(Week)),
         Week = paste0("Week", Week)) %>%
  spread(Week, Week_Value) %>%
  ungroup()
dat2
# # A tibble: 15 x 9
#    prodID storeID  Term  Exit Week1 Week2 Week3 Week4 Week5
#     <int>   <int> <int> <int> <int> <int> <int> <int> <int>
#  1      1    1001     5     0     0     0     0     0     1
#  2      1    1002     4     1     0     0     0     1    NA
#  3      1    1003     3     1     0     0     1    NA    NA
#  4      1    1004     5     0     0     0     0     0     1
#  5      2    1001     4     1     0     0     0     1    NA
#  6      2    1002     3     1     0     0     1    NA    NA
#  7      2    1003     5     0     0     0     0     0     1
#  8      3    1001     4     1     0     0     0     1    NA
#  9      3    1002     3     1     0     0     1    NA    NA
# 10      3    1003     5     0     0     0     0     0     1
# 11      4    1001     4     1     0     0     0     1    NA
# 12      4    1002     3     1     0     0     1    NA    NA
# 13      5    1001     5     0     0     0     0     0     1
# 14      5    1002     4     1     0     0     0     1    NA
# 15      5    1003     3     1     0     0     1    NA    NA
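The same expand-to-long, flag-the-last-week, reshape-to-wide idea can be sketched in base R, which may help when checking what each step produces. The term vector below is a made-up sample, and the variable names are illustrative only:

```r
term <- c(5, 4, 3)                         # hypothetical Term values

# Long form: one entry per (row, week) pair
row_id <- rep(seq_along(term), term)       # 1 1 1 1 1 2 2 2 2 3 3 3
week   <- sequence(term)                   # 1 2 3 4 5 1 2 3 4 1 2 3
value  <- as.integer(week == term[row_id]) # 1 only in each row's last week

# Wide form: start from an all-NA matrix and fill the known cells
wide <- matrix(NA_integer_, nrow = length(term), ncol = max(term),
               dimnames = list(NULL, paste0("Week", seq_len(max(term)))))
wide[cbind(row_id, week)] <- value
wide
```

Cells that never appear in the long form stay NA, which is the same reason the reshaped tidyverse result has NA after the last week.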

UPDATE

We can use str_pad from the stringr package to zero-pad the week number before spreading the Week column, so that the resulting column names sort in numeric order even when there are 10 or more weeks.

library(tidyverse)

dat2 <- dat %>%
  mutate(Week = map2(1, Term, `:`)) %>%
  unnest(Week) %>%
  group_by(prodID, Term) %>%
  mutate(Week_Value = as.integer(Week == max(Week)),
         Week = paste0("Week", str_pad(Week, width = 3, pad = "0"))) %>%
  spread(Week, Week_Value) %>%
  ungroup()
dat2
# # A tibble: 15 x 9
#   prodID storeID  Term  Exit Week001 Week002 Week003 Week004 Week005
#     <int>   <int> <int> <int>   <int>   <int>   <int>   <int>   <int>
#  1      1    1001     5     0       0       0       0       0       1
#  2      1    1002     4     1       0       0       0       1      NA
#  3      1    1003     3     1       0       0       1      NA      NA
#  4      1    1004     5     0       0       0       0       0       1
#  5      2    1001     4     1       0       0       0       1      NA
#  6      2    1002     3     1       0       0       1      NA      NA
#  7      2    1003     5     0       0       0       0       0       1
#  8      3    1001     4     1       0       0       0       1      NA
#  9      3    1002     3     1       0       0       1      NA      NA
# 10      3    1003     5     0       0       0       0       0       1
# 11      4    1001     4     1       0       0       0       1      NA
# 12      4    1002     3     1       0       0       1      NA      NA
# 13      5    1001     5     0       0       0       0       0       1
# 14      5    1002     4     1       0       0       0       1      NA
# 15      5    1003     3     1       0       0       1      NA      NA
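The padding only matters once the week numbers reach two digits. A small base R illustration of the ordering problem, using sprintf in place of str_pad purely to avoid a package dependency:

```r
weeks  <- 1:12
plain  <- paste0("Week", weeks)        # Week1, Week2, ..., Week12
padded <- sprintf("Week%03d", weeks)   # Week001, Week002, ..., Week012

sort(plain)    # character sort puts Week10 ahead of Week2
sort(padded)   # zero-padded names keep their numeric order
```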

DATA

dat <- read.table(text = "prodID   storeID   Term    Exit
1        1001      5       0
                  1        1002      4       1
                  1        1003      3       1
                  1        1004      5       0
                  2        1001      4       1
                  2        1002      3       1
                  2        1003      5       0
                  3        1001      4       1
                  3        1002      3       1
                  3        1003      5       0
                  4        1001      4       1
                  4        1002      3       1
                  5        1001      5       0
                  5        1002      4       1
                  5        1003      3       1",
                  header = TRUE)

Here is one option with base R where we loop through 'Term', use tabulate to get the 0s and the 1 for each element, append NA at the end with length<-, and rbind the list elements to create the columns of interest.

dat[paste0("Week", 1:5)] <- do.call(rbind, lapply(dat$Term,
                  function(x) `length<-`(tabulate(x), max(dat$Term))))
dat
#   prodID storeID Term Exit Week1 Week2 Week3 Week4 Week5
#1       1    1001    5    0     0     0     0     0     1
#2       1    1002    4    1     0     0     0     1    NA
#3       1    1003    3    1     0     0     1    NA    NA
#4       1    1004    5    0     0     0     0     0     1
#5       2    1001    4    1     0     0     0     1    NA
#6       2    1002    3    1     0     0     1    NA    NA
#7       2    1003    5    0     0     0     0     0     1
#8       3    1001    4    1     0     0     0     1    NA
#9       3    1002    3    1     0     0     1    NA    NA
#10      3    1003    5    0     0     0     0     0     1
#11      4    1001    4    1     0     0     0     1    NA
#12      4    1002    3    1     0     0     1    NA    NA
#13      5    1001    5    0     0     0     0     0     1
#14      5    1002    4    1     0     0     0     1    NA
#15      5    1003    3    1     0     0     1    NA    NA
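To see what the two building blocks do on their own, here is a small sketch with a made-up term vector. It also derives the number of week columns from the data instead of hard-coding 1:5:

```r
# tabulate() on a single value k gives k - 1 zeros followed by a 1
tabulate(4)

# `length<-` extends a vector, padding the new positions with NA
`length<-`(tabulate(3), 5)

# Putting it together for a hypothetical Term vector
term    <- c(5, 4, 3)
n_weeks <- max(term)
weeks   <- do.call(rbind,
                   lapply(term, function(x) `length<-`(tabulate(x), n_weeks)))
colnames(weeks) <- paste0("Week", seq_len(n_weeks))
weeks
```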

Or using a similar approach with tidyverse:

library(tidyverse)
dat %>% 
  mutate(Week = map(Term, ~ 
                            tabulate(.x) %>% 
                            as.list %>% 
                            set_names(paste0("Week", seq_along(.))) %>% 
                            as_tibble)) %>% 
  unnest(Week)
#   prodID storeID Term Exit Week1 Week2 Week3 Week4 Week5
#1       1    1001    5    0     0     0     0     0     1
#2       1    1002    4    1     0     0     0     1    NA
#3       1    1003    3    1     0     0     1    NA    NA
#4       1    1004    5    0     0     0     0     0     1
#5       2    1001    4    1     0     0     0     1    NA
#6       2    1002    3    1     0     0     1    NA    NA
#7       2    1003    5    0     0     0     0     0     1
#8       3    1001    4    1     0     0     0     1    NA
#9       3    1002    3    1     0     0     1    NA    NA
#10      3    1003    5    0     0     0     0     0     1
#11      4    1001    4    1     0     0     0     1    NA
#12      4    1002    3    1     0     0     1    NA    NA
#13      5    1001    5    0     0     0     0     0     1
#14      5    1002    4    1     0     0     0     1    NA
#15      5    1003    3    1     0     0     1    NA    NA

An option using dplyr::mutate_at and case_when: extract the week number from each column name using quo_name(quo(.)), then check whether that number is less than, equal to, or greater than the value of Term.

# First add additional columns based on maximum value of Term
df[,paste("Week", 1:max(df$Term), sep="")] <- NA

library(dplyr)

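# Note: funs() is deprecated as of dplyr 0.8.0; this code targets older versions.
# "\\D*" (non-digits) is used instead of a greedy ".*", which would keep only
# the last digit of a name such as Week10.
df %>% mutate_at(vars(starts_with("Week")), funs(case_when(
  as.integer(sub("\\D*(\\d+)","\\1",quo_name(quo(.)))) < Term ~ 0L,
  as.integer(sub("\\D*(\\d+)","\\1",quo_name(quo(.)))) == Term ~ 1L,
  TRUE                                                      ~ NA_integer_
)))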

#    prodID storeID Term Exit Week1 Week2 Week3 Week4 Week5
# 1       1    1001    5    0     0     0     0     0     1
# 2       1    1002    4    1     0     0     0     1    NA
# 3       1    1003    3    1     0     0     1    NA    NA
# 4       1    1004    5    0     0     0     0     0     1
# 5       2    1001    4    1     0     0     0     1    NA
# 6       2    1002    3    1     0     0     1    NA    NA
# 7       2    1003    5    0     0     0     0     0     1
# 8       3    1001    4    1     0     0     0     1    NA
# 9       3    1002    3    1     0     0     1    NA    NA
# 10      3    1003    5    0     0     0     0     0     1
# 11      4    1001    4    1     0     0     0     1    NA
# 12      4    1002    3    1     0     0     1    NA    NA
# 13      5    1001    5    0     0     0     0     0     1
# 14      5    1002    4    1     0     0     0     1    NA
# 15      5    1003    3    1     0     0     1    NA    NA
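When pulling the week number out of a column name, the choice of pattern matters once there are 10 or more weeks. A short base R check on some made-up column names shows the difference between a greedy `.*` prefix and a non-digit prefix:

```r
cols <- c("Week1", "Week9", "Week10", "Week12")

# Greedy ".*" swallows everything except the final digit
greedy <- as.integer(sub(".*(\\d+)", "\\1", cols))

# "\\D*" stops at the first digit, so the whole number is captured
whole <- as.integer(sub("\\D*(\\d+)", "\\1", cols))

greedy
whole
```

With only Week1 to Week5, as in the sample data, both patterns give the same result.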

Data:

df <- read.table(text="
prodID   storeID   Term    Exit
1        1001      5       0
1        1002      4       1
1        1003      3       1
1        1004      5       0
2        1001      4       1
2        1002      3       1
2        1003      5       0
3        1001      4       1
3        1002      3       1
3        1003      5       0
4        1001      4       1
4        1002      3       1
5        1001      5       0
5        1002      4       1
5        1003      3       1",
header = TRUE, stringsAsFactors = FALSE)
