Add new columns and insert values in columns based on value in another column
I have an R data frame, data1, as below:
prodID storeID Term Exit
1 1001 5 0
1 1002 4 1
1 1003 3 1
1 1004 5 0
2 1001 4 1
2 1002 3 1
2 1003 5 0
3 1001 4 1
3 1002 3 1
3 1003 5 0
4 1001 4 1
4 1002 3 1
5 1001 5 0
5 1002 4 1
5 1003 3 1
This is of course a highly simplified version of my real data, which runs to around 3 million rows. I have to do the following:
First, find the maximum value in the Term column and insert that many columns into data1 with NA values. The column names should be Week1, Week2, Week3, etc.
Then, replace the NA values using these rules: 1) If Term is 5, insert 0 in Week1, Week2, up to Week4, and 1 in Week5. 2) If Term is 4, insert 0 in Week1, Week2, and Week3, 1 in Week4, and keep NA in Week5. And so on.
The final output should look like:
prodID storeID Term Exit Week1 Week2 Week3 Week4 Week5
1 1001 5 0 0 0 0 0 1
1 1002 4 1 0 0 0 1 NA
1 1003 3 1 0 0 1 NA NA
1 1004 5 0 0 0 0 0 1
2 1001 4 1 0 0 0 1 NA
2 1002 3 1 0 0 1 NA NA
2 1003 5 0 0 0 0 0 1
3 1001 4 1 0 0 0 1 NA
3 1002 3 1 0 0 1 NA NA
3 1003 5 0 0 0 0 0 1
4 1001 4 1 0 0 0 1 NA
4 1002 3 1 0 0 1 NA NA
5 1001 5 0 0 0 0 0 1
5 1002 4 1 0 0 0 1 NA
5 1003 3 1 0 0 1 NA NA
This is what I tried:
variant <- c("Week1","Week2","Week3","Week4","Week5")
data1[variant] <- NA
for (i in 1:length(data1$prodID)){
data1$Week1 <- ifelse(data1$Term==1,1,0)
data1$Week2 <- ifelse(data1$Term==2,1,0)
data1$Week3 <- ifelse(data1$Term==3,1,0)
data1$Week4 <- ifelse(data1$Term==4,1,0)
data1$Week5 <- ifelse(data1$Term==5,1,0)
}
This doesn't populate NA in the required cells. I would like to retain the NA values because I am going to do a wide-to-long transformation on the data frame later on. I also know the above approach is not feasible for my huge dataset. Any suggestions are most welcome.
Here is one idea. We can create the content you need as a comma-separated string and then split it into columns.
library(dplyr)
library(data.table)
library(splitstackshape)
dat2 <- dat %>%
mutate(Week = case_when(
Term == 5 ~"0,0,0,0,1",
Term == 4 ~"0,0,0,1,NA",
Term == 3 ~"0,0,1,NA,NA",
Term == 2 ~"0,1,NA,NA,NA",
Term == 1 ~"1,NA,NA,NA,NA"
)) %>%
cSplit(splitCols = "Week")
dat2
# prodID storeID Term Exit Week_1 Week_2 Week_3 Week_4 Week_5
# 1: 1 1001 5 0 0 0 0 0 1
# 2: 1 1002 4 1 0 0 0 1 NA
# 3: 1 1003 3 1 0 0 1 NA NA
# 4: 1 1004 5 0 0 0 0 0 1
# 5: 2 1001 4 1 0 0 0 1 NA
# 6: 2 1002 3 1 0 0 1 NA NA
# 7: 2 1003 5 0 0 0 0 0 1
# 8: 3 1001 4 1 0 0 0 1 NA
# 9: 3 1002 3 1 0 0 1 NA NA
# 10: 3 1003 5 0 0 0 0 0 1
# 11: 4 1001 4 1 0 0 0 1 NA
# 12: 4 1002 3 1 0 0 1 NA NA
# 13: 5 1001 5 0 0 0 0 0 1
# 14: 5 1002 4 1 0 0 0 1 NA
# 15: 5 1003 3 1 0 0 1 NA NA
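The strings in the case_when call follow a regular pattern (Term - 1 zeros, then a 1, then NA up to the maximum term), so they could also be generated rather than typed. The following is a minimal base R sketch of that idea; the helper name make_week_string and its max_term argument are made up for illustration and do not come from any package.

```r
# Hypothetical helper: build the comma-separated week string for one Term value.
make_week_string <- function(term, max_term) {
  vals <- c(rep("0", term - 1),          # weeks before the term ends
            "1",                         # the week the term ends
            rep("NA", max_term - term))  # weeks after the term
  paste(vals, collapse = ",")
}

# One string per possible Term value, assuming a maximum Term of 5.
week_strings <- vapply(1:5, make_week_string, character(1), max_term = 5)
week_strings
# [1] "1,NA,NA,NA,NA" "0,1,NA,NA,NA"  "0,0,1,NA,NA"   "0,0,0,1,NA"    "0,0,0,0,1"
```

Indexing this vector by Term (for example week_strings[dat$Term]) should give the same Week column that the case_when call builds by hand.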
Or use this tidyverse method. I like it better than my previous one because it does not require manually typing the week values.
library(dplyr)
library(tidyr)
library(purrr)
dat2 <- dat %>%
mutate(Week = map2(1, Term, `:`)) %>%
unnest(Week) %>%
group_by(prodID, Term) %>%
mutate(Week_Value = as.integer(Week == max(Week)),
Week = paste0("Week", Week)) %>%
spread(Week, Week_Value) %>%
ungroup()
dat2
# # A tibble: 15 x 9
# prodID storeID Term Exit Week1 Week2 Week3 Week4 Week5
# <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 1 1001 5 0 0 0 0 0 1
# 2 1 1002 4 1 0 0 0 1 NA
# 3 1 1003 3 1 0 0 1 NA NA
# 4 1 1004 5 0 0 0 0 0 1
# 5 2 1001 4 1 0 0 0 1 NA
# 6 2 1002 3 1 0 0 1 NA NA
# 7 2 1003 5 0 0 0 0 0 1
# 8 3 1001 4 1 0 0 0 1 NA
# 9 3 1002 3 1 0 0 1 NA NA
# 10 3 1003 5 0 0 0 0 0 1
# 11 4 1001 4 1 0 0 0 1 NA
# 12 4 1002 3 1 0 0 1 NA NA
# 13 5 1001 5 0 0 0 0 0 1
# 14 5 1002 4 1 0 0 0 1 NA
# 15 5 1003 3 1 0 0 1 NA NA
UPDATE
We can use str_pad from the stringr package to pad the week number with zeros before spreading the Week column, so that the column names sort in the right order.
library(tidyverse)
dat2 <- dat %>%
mutate(Week = map2(1, Term, `:`)) %>%
unnest(Week) %>%
group_by(prodID, Term) %>%
mutate(Week_Value = as.integer(Week == max(Week)),
Week = paste0("Week", str_pad(Week, width = 3, pad = "0"))) %>%
spread(Week, Week_Value) %>%
ungroup()
dat2
# # A tibble: 15 x 9
# prodID storeID Term Exit Week001 Week002 Week003 Week004 Week005
# <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 1 1001 5 0 0 0 0 0 1
# 2 1 1002 4 1 0 0 0 1 NA
# 3 1 1003 3 1 0 0 1 NA NA
# 4 1 1004 5 0 0 0 0 0 1
# 5 2 1001 4 1 0 0 0 1 NA
# 6 2 1002 3 1 0 0 1 NA NA
# 7 2 1003 5 0 0 0 0 0 1
# 8 3 1001 4 1 0 0 0 1 NA
# 9 3 1002 3 1 0 0 1 NA NA
# 10 3 1003 5 0 0 0 0 0 1
# 11 4 1001 4 1 0 0 0 1 NA
# 12 4 1002 3 1 0 0 1 NA NA
# 13 5 1001 5 0 0 0 0 0 1
# 14 5 1002 4 1 0 0 0 1 NA
# 15 5 1003 3 1 0 0 1 NA NA
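The padding only makes a difference once there are ten or more weeks, which the five-week example data does not show. A small sketch of the effect, using base R sprintf as a stand-in for str_pad (the week numbers below are invented for the demonstration):

```r
# Illustrative week numbers that go past 9.
weeks <- c(1L, 2L, 10L, 11L)

plain  <- paste0("Week", weeks)       # "Week1" "Week2" "Week10" "Week11"
padded <- sprintf("Week%03d", weeks)  # zero-padded to width 3

sort(plain)
# [1] "Week1"  "Week10" "Week11" "Week2"
sort(padded)
# [1] "Week001" "Week002" "Week010" "Week011"
```

Without padding, the names are ordered as text, so Week10 lands before Week2; with padding, text order and numeric order agree.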
DATA
dat <- read.table(text = "prodID storeID Term Exit
1 1001 5 0
1 1002 4 1
1 1003 3 1
1 1004 5 0
2 1001 4 1
2 1002 3 1
2 1003 5 0
3 1001 4 1
3 1002 3 1
3 1003 5 0
4 1001 4 1
4 1002 3 1
5 1001 5 0
5 1002 4 1
5 1003 3 1",
header = TRUE)
Here is one option with base R, where we loop through 'Term', use tabulate to get 0s and a 1 for each element, append NA at the end with `length<-`, and rbind the list elements to create the columns of interest.
dat[paste0("Week", 1:5)] <- do.call(rbind, lapply(dat$Term,
function(x) `length<-`(tabulate(x), max(dat$Term))))
dat
# prodID storeID Term Exit Week1 Week2 Week3 Week4 Week5
#1 1 1001 5 0 0 0 0 0 1
#2 1 1002 4 1 0 0 0 1 NA
#3 1 1003 3 1 0 0 1 NA NA
#4 1 1004 5 0 0 0 0 0 1
#5 2 1001 4 1 0 0 0 1 NA
#6 2 1002 3 1 0 0 1 NA NA
#7 2 1003 5 0 0 0 0 0 1
#8 3 1001 4 1 0 0 0 1 NA
#9 3 1002 3 1 0 0 1 NA NA
#10 3 1003 5 0 0 0 0 0 1
#11 4 1001 4 1 0 0 0 1 NA
#12 4 1002 3 1 0 0 1 NA NA
#13 5 1001 5 0 0 0 0 0 1
#14 5 1002 4 1 0 0 0 1 NA
#15 5 1003 3 1 0 0 1 NA NA
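To see what each piece of that one-liner does, here is a small sketch for a single Term value of 3 with a maximum of 5 (both values chosen only for illustration):

```r
# tabulate() counts how often each integer 1..max(x) occurs, so a single
# value of 3 gives zero counts for 1 and 2 and a count of one for 3.
counts <- tabulate(3)
counts
# [1] 0 0 1

# Assigning a longer length pads the vector with NA at the end.
padded <- counts
length(padded) <- 5
padded
# [1]  0  0  1 NA NA

# `length<-` called as a function does the same thing in one expression.
same <- `length<-`(tabulate(3), 5)
identical(padded, same)
# [1] TRUE
```

Each row's padded vector becomes one row of the matrix that rbind assembles.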
Or use a similar approach with tidyverse:
library(tidyverse)
dat %>%
mutate(Week = map(Term, ~
tabulate(.x) %>%
as.list %>%
set_names(paste0("Week", seq_along(.))) %>%
as_tibble)) %>%
unnest(Week)
# prodID storeID Term Exit Week1 Week2 Week3 Week4 Week5
#1 1 1001 5 0 0 0 0 0 1
#2 1 1002 4 1 0 0 0 1 NA
#3 1 1003 3 1 0 0 1 NA NA
#4 1 1004 5 0 0 0 0 0 1
#5 2 1001 4 1 0 0 0 1 NA
#6 2 1002 3 1 0 0 1 NA NA
#7 2 1003 5 0 0 0 0 0 1
#8 3 1001 4 1 0 0 0 1 NA
#9 3 1002 3 1 0 0 1 NA NA
#10 3 1003 5 0 0 0 0 0 1
#11 4 1001 4 1 0 0 0 1 NA
#12 4 1002 3 1 0 0 1 NA NA
#13 5 1001 5 0 0 0 0 0 1
#14 5 1002 4 1 0 0 0 1 NA
#15 5 1003 3 1 0 0 1 NA NA
An option using dplyr::mutate_at and case_when can be based on extracting the week number from the column name using quo_name(quo(.)) and then checking whether that number is less than, equal to, or greater than the value of Term.
# First add additional columns based on maximum value of Term
df[,paste("Week", 1:max(df$Term), sep="")] <- NA
library(dplyr)
df %>% mutate_at(vars(starts_with("Week")), funs(case_when(
as.integer(sub("\\D*(\\d+)","\\1",quo_name(quo(.)))) < Term ~ 0L,
as.integer(sub("\\D*(\\d+)","\\1",quo_name(quo(.)))) == Term ~ 1L,
TRUE ~ NA_integer_
)))
# prodID storeID Term Exit Week1 Week2 Week3 Week4 Week5
# 1 1 1001 5 0 0 0 0 0 1
# 2 1 1002 4 1 0 0 0 1 NA
# 3 1 1003 3 1 0 0 1 NA NA
# 4 1 1004 5 0 0 0 0 0 1
# 5 2 1001 4 1 0 0 0 1 NA
# 6 2 1002 3 1 0 0 1 NA NA
# 7 2 1003 5 0 0 0 0 0 1
# 8 3 1001 4 1 0 0 0 1 NA
# 9 3 1002 3 1 0 0 1 NA NA
# 10 3 1003 5 0 0 0 0 0 1
# 11 4 1001 4 1 0 0 0 1 NA
# 12 4 1002 3 1 0 0 1 NA NA
# 13 5 1001 5 0 0 0 0 0 1
# 14 5 1002 4 1 0 0 0 1 NA
# 15 5 1003 3 1 0 0 1 NA NA
Data:
df <- read.table(text="
prodID storeID Term Exit
1 1001 5 0
1 1002 4 1
1 1003 3 1
1 1004 5 0
2 1001 4 1
2 1002 3 1
2 1003 5 0
3 1001 4 1
3 1002 3 1
3 1003 5 0
4 1001 4 1
4 1002 3 1
5 1001 5 0
5 1002 4 1
5 1003 3 1",
header = TRUE, stringsAsFactors = FALSE)
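The same comparison (week number less than, equal to, or greater than Term) can also be written as a single vectorized step in base R, which may be worth trying on a data set with millions of rows. This is only a sketch of an alternative and not the method used in the answer above; the three-row df is a cut-down copy of the example data, and the names max_term and weeks are illustrative.

```r
# Cut-down copy of the example data.
df <- data.frame(prodID  = c(1L, 1L, 1L),
                 storeID = c(1001L, 1002L, 1003L),
                 Term    = c(5L, 4L, 3L),
                 Exit    = c(0L, 1L, 1L))

max_term <- max(df$Term)

# outer() pairs every Term value with every week number and returns a
# matrix with one row per data row and one column per week.
weeks <- outer(df$Term, seq_len(max_term), function(term, week) {
  ifelse(week < term, 0L,
         ifelse(week == term, 1L, NA_integer_))
})
colnames(weeks) <- paste0("Week", seq_len(max_term))

result <- cbind(df, weeks)
result
#   prodID storeID Term Exit Week1 Week2 Week3 Week4 Week5
# 1      1    1001    5    0     0     0     0     0     1
# 2      1    1002    4    1     0     0     0     1    NA
# 3      1    1003    3    1     0     0     1    NA    NA
```

Because the number of week columns comes from max(df$Term), nothing needs to be hard-coded when the real data has more than five weeks.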