Add new columns and insert values in columns based on value in another column
I have an R data frame, data1, as shown below:
prodID storeID Term Exit
1 1001 5 0
1 1002 4 1
1 1003 3 1
1 1004 5 0
2 1001 4 1
2 1002 3 1
2 1003 5 0
3 1001 4 1
3 1002 3 1
3 1003 5 0
4 1001 4 1
4 1002 3 1
5 1001 5 0
5 1002 4 1
5 1003 3 1
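Of course, this is a highly simplified version of my real data, which has around 3 million rows. I have to do the following:
Based on the maximum value in the Term column, insert that many columns filled with NA into data1. The column names should be Week1, Week2, Week3 and so on. Then replace the NA values using these rules:
1) If Term is 5, insert 0 in Week1, Week2, up to Week4, and 1 in Week5.
2) If Term is 4, insert 0 in Week1, Week2 and Week3, 1 in Week4, and keep NA in Week5.
And so on. The final output should look like this: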
prodID storeID Term Exit Week1 Week2 Week3 Week4 Week5
1 1001 5 0 0 0 0 0 1
1 1002 4 1 0 0 0 1 NA
1 1003 3 1 0 0 1 NA NA
1 1004 5 0 0 0 0 0 1
2 1001 4 1 0 0 0 1 NA
2 1002 3 1 0 0 1 NA NA
2 1003 5 0 0 0 0 0 1
3 1001 4 1 0 0 0 1 NA
3 1002 3 1 0 0 1 NA NA
3 1003 5 0 0 0 0 0 1
4 1001 4 1 0 0 0 1 NA
4 1002 3 1 0 0 1 NA NA
5 1001 5 0 0 0 0 0 1
5 1002 4 1 0 0 0 1 NA
5 1003 3 1 0 0 1 NA NA
This is what I tried:
variant <- c("Week1","Week2","Week3","Week4","Week5")
data1[variant] <- NA
for (i in 1:length(data1$prodID)){
data1$Week1 <- ifelse(data1$Term==1,1,0)
data1$Week2 <- ifelse(data1$Term==2,1,0)
data1$Week3 <- ifelse(data1$Term==3,1,0)
data1$Week4 <- ifelse(data1$Term==4,1,0)
data1$Week5 <- ifelse(data1$Term==5,1,0)
}
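This does not populate NA in the cells where I need it. I want to keep the NA values because I will later reshape the data frame from wide to long. I also know that the approach above is not feasible on my huge dataset. Any suggestions are most welcome.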
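For reference, the attempt above writes 0 wherever Term differs from the week number, so the weeks after Term end up as 0 rather than NA; the surrounding for loop also repeats the same whole-column assignments once per row. A minimal sketch of a three-way rule, with one vectorized pass per week column, might look like this (the small data1 here is a made-up three-row stand-in, not the real data):

```r
# Toy stand-in for data1 (values chosen for illustration only)
data1 <- data.frame(prodID = c(1, 1, 1),
                    storeID = c(1001, 1002, 1003),
                    Term = c(5, 4, 3),
                    Exit = c(0, 1, 1))

max_term <- max(data1$Term)

# One vectorized assignment per week column; no loop over rows is needed
for (k in seq_len(max_term)) {
  data1[[paste0("Week", k)]] <- ifelse(
    k < data1$Term, 0L,                       # weeks before the Term week
    ifelse(k == data1$Term, 1L, NA_integer_)  # the Term week, NA afterwards
  )
}

data1
```

The loop runs once per week column (five times here) regardless of the number of rows.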
Here is an idea. We can build the desired content as a single string column and then split that column.
library(dplyr)
library(data.table)
library(splitstackshape)
dat2 <- dat %>%
mutate(Week = case_when(
Term == 5 ~"0,0,0,0,1",
Term == 4 ~"0,0,0,1,NA",
Term == 3 ~"0,0,1,NA,NA",
Term == 2 ~"0,1,NA,NA,NA",
Term == 1 ~"1,NA,NA,NA,NA"
)) %>%
cSplit(splitCols = "Week")
dat2
# prodID storeID Term Exit Week_1 Week_2 Week_3 Week_4 Week_5
# 1: 1 1001 5 0 0 0 0 0 1
# 2: 1 1002 4 1 0 0 0 1 NA
# 3: 1 1003 3 1 0 0 1 NA NA
# 4: 1 1004 5 0 0 0 0 0 1
# 5: 2 1001 4 1 0 0 0 1 NA
# 6: 2 1002 3 1 0 0 1 NA NA
# 7: 2 1003 5 0 0 0 0 0 1
# 8: 3 1001 4 1 0 0 0 1 NA
# 9: 3 1002 3 1 0 0 1 NA NA
# 10: 3 1003 5 0 0 0 0 0 1
# 11: 4 1001 4 1 0 0 0 1 NA
# 12: 4 1002 3 1 0 0 1 NA NA
# 13: 5 1001 5 0 0 0 0 0 1
# 14: 5 1002 4 1 0 0 0 1 NA
# 15: 5 1003 3 1 0 0 1 NA NA
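The pattern strings in the case_when() call above are typed by hand, which gets tedious when Term can be large. As a rough sketch, they could be generated instead; week_pattern below is a hypothetical helper name, not part of dplyr or splitstackshape:

```r
# Hypothetical helper: build the "0,...,0,1,NA,...,NA" string for one Term value
week_pattern <- function(term, max_term) {
  paste(c(rep(0, term - 1),           # zeros before the Term week
          1,                          # the Term week itself
          rep(NA, max_term - term)),  # NA for the remaining weeks
        collapse = ",")
}

terms <- c(5, 4, 3)
patterns <- vapply(terms, week_pattern, character(1), max_term = max(terms))
patterns
# "0,0,0,0,1"  "0,0,0,1,NA"  "0,0,1,NA,NA"
```

A vector built this way from the Term column could presumably stand in for the case_when() result before the column is split.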
Or use this tidyverse approach. I like it better than the previous one because it does not require typing the week values by hand.
library(dplyr)
library(tidyr)
library(purrr)
dat2 <- dat %>%
mutate(Week = map2(1, Term, `:`)) %>%
unnest() %>%
group_by(prodID, Term) %>%
mutate(Week_Value = as.integer(Week == max(Week)),
Week = paste0("Week", Week)) %>%
spread(Week, Week_Value) %>%
ungroup()
dat2
# # A tibble: 15 x 9
# prodID storeID Term Exit Week1 Week2 Week3 Week4 Week5
# <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 1 1001 5 0 0 0 0 0 1
# 2 1 1002 4 1 0 0 0 1 NA
# 3 1 1003 3 1 0 0 1 NA NA
# 4 1 1004 5 0 0 0 0 0 1
# 5 2 1001 4 1 0 0 0 1 NA
# 6 2 1002 3 1 0 0 1 NA NA
# 7 2 1003 5 0 0 0 0 0 1
# 8 3 1001 4 1 0 0 0 1 NA
# 9 3 1002 3 1 0 0 1 NA NA
# 10 3 1003 5 0 0 0 0 0 1
# 11 4 1001 4 1 0 0 0 1 NA
# 12 4 1002 3 1 0 0 1 NA NA
# 13 5 1001 5 0 0 0 0 0 1
# 14 5 1002 4 1 0 0 0 1 NA
# 15 5 1003 3 1 0 0 1 NA NA
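In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(), and unnest() expects the column to be named through its cols argument. A sketch of the same idea in that newer style, assuming those package versions are installed and using a three-row toy data frame; it also compares Week with Term directly instead of grouping and taking max(Week), which is a simplification of mine rather than the code above:

```r
library(dplyr)
library(tidyr)
library(purrr)

# Toy data for illustration only
dat <- data.frame(prodID = c(1L, 1L, 1L),
                  storeID = c(1001L, 1002L, 1003L),
                  Term = c(5L, 4L, 3L),
                  Exit = c(0L, 1L, 1L))

dat2 <- dat %>%
  mutate(Week = map(Term, seq_len)) %>%          # list column: 1..Term per row
  unnest(cols = Week) %>%                        # one row per (original row, week)
  mutate(Week_Value = as.integer(Week == Term),  # 1 in the Term week, 0 before it
         Week = paste0("Week", Week)) %>%
  pivot_wider(names_from = Week, values_from = Week_Value)

dat2
```

pivot_wider() orders the new columns by first appearance, so the Week columns come out in order here only because the first row has the largest Term; other data may need the columns reordered afterwards.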
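Update
We can use str_pad from the stringr package to pad the week numbers with zeros before spreading the Week column, so that the column names sort correctly.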
library(tidyverse)
dat2 <- dat %>%
mutate(Week = map2(1, Term, `:`)) %>%
unnest() %>%
group_by(prodID, Term) %>%
mutate(Week_Value = as.integer(Week == max(Week)),
Week = paste0("Week", str_pad(Week, width = 3, pad = "0"))) %>%
spread(Week, Week_Value) %>%
ungroup()
dat2
# # A tibble: 15 x 9
# prodID storeID Term Exit Week001 Week002 Week003 Week004 Week005
# <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 1 1001 5 0 0 0 0 0 1
# 2 1 1002 4 1 0 0 0 1 NA
# 3 1 1003 3 1 0 0 1 NA NA
# 4 1 1004 5 0 0 0 0 0 1
# 5 2 1001 4 1 0 0 0 1 NA
# 6 2 1002 3 1 0 0 1 NA NA
# 7 2 1003 5 0 0 0 0 0 1
# 8 3 1001 4 1 0 0 0 1 NA
# 9 3 1002 3 1 0 0 1 NA NA
# 10 3 1003 5 0 0 0 0 0 1
# 11 4 1001 4 1 0 0 0 1 NA
# 12 4 1002 3 1 0 0 1 NA NA
# 13 5 1001 5 0 0 0 0 0 1
# 14 5 1002 4 1 0 0 0 1 NA
# 15 5 1003 3 1 0 0 1 NA NA
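The padding matters once there are ten or more weeks, because names sorted as text place Week10 before Week2. A small base R illustration of the problem and of zero-padding, here with sprintf() as an alternative to str_pad() that needs no extra package:

```r
weeks <- 1:12

# Unpadded names: text sorting puts Week10..Week12 before Week2
unpadded_sorted <- sort(paste0("Week", weeks))
unpadded_sorted

# Zero-padded names keep their numeric order when sorted as text
padded <- sprintf("Week%03d", weeks)
identical(sort(padded), padded)
```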
Data
dat <- read.table(text = "prodID storeID Term Exit
1 1001 5 0
1 1002 4 1
1 1003 3 1
1 1004 5 0
2 1001 4 1
2 1002 3 1
2 1003 5 0
3 1001 4 1
3 1002 3 1
3 1003 5 0
4 1001 4 1
4 1002 3 1
5 1001 5 0
5 1002 4 1
5 1003 3 1",
header = TRUE)
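Here is a base R option: we loop over 'Term', use tabulate to get the 0s and the 1 for each element, append NA at the end with length<-, and rbind the list elements to create the columns of interest.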
dat[paste0("Week", 1:5)] <- do.call(rbind, lapply(dat$Term,
function(x) `length<-`(tabulate(x), max(dat$Term))))
dat
# prodID storeID Term Exit Week1 Week2 Week3 Week4 Week5
#1 1 1001 5 0 0 0 0 0 1
#2 1 1002 4 1 0 0 0 1 NA
#3 1 1003 3 1 0 0 1 NA NA
#4 1 1004 5 0 0 0 0 0 1
#5 2 1001 4 1 0 0 0 1 NA
#6 2 1002 3 1 0 0 1 NA NA
#7 2 1003 5 0 0 0 0 0 1
#8 3 1001 4 1 0 0 0 1 NA
#9 3 1002 3 1 0 0 1 NA NA
#10 3 1003 5 0 0 0 0 0 1
#11 4 1001 4 1 0 0 0 1 NA
#12 4 1002 3 1 0 0 1 NA NA
#13 5 1001 5 0 0 0 0 0 1
#14 5 1002 4 1 0 0 0 1 NA
#15 5 1003 3 1 0 0 1 NA NA
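To see why this works, it may help to run the two building blocks on a single Term value (3 is just an example value):

```r
x <- 3

# tabulate() counts how often each integer 1..max(x) occurs,
# so a single value 3 gives zeros followed by one 1
counts <- tabulate(x)
counts
# 0 0 1

# `length<-` extends the vector to the requested length, padding with NA
padded_row <- `length<-`(counts, 5)
padded_row
# 0 0 1 NA NA
```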
Or a similar approach with the tidyverse
library(tidyverse)
dat %>%
mutate(Week = map(Term, ~
tabulate(.x) %>%
as.list %>%
set_names(paste0("Week", seq_along(.))) %>%
as_tibble)) %>%
unnest
# prodID storeID Term Exit Week1 Week2 Week3 Week4 Week5
#1 1 1001 5 0 0 0 0 0 1
#2 1 1002 4 1 0 0 0 1 NA
#3 1 1003 3 1 0 0 1 NA NA
#4 1 1004 5 0 0 0 0 0 1
#5 2 1001 4 1 0 0 0 1 NA
#6 2 1002 3 1 0 0 1 NA NA
#7 2 1003 5 0 0 0 0 0 1
#8 3 1001 4 1 0 0 0 1 NA
#9 3 1002 3 1 0 0 1 NA NA
#10 3 1003 5 0 0 0 0 0 1
#11 4 1001 4 1 0 0 0 1 NA
#12 4 1002 3 1 0 0 1 NA NA
#13 5 1001 5 0 0 0 0 0 1
#14 5 1002 4 1 0 0 0 1 NA
#15 5 1003 3 1 0 0 1 NA NA
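An option using dplyr::mutate_at and case_when: quo_name(quo(.)) gives the name of the column being processed, the week number is extracted from that name as an integer, and it is then checked whether the column number is less than, equal to, or greater than the Term value.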
# First add additional columns based on maximum value of Term
df[,paste("Week", 1:max(df$Term), sep="")] <- NA
library(dplyr)
df %>% mutate_at(vars(starts_with("Week")), funs(case_when(
as.integer(sub("^Week", "", quo_name(quo(.)))) < Term ~ 0L,
as.integer(sub("^Week", "", quo_name(quo(.)))) == Term ~ 1L,
TRUE ~ NA_integer_
)))
# prodID storeID Term Exit Week1 Week2 Week3 Week4 Week5
# 1 1 1001 5 0 0 0 0 0 1
# 2 1 1002 4 1 0 0 0 1 NA
# 3 1 1003 3 1 0 0 1 NA NA
# 4 1 1004 5 0 0 0 0 0 1
# 5 2 1001 4 1 0 0 0 1 NA
# 6 2 1002 3 1 0 0 1 NA NA
# 7 2 1003 5 0 0 0 0 0 1
# 8 3 1001 4 1 0 0 0 1 NA
# 9 3 1002 3 1 0 0 1 NA NA
# 10 3 1003 5 0 0 0 0 0 1
# 11 4 1001 4 1 0 0 0 1 NA
# 12 4 1002 3 1 0 0 1 NA NA
# 13 5 1001 5 0 0 0 0 0 1
# 14 5 1002 4 1 0 0 0 1 NA
# 15 5 1003 3 1 0 0 1 NA NA
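The same less-than / equal / greater-than comparison can also be sketched in base R on a whole matrix of week numbers at once, without any column-name parsing. This is an alternative I am adding for illustration, using a three-row toy df; it has not been benchmarked on the 3-million-row case:

```r
# Toy data for illustration only
df <- data.frame(prodID = c(1, 1, 1),
                 storeID = c(1001, 1002, 1003),
                 Term = c(5, 4, 3),
                 Exit = c(0, 1, 1))

max_term <- max(df$Term)

# week_no[i, j] equals j, the week number of each cell
week_no <- matrix(seq_len(max_term), nrow = nrow(df), ncol = max_term,
                  byrow = TRUE)

# df$Term is recycled down each column, so row i is compared with df$Term[i]
weeks <- ifelse(week_no < df$Term, 0L,
                ifelse(week_no == df$Term, 1L, NA_integer_))
colnames(weeks) <- paste0("Week", seq_len(max_term))

result <- cbind(df, weeks)
result
```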
Data:
df <- read.table(text="
prodID storeID Term Exit
1 1001 5 0
1 1002 4 1
1 1003 3 1
1 1004 5 0
2 1001 4 1
2 1002 3 1
2 1003 5 0
3 1001 4 1
3 1002 3 1
3 1003 5 0
4 1001 4 1
4 1002 3 1
5 1001 5 0
5 1002 4 1
5 1003 3 1",
header = TRUE, stringsAsFactors = FALSE)