Create nominal variable from multiple columns in R
I want to create a variable based on the values of numeric columns. I have not written any user-defined functions in R and need help getting started.
Dataset:
My dataset has over 3k stores, but I created a reproducible example of the first 10 rows. The deliveries for each day of the week show the total volume for that day throughout the year.
Store_Num represents the store number and Total shows the total deliveries for a store throughout the year.
I want the predominant delivery days stored in a variable called Del_Sch, assigned by the following inequalities. If the first condition is TRUE (50-100%), create the variable from the column name. If FALSE, test the second condition and create the variable from all column names between 32-50%, etc. If there are no days over 20%, no predominant delivery days are counted.
-Volume in a day between 50-100% of the total.
-Volume in a day between 32-50% of the total.
-Volume in a day between 25-32% of the total.
-Volume in a day between 20-25% of the total.
-Volume in a day less than 20% of the total.
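The bands above can be sketched as a small lookup. This is only an illustration: the helper name share_band is hypothetical, and it assumes each band includes its lower bound (so exactly 50% counts as the top band), which the question does not state explicitly.

```r
# Hypothetical helper: map a day's share of the total to a band number.
# 1 = 50-100%, 2 = 32-50%, 3 = 25-32%, 4 = 20-25%, NA = below 20% or missing.
# Assumption: each band includes its lower bound.
share_band <- function(share) {
  breaks <- c(0.20, 0.25, 0.32, 0.50)   # ascending lower bounds of the bands
  idx <- findInterval(share, breaks)    # 0 = below 20%, 4 = 50% or more
  ifelse(idx == 0, NA, 5 - idx)         # flip so the highest share is band 1
}

share_band(c(0.55, 0.40, 0.27, 0.21, 0.10, NA))
# 1 2 3 4 NA NA
```

A lower band number means a more dominant day, so the predominant days for a store are the days that share the smallest band number.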
Reproducible Example:
Store_Num <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
#Total deliveries made per day of the week
Sun_Del <- c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
Mon_Del <- c(10, 50, 51, 7, 80, 97, 21, 49, 30, 3)
Tue_Del <- c(7, NA, 2, 50, 5, 56, 1, 4, 35, 52)
Wed_Del <- c(49, 51, 1, 4, 51, 16, 2, 2, 1, 1)
Thu_Del <- c(3, 2, 47, 7, 40, 2, 6, 5, 1, 7)
Fri_Del <- c(50, 49, 3, 51, 53, 86, 9, 52, 25, 52)
Sat_Del <- c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
Total <- c(119, 152, 104, 119, 229, 257, 39, 112, 92, 115)
#Single dataset
Schedule <- data.frame(Store_Num, Sun_Del, Mon_Del, Tue_Del,
                       Wed_Del, Thu_Del, Fri_Del, Sat_Del, Total)
Schedule
Store_Num Sun_Del Mon_Del Tue_Del Wed_Del Thu_Del Fri_Del Sat_Del Total
1 1 NA 10 7 49 3 50 NA 119
2 2 NA 50 NA 51 2 49 NA 152
3 3 NA 51 2 1 47 3 NA 104
4 4 NA 7 50 4 7 51 NA 119
5 5 NA 80 5 51 40 53 NA 229
6 6 NA 97 56 16 2 86 NA 257
7 7 NA 21 1 2 6 9 NA 39
8 8 NA 49 4 2 5 52 NA 112
9 9 NA 30 35 1 1 25 NA 92
10 10 NA 3 52 1 7 52 NA 115
Desired Output:
Store_Num Sun_Del Mon_Del Tue_Del Wed_Del Thu_Del Fri_Del Sat_Del Total Del_Sch
1 1 NA 10 7 49 3 50 NA 119 WF
2 2 NA 50 NA 51 2 49 NA 152 MWF
3 3 NA 51 2 1 47 3 NA 104 MTh
4 4 NA 7 50 4 7 51 NA 119 TF
5 5 NA 80 5 51 40 53 NA 229 MWF
6 6 NA 97 56 16 2 86 NA 257 MTF
7 7 NA 21 1 2 6 9 NA 39 M
8 8 NA 49 4 2 5 52 NA 112 MF
9 9 NA 30 35 1 1 25 NA 92 MTF
10 10 NA 3 52 1 7 52 NA 115 TF
Using tidyr and dplyr. I used the first two letters of each column name as the day abbreviation to avoid the Tuesday/Thursday confusion:
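library(dplyr)
library(tidyr)

Schedule %>%
  # Reshape to long format: one row per store and day
  gather(Day, del, -Store_Num, -Total) %>%
  # Band each day's share of the total: 1 = 50%+, 2 = 32-50%, 3 = 25-32%, 4 = 20-25%
  mutate(proportion = ifelse(del/Total >= 0.5, 1,
                      ifelse(del/Total >= 0.32, 2,
                      ifelse(del/Total >= 0.25, 3,
                      ifelse(del/Total >= 0.20, 4,
                             NA))))) %>%
  group_by(Store_Num) %>%
  # Keep the days in the lowest-numbered band, abbreviated to two letters
  summarise(days = paste0(substr(Day[which(
    proportion == min(proportion, na.rm = TRUE))],
    1, 2), collapse = "")) %>%
  # Attach the result to the original wide table
  merge(Schedule, ., by = "Store_Num")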
Store_Num Sun_Del Mon_Del Tue_Del Wed_Del Thu_Del Fri_Del Sat_Del Total days
1 1 NA 10 7 49 3 50 NA 119 WeFr
2 2 NA 50 NA 51 2 49 NA 152 MoWeFr
3 3 NA 51 2 1 47 3 NA 104 MoTh
4 4 NA 7 50 4 7 51 NA 119 TuFr
5 5 NA 80 5 51 40 53 NA 229 Mo
6 6 NA 97 56 16 2 86 NA 257 MoFr
7 7 NA 21 1 2 6 9 NA 39 Mo
8 8 NA 49 4 2 5 52 NA 112 MoFr
9 9 NA 30 35 1 1 25 NA 92 MoTu
10 10 NA 3 52 1 7 52 NA 115 TuFr
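Since the question mentions user-defined functions, the same rule can also be sketched in base R. This is a minimal sketch rather than the method used above: the function name predominant_days is hypothetical, and it assumes that the day columns all end in "_Del" and that each band includes its lower bound.

```r
# Rebuild the example data from the question
Schedule <- data.frame(
  Store_Num = 1:10,
  Sun_Del = NA_real_,
  Mon_Del = c(10, 50, 51, 7, 80, 97, 21, 49, 30, 3),
  Tue_Del = c(7, NA, 2, 50, 5, 56, 1, 4, 35, 52),
  Wed_Del = c(49, 51, 1, 4, 51, 16, 2, 2, 1, 1),
  Thu_Del = c(3, 2, 47, 7, 40, 2, 6, 5, 1, 7),
  Fri_Del = c(50, 49, 3, 51, 53, 86, 9, 52, 25, 52),
  Sat_Del = NA_real_,
  Total   = c(119, 152, 104, 119, 229, 257, 39, 112, 92, 115)
)

# Hypothetical helper: predominant delivery days for one store.
# `dels` is a named numeric vector of per-day deliveries, `total` the store total.
predominant_days <- function(dels, total) {
  share <- dels / total
  cutoffs <- c(0.50, 0.32, 0.25, 0.20)  # lower bounds of the bands, best first
  for (cutoff in cutoffs) {
    hit <- !is.na(share) & share >= cutoff
    if (any(hit)) {
      # First two letters of the column name, e.g. "Mon_Del" -> "Mo"
      return(paste(substr(names(dels)[hit], 1, 2), collapse = ""))
    }
  }
  ""  # no day reaches 20% of the total
}

# Apply the helper row by row (assumes day columns end in "_Del")
day_cols <- grep("_Del$", names(Schedule), value = TRUE)
Schedule$Del_Sch <- vapply(
  seq_len(nrow(Schedule)),
  function(i) predominant_days(unlist(Schedule[i, day_cols]), Schedule$Total[i]),
  character(1)
)
Schedule[, c("Store_Num", "Total", "Del_Sch")]
```

The loop stops at the first band that contains any day, so every day it returns belongs to that same band. Stores with no day at or above 20% get an empty string here; returning NA instead would be an equally reasonable choice.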
Edit: there are a couple of mismatches between my results and your desired output (rows 5, 6 and 9); according to your rules, your desired output has mistakes there.
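To see where the mismatches come from, the daily shares for those three stores can be computed directly. This is a small check using the numbers from the example; the object names dels, total and shares are only illustrative.

```r
# Per-day deliveries for stores 5, 6 and 9, copied from the example data
dels <- rbind(
  "5" = c(Mon = 80, Tue = 5,  Wed = 51, Thu = 40, Fri = 53),
  "6" = c(Mon = 97, Tue = 56, Wed = 16, Thu = 2,  Fri = 86),
  "9" = c(Mon = 30, Tue = 35, Wed = 1,  Thu = 1,  Fri = 25)
)
total <- c("5" = 229, "6" = 257, "9" = 92)

# Share of each day in the store total, as a percentage.
# Dividing a matrix by a vector recycles down the columns,
# so each row is divided by its own store total.
shares <- round(100 * dels / total, 1)
shares
```

For store 5, only Monday (about 35%) falls in the 32-50% band, while Wednesday and Friday sit at roughly 22-23%, which is a lower band. Likewise, Tuesday for store 6 (about 22%) and Friday for store 9 (about 27%) fall below the 32% cutoff that the other listed days reach.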