Creating new columns based on several conditions in R
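I have a data frame with three columns. The unique values of status are "X", "0", "C", "1", "2", "3", "4", "5". I couldn't work out how to group by each id and create several columns based on conditions, for example a target column that is 1 if status is 2, 3, 4, or 5 and 0 otherwise.
month_balance means: the month the data was extracted is the starting point, counting backwards. 0 is the current month, -1 is the previous month, and so on.
status means: 0: 1-29 days past due; 1: 30-59 days past due; 2: 60-89 days past due; 3: 90-119 days past due; 4: 120-149 days past due; 5: overdue or bad debts, write-offs for more than 150 days; C: paid off that month; X: no loan for the month.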
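For readers who want the code meanings at hand while working with the data, here is a minimal sketch of a lookup vector. The names `status_labels`, `example_status`, `example_labels` and `example_months` are made up for illustration and are not part of the original data. It also shows that month_balance, which is stored as text, converts cleanly to integers.

```r
# Hypothetical lookup: status code -> meaning, taken from the description above.
status_labels <- c(
  "0" = "1-29 days past due",
  "1" = "30-59 days past due",
  "2" = "60-89 days past due",
  "3" = "90-119 days past due",
  "4" = "120-149 days past due",
  "5" = "overdue or bad debts, write-offs for more than 150 days",
  "C" = "paid off that month",
  "X" = "no loan for the month"
)

# Illustrative values only; index the lookup by the character codes.
example_status <- c("C", "1", "0", "X")
example_labels <- unname(status_labels[example_status])

# month_balance is character in the data frame, so convert before doing arithmetic.
example_months <- as.integer(c("0", "-1", "-2", "-3"))
```

Indexing by name keeps the mapping in one place, which may be easier to maintain than a chain of conditions.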
df <- data.frame (id = c("5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805"),
month_balance = c("0","-1","-2","-3","-4","-5","-6","-7","-8","-9","-10","-11","-12","-13","-14","-15","0","-1","-2","-3","-4","-5","-6","-7","-8","-9","-10","-11","-12","-13","-14"),
status = c("C","C","C","C","C","C","C","C","C","C","C","C","C","1","0","X","C","C","C","C","C","C","C","C","C","C","C","C","1","0","X")
)
In the end, I would like to get the following output:
df1 <- data.frame(id = c("5008804","5008805"),
                  month_begin = c("16","15"),
                  paid_off = c("13","12"),
                  num_of_pastdues = c("2","2"),
                  no_loan = c("1","1"),
                  target = c("0","0"))
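I'm not sure how to code target, because the target values 0 and 1 can each occur several times among the statuses of a single id.
Here is how I built the other variables: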
library(dplyr)

df %>%
  group_by(id) %>%
  summarise(
    month_begin = max(abs(as.numeric(month_balance))) + 1,
    paid_off = sum(status == "C"),
    num_of_pastdues = sum(status %in% 0:5),
    no_loan = sum(status == "X"))
# A tibble: 2 x 5
  id      month_begin paid_off num_of_pastdues no_loan
  <chr>         <dbl>    <int>           <int>   <int>
1 5008804          16       13               2       1
2 5008805          15       12               2       1
library(tidyverse)
df <- data.frame (id = c("5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008804","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805","5008805"),
month_balance = c("0","-1","-2","-3","-4","-5","-6","-7","-8","-9","-10","-11","-12","-13","-14","-15","0","-1","-2","-3","-4","-5","-6","-7","-8","-9","-10","-11","-12","-13","-14"),
status = c("C","C","C","C","C","C","C","C","C","C","C","C","C","1","0","X","C","C","C","C","C","C","C","C","C","C","C","C","1","0","X")
) %>%
as_tibble()
df %>%
  mutate(target = case_when(status %in% c(2, 3, 4, 5) ~ 1,
                            TRUE ~ 0),
         paid_off = case_when(status == "C" ~ 1,
                              TRUE ~ 0),
         no_loan = case_when(status == "X" ~ 1,
                             TRUE ~ 0)) %>%
  group_by(id) %>%
  summarise(month_begin = n(),
            across(c(paid_off, no_loan, target), sum))
#> # A tibble: 2 x 5
#>   id      month_begin paid_off no_loan target
#>   <chr>         <int>    <dbl>   <dbl>  <dbl>
#> 1 5008804          16       13       1      0
#> 2 5008805          15       12       1      0
Created on 2022-06-29 by the reprex package (v2.0.1)
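Note that summing the target column gives the number of months with status 2-5 for each id, which can exceed 1. If a 0/1 flag per id is wanted instead, `any()` may be a better fit. A minimal sketch, using a small made-up data frame (`toy` and `flags` are illustrative names, not the question's data):

```r
library(dplyr)

# Made-up sample: id "A" has two bad months, id "B" has none.
toy <- data.frame(
  id     = c("A", "A", "A", "B", "B"),
  status = c("C", "2", "3", "0", "X")
)

flags <- toy %>%
  group_by(id) %>%
  summarise(
    # number of months with status 2-5 (what summing a 0/1 column gives)
    target_count = sum(status %in% c("2", "3", "4", "5")),
    # 1 if the id ever had status 2-5, otherwise 0
    target = as.integer(any(status %in% c("2", "3", "4", "5")))
  )

flags
```

Here id "A" gets target_count 2 but target 1, which is the difference between the two definitions.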
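You can try using dplyr. First create variables for the conditions you need, then use summarise to count how many times each condition is met within each group.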
library(dplyr)

df <- df %>%
  # Note: this flags only statuses 2-5; the question's num_of_pastdues uses 0:5
  mutate(num_of_pastdues = case_when(
    status %in% c(2,3,4,5) ~ 1,
    TRUE ~ 0
  )) %>%
  mutate(no_loan = case_when(
    status == "X" ~ 1,
    TRUE ~ 0
  )) %>%
  mutate(paid_off = case_when(
    status == "C" ~ 1,
    TRUE ~ 0
  )) %>%
  group_by(id) %>%
  summarise(num_of_pastdues = sum(num_of_pastdues), no_loan = sum(no_loan), paid_off = sum(paid_off))
A base R solution would be to write a custom function and apply it to each group, i.e.
MyFunction <- function(x){
  month_begin = length(x)
  paid_off = sum(x == 'C')
  num_of_pastdues = sum(x %in% 0:5)
  no_loan = sum(x == 'X')
  target = ifelse(any(x %in% 2:5), 1, 0)
  return(c(month_begin=month_begin, paid_off=paid_off, num_of_pastdues=num_of_pastdues, no_loan=no_loan, target=target))
}
res <- t(sapply(split(df$status, df$id), MyFunction))
#         month_begin paid_off num_of_pastdues no_loan target
# 5008804          16       13               2       1      0
# 5008805          15       12               2       1      0
Then turn it into a data frame with an id column,
res_df <- data.frame(res)
res_df$id <- rownames(res_df)
rownames(res_df) <- NULL
res_df
#  month_begin paid_off num_of_pastdues no_loan target      id
#1          16       13               2       1      0 5008804
#2          15       12               2       1      0 5008805
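As a quick cross-check on per-id counts like these, base R's `table()` can tabulate every status code for every id at once. This is only a sketch on made-up data (`toy` and `status_counts` are illustrative names), not a replacement for the function above:

```r
# Made-up sample, not the question's data.
toy <- data.frame(
  id     = c("A", "A", "A", "B", "B"),
  status = c("C", "C", "1", "0", "X")
)

# Rows are ids, columns are the status codes that occur in the data.
status_counts <- table(toy$id, toy$status)
status_counts

# Individual counts can be picked out by name, e.g. months paid off for id "A".
status_counts["A", "C"]
```

The counts in a column such as "C" or "X" should agree with paid_off and no_loan computed by the other approaches.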