简体   繁体   中英

R- How can I update column values only pertaining to rows where column contains beginning of a certain number?

I have a dataset that looks as follows:

ex <- data.frame("id" =c(rep(1234,6)),
                 "case" = c(45,45,45,92,92,93),
                          "cpt" = c(20600,25263,8456,2345,9023,"C1245"),
                         "date"=c("2019-05-21", "2019-08-22","2019-04-12",
                                  "2019-03-01", "2019-02-18", "2019-03-12"),
                         "window"=c(10,20,NA,45,NA,NA),
                         "pay"=c(520,140,2200,230,600,700))

I essentially have two questions: How can I update the cpt column with the cpt value corresponding to the row with the highest pay grouped by case only focusing on any row containing a cpt value beginning with number 2 AND following this, how I can update the window column to replace any value that isn't the maximum with NA (also only focusing on rows where cpt value starts with 2)

I would like the final output to look like so:

ex1 <- data.frame("id" =c(rep(1234,6)),
                 "case" = c(45,45,45,92,92,93),
                 "cpt" = c(20600,20600,20600,2345,2345,"C1245"),
                 "date"=c("2019-05-21", "2019-08-22","2019-04-12",
                          "2019-03-01", "2019-02-18", "2019-03-12"),
                 "window"=c(10,NA,NA,45,NA,NA),
                 "pay"=c(520,140,2200,230,600,700))

I have two codes

ex[, cpt:=cpt[which.max(pay)], by=case]

ex %>% group_by(case) %>% mutate(window=replace(window, (pay < max(pay)), NA)) %>%
  setDT()

but both of these codes apply to the overall max value and will produce

ex2 <- data.frame("id" =c(rep(1234,6)),
                  "case" = c(45,45,45,92,92,93),
                  "cpt" = c(8456,8456,8456,9023,9023,"C1245"),
                  "date"=c("2019-05-21", "2019-08-22","2019-04-12",
                           "2019-03-01", "2019-02-18", "2019-03-12"),
                  "window"=c(NA,NA,NA,NA,NA,NA),
                  "pay"=c(520,140,2200,230,600,700))

Is there any way I could alter the codes I have to get the output I'm looking for? Or any other code that might be more efficient? Thanks in advance!

We create a logical vector ( i1) based on the first character of 'cpt' ( substr - as the OP is interested in those elements that start with "2"), grouped by 'id', 'case', if there are any 'i1', then use that to subset the 'pay' get the index of the 'max' element with which.max and use that to subset the subset of 'cpt' ( cpt[i1] ) or else return the original column 'cpt', then we replace the 'window' where the 'cpt' is not the same as the duplicate column 'cpt1' (original column) to NA and ungroup

library(dplyr)
ex %>% 
  mutate(cpt1 = cpt) %>%
  group_by(id, case) %>% 
  mutate(i1 = substr(cpt, 1, 1) ==  "2", 
     cpt = if(any(i1)) cpt[i1][which.max(pay[i1])] else cpt,
     window = replace(window, cpt != cpt1, NA),
           i1 = NULL, cpt1 = NULL) %>%
    ungroup

-output

# A tibble: 6 x 6
#     id  case cpt   date       window   pay
#  <dbl> <dbl> <chr> <chr>       <dbl> <dbl>
#1  1234    45 20600 2019-05-21     10   520
#2  1234    45 20600 2019-08-22     NA   140
#3  1234    45 20600 2019-04-12     NA  2200
#4  1234    92 2345  2019-03-01     45   230
#5  1234    92 2345  2019-02-18     NA   600
#6  1234    93 C1245 2019-03-12     NA   700

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM