Create column of ascending values based on unique values in another column in R with new data added in batches

Question

I have asked a similar question in Python ( How to create column of ascending values based on unique values in another column in pandas ), and got the script working, but for various reasons I need to achieve the same thing in R now. I am also adding complexity here of wanting to be able to add new batches of data to the dataset periodically.

I have a list of samples which have unique sample ID numbers ("Sample_ID"). Each row of the dataset is a sample. Some samples are duplicated multiple times. I want to create a new set of sample names ("Sample_code") that ascends up from 1 as you go down the rows using a prefix (eg "SAMP00001", "SAMP00002" etc). I want the order of rows to be preserved (as they are roughly in date order of sample collection). And for duplicated samples, I want the number given for Sample_code to correspond to the first row that sample ID appears in, not rows further down the table (which came later in sample collection).

My starting data is illustrated with df1:

# df1
Sample_ID <- c('123123','123456','123123','123789')
Date <- c('15/06/2019', '23/06/2019', '30/06/2019', '07/07/2019')
Variable <- c("blue","red","red","blue")
Batch <- 1
df1 <- data.frame(Sample_ID, Date, Variable, Batch)
df1

I want to create the Sample_code column shown in df1b:

# df1b
Sample_ID <- c('123123','123456','123123','123789')
Date <- c('15/06/2019', '23/06/2019', '30/06/2019', '07/07/2019')
Variable <- c("blue","red","red","blue")
Batch <- 1
Sample_code <- c('SAMP0001', 'SAMP0002', 'SAMP0001', 'SAMP0003')

df1b <- data.frame(Sample_ID, Date, Variable, Batch, Sample_code)
df1b

I would save df1b at this point and those Sample_code names used for downstream processing. The added complexity comes because I will then collect a new batch of samples - let's call it df2 (Batch 2 samples):

# df2
Sample_ID <- c('456789', '123654', '123123', '123789', '121212')
Date <- c('15/07/2019', '31/07/2019', '12/08/2019', '27/08/2019', '31/08/2019')
Variable <- c("blue", "red","blue", "red", "red")
Batch <- 2

df2 <- data.frame(Sample_ID, Date, Variable, Batch)
df2

I want to rbind df2 to the bottom of df1, and generate more Sample_code names for the new rows. Importantly, the new Sample_code names need to take account of any Sample_ID duplicates that were present in df1, but also not change any of the Sample_code names that were already assigned back when I only had df1. The result at this point would be df2b, below:

# df2b
Sample_ID <- c('123123','123456','123123','123789','456789', '123654', '123123', '123789', '121212')
Date <- c('15/06/2019', '23/06/2019', '30/06/2019', '07/07/2019', '15/07/2019', '31/07/2019', '12/08/2019', '27/08/2019', '31/08/2019')
Variable <- c("blue","red","red","blue","blue", "red","blue", "red", "red")
Batch <- c(1,1,1,1,2,2,2,2,2)
Sample_code <- c('SAMP0001', 'SAMP0002', 'SAMP0001', 'SAMP0003', 'SAMP0004', 'SAMP0005', 'SAMP0001', 'SAMP0003', 'SAMP0006')
df2b <- data.frame(Sample_ID, Date, Variable, Batch, Sample_code)
df2b

And then I would add Batch 3 samples in the same way etc etc.

I appreciate there are at least 2 stages to this problem: 1) Producing an ascending list of Sample_code names using unique Sample_ID values; and 2) Building in an iterative way of adding batches of samples. But because the second point impacts on the functionality I want for the Sample_code names I have included both stages here.

Lastly - ideally I want to only use base R and tidyverse packages for this.

Any help much appreciated! Thanks.

Answer 1

Because you need to know all possible sample IDs before sample code assignment, consider reversing the order by calling rbind on all sample data frames. Then assign the Sample_code using factor levels. Otherwise, re-assign Sample_code with each batch data frame.

# BUILD A LIST OF DATA FRAMES BY CALLING lapply ON ITERATIVE PROCESS 
# df_list <- lapply(batch_iterable, method_to_build_sample)
df_list <- list(df1, df1b, df2)       # FOR THIS PARTICULAR POST

# RBIND ALL DFs TOGETHER
df2b <- do.call(rbind, df_list)

df2b <- within(df2b, {
    # CONVERT TO CHARACTER
    Sample_ID <- as.character(Sample_ID)

    # CONVERT TO FACTOR AT POSITIONED VALUES, THEN INTEGER FOR LEVEL NUMBER
    Sample_code <- as.character(as.integer(factor(Sample_ID, levels = unique(Sample_ID))))

    # RE-ASSIGN WITH SAMP AND LEADING ZEROS 
    Sample_code <- ifelse(nchar(Sample_code) == 1, paste0('SAMP000', Sample_code),
                          ifelse(nchar(Sample_code) == 2, paste0('SAMP00', Sample_code),
                                ifelse(nchar(Sample_code) == 3, paste0('SAMP0', Sample_code), NA)
                         )
                   )
})

df2b
#   Sample_ID       Date Variable Batch Sample_code
# 1    123123 15/06/2019     blue     1    SAMP0001
# 2    123456 23/06/2019      red     1    SAMP0002
# 3    123123 30/06/2019      red     1    SAMP0001
# 4    123789 07/07/2019     blue     1    SAMP0003
# 5    456789 15/07/2019     blue     2    SAMP0004
# 6    123654 31/07/2019      red     2    SAMP0005
# 7    123123 12/08/2019     blue     2    SAMP0001
# 8    123789 27/08/2019      red     2    SAMP0003
# 9    121212 31/08/2019      red     2    SAMP0006

Create column of ascending values based on unique values in another column in R with new data added in batches

Question

1 answers

solution1
0 ACCPTED 2020-03-28 19:25:47

Create column of ascending values based on unique values in another column in R with new data added in batches

Question

1 answers

solution1 0 ACCPTED 2020-03-28 19:25:47

solution1
0 ACCPTED 2020-03-28 19:25:47