
R data frame: extract pairs of rows based on two variables, then combine each pair with a custom column-wise function

I am trying to consolidate sample pairs and their variables according to whether each variable is TRUE in one or both of the pair's sample types. Some samples have only one sample type, but a sample name never has more than one A and one B sample.

For the data frame below:

   a     b     c      d      e     f      g     h      samples_name sample_type
1  FALSE FALSE FALSE  FALSE  FALSE TRUE   FALSE FALSE  PAEEYP         A
2  FALSE TRUE  FALSE  FALSE  FALSE FALSE  FALSE FALSE  PAEEYP         B
3  FALSE FALSE FALSE  FALSE  FALSE FALSE  FALSE TRUE   PAERAH         A
4  FALSE FALSE FALSE  FALSE  FALSE FALSE  FALSE TRUE   PAERAH         B
5  FALSE FALSE FALSE  TRUE   TRUE  FALSE  FALSE FALSE  PAKIYW         A  # only has an A sample

Each cell of the result should take one of four possible values: 1) FALSE = FALSE in both A and B; 2) A = TRUE in A only; 3) B = TRUE in B only; 4) TRUE = TRUE in both. The desired output is:

   a     b     c      d      e     f      g     h      samples_name
1  FALSE B     FALSE  FALSE  FALSE A      FALSE FALSE  PAEEYP         
2  FALSE FALSE FALSE  FALSE  FALSE FALSE  FALSE TRUE   PAERAH         
3  FALSE FALSE FALSE  A      A     FALSE  FALSE FALSE  PAKIYW    
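
To make the four-way rule concrete, here is a minimal sketch of it as a vectorised lookup in base R. The name `label_pair` is purely illustrative, and the sketch assumes a missing sample type has already been treated as FALSE by the caller.

```r
# Illustrative helper (not from the original post): map a pair of logical
# vectors (flag in A, flag in B) to the labels "FALSE", "A", "B", "TRUE".
label_pair <- function(in_a, in_b) {
  labels <- c("FALSE", "A", "B", "TRUE")
  # in_a contributes 1, in_b contributes 2, so the sum 0..3 picks the label
  labels[1 + in_a + 2 * in_b]
}

# One element per case: neither, A only, B only, both
label_pair(c(FALSE, TRUE, FALSE, TRUE), c(FALSE, FALSE, TRUE, TRUE))
# [1] "FALSE" "A"     "B"     "TRUE"
```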

I am stuck and don't know how to do this. I suppose I need to subset/group the rows by sample name, sort them by sample type, then apply some column-wise ifelse function within each group before merging the results back into one data frame. I thought about using ddply to do the subsetting and apply a colwise function, but I can't get my head around it. I suspect I am overthinking the problem; any help will be appreciated.

I ran into some issues because your desired output mixes logical and character...
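
A short illustration of why that mix is awkward: a column in R holds a single type, so combining logical values with the labels "A" and "B" turns the whole column into character, and converting it back to logical loses the labels.

```r
# Mixing a logical with a string coerces everything to character
mixed <- c(FALSE, "A")
class(mixed)
# [1] "character"

# Going back to logical turns the labels into NA
as.logical(c("TRUE", "FALSE", "A"))
# [1]  TRUE FALSE    NA
```

This is why the code below converts the values to character before writing "A", "B" or "TRUE" into them.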

This solution is not the prettiest; it was hacked together on the fly ;-).
But perhaps it will point you in the right direction, or inspire others to come up with better answers...

sample data

library( data.table )

DT <- fread("a     b     c      d      e     f      g     h      samples_name sample_type
  FALSE FALSE FALSE  FALSE  FALSE TRUE   FALSE FALSE  PAEEYP         A
  FALSE TRUE  FALSE  FALSE  FALSE FALSE  FALSE FALSE  PAEEYP         B
  FALSE FALSE FALSE  FALSE  FALSE FALSE  FALSE TRUE   PAERAH         A
  FALSE FALSE FALSE  FALSE  FALSE FALSE  FALSE TRUE   PAERAH         B
  FALSE FALSE FALSE  TRUE   TRUE  FALSE  FALSE FALSE  PAKIYW         A")

code

#melt to long
DT.melt <- melt( DT, id.vars = c( "samples_name", "sample_type" ) )
#set TRUE/FALSE to 1/0
DT.melt[, value := as.numeric( value )]
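#only keep rows where value is 1 (i.e. TRUE)
ans <- DT.melt[ value != 0, ]
#paste the sample types per sample/variable; sort() so a full pair always reads "AB"
ans <- ans[, .(total = paste0(sort(sample_type), collapse = "")), by = .(samples_name, variable)]
ans[ total == "AB", total := "TRUE"]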
#    samples_name variable total
# 1:       PAEEYP        b     B
# 2:       PAKIYW        d     A
# 3:       PAKIYW        e     A
# 4:       PAEEYP        f     A
# 5:       PAERAH        h  TRUE

#create new melt without the sample_type
DT.melt2 <- melt( DT, id.vars = c( "samples_name" ), measure.vars = patterns("^[a-h]$") )
#set value to character, drop duplicates
DT.melt2 <- unique( DT.melt2[, value := as.character(value)], by = c("samples_name", "variable"))
#update join answer
DT.melt2[ ans, value := i.total, on = .(samples_name, variable)]
#and cast back to wide format
dcast(DT.melt2, samples_name ~ variable, value.var = "value")

output

#    samples_name     a     b     c     d     e     f     g     h
# 1:       PAEEYP FALSE     B FALSE FALSE FALSE     A FALSE FALSE
# 2:       PAERAH FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
# 3:       PAKIYW FALSE FALSE FALSE     A     A FALSE FALSE FALSE
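
For comparison, the idea from the question (group by sample name, then apply a column-wise function) can be sketched without reshaping. This is only one possible variant, not the approach used above. The names `label_flags`, `flag_cols` and `res` are made up for the example, and the data is rebuilt by hand so the block runs on its own.

```r
library(data.table)

# Same sample data as above, typed out explicitly
DT <- data.table(
  a = c(FALSE, FALSE, FALSE, FALSE, FALSE),
  b = c(FALSE, TRUE,  FALSE, FALSE, FALSE),
  c = c(FALSE, FALSE, FALSE, FALSE, FALSE),
  d = c(FALSE, FALSE, FALSE, FALSE, TRUE),
  e = c(FALSE, FALSE, FALSE, FALSE, TRUE),
  f = c(TRUE,  FALSE, FALSE, FALSE, FALSE),
  g = c(FALSE, FALSE, FALSE, FALSE, FALSE),
  h = c(FALSE, FALSE, TRUE,  TRUE,  FALSE),
  samples_name = c("PAEEYP", "PAEEYP", "PAERAH", "PAERAH", "PAKIYW"),
  sample_type  = c("A", "B", "A", "B", "A")
)

# Every column except the two identifier columns is treated as a flag
flag_cols <- setdiff(names(DT), c("samples_name", "sample_type"))

# Hypothetical helper: collapse one flag column of one sample into a label.
# any() on an empty selection is FALSE, so a missing A or B row counts as FALSE.
label_flags <- function(flag, type) {
  in_a <- any(flag[type == "A"])
  in_b <- any(flag[type == "B"])
  if (in_a && in_b) "TRUE" else if (in_a) "A" else if (in_b) "B" else "FALSE"
}

# Apply the helper to every flag column within each sample name
res <- DT[, lapply(.SD, label_flags, type = sample_type),
          by = samples_name, .SDcols = flag_cols]
res
```

On this data, `res` should contain the same labels as the output above, with every flag column stored as character.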
