I am trying to consolidate sample pairs and their variables based on whether each pair has TRUE or FALSE in one or both of its sample types. Some samples may have only one sample type, but there is never more than one A and one B sample per sample name.
For the data frame below:
a b c d e f g h samples_name sample_type
1 FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE PAEEYP A
2 FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE PAEEYP B
3 FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE PAERAH A
4 FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE PAERAH B
5 FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE PAKIYW A \\only has A sample
The desired output has 4 possible values per cell: 1) FALSE = FALSE in both; 2) A = TRUE in A only; 3) B = TRUE in B only; 4) TRUE = TRUE in both:
a b c d e f g h samples_name
1 FALSE B FALSE FALSE FALSE A FALSE FALSE PAEEYP
2 FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE PAERAH
3 FALSE FALSE FALSE A A FALSE FALSE FALSE PAKIYW
I am stuck and don't know how to do it. I suppose I need to subset/group the rows by sample name, sort them by sample type, then apply some column-wise ifelse function within each group before merging the results back into a data frame. I thought about using ddply to do the subsetting and apply a colwise function, but I can't get my head around it. I suspect I am overthinking the problem; any help will be appreciated.
I ran into some issues because your desired output mixes logical and character...
This solution is not the prettiest. It is hacked together on the fly ;-).
But perhaps it will set you in the right direction, or inspire others to come up with better answers...
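The type issue mentioned above can be seen in a small sketch (the variable name `mixed` is only for illustration): a single R vector or column cannot hold both logical and character values, so everything is coerced to character.

```r
# Combining logical and character values coerces the whole vector to character
mixed <- c(FALSE, "A", TRUE)

class(mixed)   # "character"
mixed          # "FALSE" "A" "TRUE"
```

This is why the code below converts the values to character before filling in "A", "B" and "TRUE".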
sample data
library( data.table )
DT <- fread("a b c d e f g h samples_name sample_type
FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE PAEEYP A
FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE PAEEYP B
FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE PAERAH A
FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE PAERAH B
FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE PAKIYW A")
code
#melt to long
DT.melt <- melt( DT, id.vars = c( "samples_name", "sample_type" ) )
#set TRUE/FALSE to 1/0
DT.melt[, value := as.numeric( value )]
#only keep rows where value > 0
ans <- DT.melt[ value != 0, ]
ans <- ans[, .(total = paste0(sample_type, collapse = "")), by = .(samples_name, variable)]
ans[ total == "AB", total := "TRUE"]
# samples_name variable total
# 1: PAEEYP b B
# 2: PAKIYW d A
# 3: PAKIYW e A
# 4: PAEEYP f A
# 5: PAERAH h TRUE
#create new melt without the sample_type
DT.melt2 <- melt( DT, id.vars = c( "samples_name" ), measure.vars = patterns("^[a-h]$") )
#set value to character, drop duplicates
DT.melt2 <- unique( DT.melt2[, value := as.character(value)], by = c("samples_name", "variable"))
#update join answer
DT.melt2[ ans, value := i.total, on = .(samples_name, variable)]
#and cast back to wide format
dcast(DT.melt2, samples_name ~ variable, value.var = "value")
output
# samples_name a b c d e f g h
# 1: PAEEYP FALSE B FALSE FALSE FALSE A FALSE FALSE
# 2: PAERAH FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
# 3: PAKIYW FALSE FALSE FALSE A A FALSE FALSE FALSE
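As a possible alternative, the grouping idea from the question can also be sketched in base R without melting. This is not the approach used above, just a minimal sketch: `combine_pair` is a hypothetical helper, and it assumes each sample name has at most one A row and one B row, as stated in the question.

```r
# Same sample data as above, read with base R
df <- read.table(text = "a b c d e f g h samples_name sample_type
FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE PAEEYP A
FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE PAEEYP B
FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE PAERAH A
FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE PAERAH B
FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE PAKIYW A",
                 header = TRUE, stringsAsFactors = FALSE)

# Columns holding the TRUE/FALSE flags
flag_cols <- setdiff(names(df), c("samples_name", "sample_type"))

# Hypothetical helper: collapse one flag column of one sample into a single value.
# Assumes at most one A row and one B row per sample.
combine_pair <- function(value, type) {
  hit <- type[value]                      # sample types where the flag is TRUE
  if (length(hit) == 0L) {
    "FALSE"                               # TRUE in neither sample
  } else if (length(hit) == 2L) {
    "TRUE"                                # TRUE in both A and B
  } else {
    hit                                   # "A" or "B" only
  }
}

# Split by sample name, collapse every flag column, and bind the rows together
res <- do.call(rbind, lapply(split(df, df$samples_name), function(g) {
  vals <- lapply(g[flag_cols], combine_pair, type = g$sample_type)
  data.frame(samples_name = g$samples_name[1], vals, stringsAsFactors = FALSE)
}))
rownames(res) <- NULL
res
```

All flag columns in `res` are character, for the same reason as in the data.table solution: "A" and "B" cannot live in a logical column.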