
R - Remove one of a pair of rows for each pair in dataframe based on condition

I am writing a script to process data and need to remove one row from each pair of rows in a data set. In the example below, I want to keep the first dilution (which is always the smaller of the two) if its value is below 20,000, but keep the second dilution if the first is over 20,000, no matter what the second dilution's value is. The exact dilution values vary from dataset to dataset, but there are never more than two dilutions per patient, so I always want to check the lowest dilution first against the threshold of 20,000, which stays the same. The real data set also contains a lot of extra columns of metadata. The first table below is the input and the second is the output I want.

Patient   Dilution   Value 
John      2          30000
John      20         15000
George    2          13000
George    20         700
Kelly     2          49000
Kelly     20         24000
Tom       2          80000
Tom       20         30000
Diane     2          700
Diane     20         0

Patient   Dilution   Value
John      20         15000
George    2          13000
Kelly     20         24000
Tom       20         30000
Diane     2          700
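
For anyone who wants to run the answers on this example, the input table can be rebuilt as a data frame roughly like this. The name `df` is an assumption taken from the answers (which call the data `df` or `df1`), and the integer column types are a guess based on the printed output.

```r
# Example input from the question, rebuilt by hand.
# The name `df` and the integer types are assumptions, not part of the question.
df <- data.frame(
  Patient  = rep(c("John", "George", "Kelly", "Tom", "Diane"), each = 2),
  Dilution = rep(c(2L, 20L), times = 5),
  Value    = c(30000L, 15000L, 13000L, 700L, 49000L,
               24000L, 80000L, 30000L, 700L, 0L),
  stringsAsFactors = FALSE
)

# Each patient should appear exactly twice, lowest dilution first.
table(df$Patient)
```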

If you would like to look at the rest of my code, here it is (yes, I am a noob).

###SA Summary

sadf <- merge(mydata, elisadata, "Description", all.x = TRUE)

sadf <- sadf[grep("X", sadf$Type),]
sadf <- sadf[-grep("Blank", sadf$Name),]
sadf <- sadf[-grep("MulV", sadf$Name),]
sadf <- sadf[,c("Isotype","Name","Description","Dilution.x","FI-Bkgd-Neg","Error","Conc..ug.ml.")]

sadf$Error <- as.character(sadf$Error)
sadf$Error[sadf$Conc..ug.ml. < 0.05] <- "LC"
sadf$Conc..ug.ml. <- ifelse(!is.na(sadf$Conc..ug.ml.) & sadf$Conc..ug.ml. < 0.05, NA, sadf$Conc..ug.ml.)

sadf$SA <- with(sadf, `FI-Bkgd-Neg` * Dilution.x / Conc..ug.ml.)

sadf$SA[sadf$SA < 0.02] <- 0.02

if (length(unique(sadf$Dilution.x)) > 1) {} ###Where I need to put the answer to the question

sadf$`FI-Bkgd-Neg` <- NULL
sadf$Error[is.na(sadf$Error)] <- 0
sadf$Conc..ug.ml.[is.na(sadf$Conc..ug.ml.)] <- 0
sadf <- reshape(sadf, idvar = c("Description","Dilution.x","Isotype","Error","Conc..ug.ml."), timevar = "Name", direction = "wide")
sadf$Error[sadf$Error == 0] <- NA
sadf$Conc..ug.ml.[sadf$Conc..ug.ml. == 0] <- NA

With dplyr, group_by Patient, and then filter each group down to the rows that satisfy the condition. The condition returns the last Value if the first is over 20000, else the minimum.

library(dplyr)
df %>% group_by(Patient) %>% filter(Value == ifelse(first(Value) > 20000, 
                                                    last(Value), 
                                                    min(Value)))
# Source: local data frame [5 x 3]
# Groups: Patient [5]
# 
#   Patient Dilution Value
#    (fctr)    (int) (int)
# 1    John       20 15000
# 2  George       20   700
# 3   Kelly       20 24000
# 4     Tom       20 30000
# 5   Diane       20     0

Note: this approach follows the wording of the question, which does not produce the result data.frame shown in the question. If the condition is instead supposed to return the first dilution when it is under 20000, all you need to do is change min to first, and you get the result data frame from the question:

df %>% group_by(Patient) %>% filter(Value == ifelse(first(Value) > 20000, 
                                                    last(Value), 
                                                    first(Value)))
# Source: local data frame [5 x 3]
# Groups: Patient [5]
# 
#   Patient Dilution Value
#    (fctr)    (int) (int)
# 1    John       20 15000
# 2  George        2 13000
# 3   Kelly       20 24000
# 4     Tom       20 30000
# 5   Diane        2   700
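
One caveat with filtering on `Value ==`: if both rows for a patient happen to hold the same Value, both rows are kept. A possible variation is to pick the row by position with `slice()` instead. This is only a sketch, and it assumes the rows for each patient are already ordered with the lowest dilution first; the example data is rebuilt by hand here.

```r
library(dplyr)

# Example input from the question, rebuilt by hand (an assumption).
df <- data.frame(
  Patient  = rep(c("John", "George", "Kelly", "Tom", "Diane"), each = 2),
  Dilution = rep(c(2L, 20L), times = 5),
  Value    = c(30000L, 15000L, 13000L, 700L, 49000L,
               24000L, 80000L, 30000L, 700L, 0L),
  stringsAsFactors = FALSE
)

result <- df %>%
  group_by(Patient) %>%
  # Keep the last row of the group if the first Value is over the threshold,
  # otherwise keep the first row. Exactly one row per patient is returned.
  slice(if (first(Value) > 20000) n() else 1) %>%
  ungroup()

result
```

Because rows are chosen by position rather than by matching on Value, ties cannot produce duplicate rows, but the result depends on the row order within each patient.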

We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)) and, grouped by 'Patient', use an if/else condition to get the row index (.I) of the minimum 'Value' if that minimum is below 20000, or else the index of the last row of the group (.N). That index is then used to subset the rows.

library(data.table)
setDT(df1)[df1[ ,  .I[if(min(Value) < 20000) 
        which.min(Value) else .N] , Patient]$V1]
#    Patient Dilution Value
#1:    John       20 15000
#2:  George       20   700
#3:   Kelly       20 24000
#4:     Tom       20 30000
#5:   Diane       20     0

If the condition is based on the first "Value", change min(Value) to first(Value) or Value[1L], and use 1 instead of which.min:

setDT(df1)[df1[ ,  .I[if(Value[1L] <20000) 
              1 else .N], Patient]$V1]
#   Patient Dilution Value
#1:    John       20 15000
#2:  George        2 13000
#3:   Kelly       20 24000
#4:     Tom       20 30000
#5:   Diane        2   700
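
For completeness, the same rule can probably be written without any packages. The following is a rough base R sketch, not something from the answers above: `keep_dilution()` and its `threshold` argument are hypothetical names, and it uses `>` for the comparison, so a first value of exactly 20000 keeps the first dilution (the question does not say which way that case should go).

```r
# Example input from the question, rebuilt by hand (an assumption).
df <- data.frame(
  Patient  = rep(c("John", "George", "Kelly", "Tom", "Diane"), each = 2),
  Dilution = rep(c(2L, 20L), times = 5),
  Value    = c(30000L, 15000L, 13000L, 700L, 49000L,
               24000L, 80000L, 30000L, 700L, 0L),
  stringsAsFactors = FALSE
)

# Hypothetical helper, not from the original post.
keep_dilution <- function(data, threshold = 20000) {
  # Sort so the lowest dilution comes first within each patient.
  # Note that this also reorders the patients alphabetically.
  data <- data[order(data$Patient, data$Dilution), ]

  first_row <- !duplicated(data$Patient)                   # lowest dilution
  last_row  <- !duplicated(data$Patient, fromLast = TRUE)  # highest dilution

  # Value at the lowest dilution, repeated on every row of that patient.
  first_value <- ave(data$Value, data$Patient, FUN = function(v) v[1])

  # Over the threshold: keep the last row. Otherwise: keep the first row.
  keep <- ifelse(first_value > threshold, last_row, first_row)
  data[keep, ]
}

res <- keep_dilution(df)
res
```

All other columns are carried along unchanged, so this should also work when the data has extra metadata columns, as long as the grouping and dilution columns are named `Patient` and `Dilution`.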


 