简体   繁体   中英

Creating subset of rows depending on multiple values in one column

I have a data file like this, where the first two columns are corresponding gene names in human and rat(basically same name with different capitalization)

Human   Rat     RNAtype    Exp  Organ
BBS1    Bbs1    reg         7   Lung
ASAP2   Asap2   reg         5   Heart
ASAP2   Asap2   CANONICAL   5   Heart
ASAP2   Asap2   reg         6   Heart
ASAP2   Asap2   reg         8   Lung
ASAP3   Asap3   SCRAMBLED   5   Heart
ASAP3   Asap3   scram       8   Heart
ASAP3   Asap3   CANONICAL   5   Heart
ASAP3   Asap3   reg         5   Heart

Now I want to make subset of it in the following way. If a gene is present in both Lung and heart then I will include this in my subset. for example, here the gene in the first line, BBS1 is present in only Lung but not in Heart, so it won't be in the subset. But ASAP2 is present in both heart and lung. So, it will in the subset. ASAP3 won't be in the subset since it's not present in Lung.

Now the actual data file is pretty big. And there are lots of genes and the order of gene name coming in the first/second column is not ordered like the way here ie it can be ASAP2,ASAP3,ASAP3,BBS1,ASAP2, ASAP2.

Here's another strategy:

> u <- with(df, by(Human, Organ, unique))
> df[df$Human %in% intersect(u[[1]], u[[2]]), ]
  Human   Rat   RNAtype Exp Organ
2 ASAP2 Asap2       reg   5 Heart
3 ASAP2 Asap2 CANONICAL   5 Heart
4 ASAP2 Asap2       reg   6 Heart
5 ASAP2 Asap2       reg   8  Lung

Try this:

# read in sample data
df<-read.table(text="
Human   Rat     RNAtype    Exp  Organ
BBS1    Bbs1    reg         7   Lung
ASAP2   Asap2   reg         5   Heart
ASAP2   Asap2   CANONICAL   5   Heart
ASAP2   Asap2   reg         6   Heart
ASAP2   Asap2   reg         8   Lung
ASAP3   Asap3   SCRAMBLED   5   Heart
ASAP3   Asap3   scram       8   Heart
ASAP3   Asap3   CANONICAL   5   Heart
ASAP3   Asap3   reg         5   Heart", header=TRUE)

merge(df[df$Organ=="Heart",], df[df$Organ=="Lung",], by=c("Human","Rat"))

  Human   Rat RNAtype.x Exp.x Organ.x RNAtype.y Exp.y Organ.y
1 ASAP2 Asap2       reg     5   Heart       reg     8    Lung
2 ASAP2 Asap2 CANONICAL     5   Heart       reg     8    Lung
3 ASAP2 Asap2       reg     6   Heart       reg     8    Lung

Alternatively:

genes2keep <- intersect(df[df$Organ=="Heart",]$Human, df[df$Organ=="Lung",]$Human)
df[df$Human %in% genes2keep, ]

  Human   Rat   RNAtype Exp Organ
2 ASAP2 Asap2       reg   5 Heart
3 ASAP2 Asap2 CANONICAL   5 Heart
4 ASAP2 Asap2       reg   6 Heart
5 ASAP2 Asap2       reg   8  Lung

If your Organ column only contains those two values Lung and Heart, here's another approach:

library(dplyr)

dfsub <- df %>%
  group_by(Human) %>%
  filter(length(unique(Organ)) >= 2)

#Source: local data frame [4 x 5]
#Groups: Human
#
#  Human   Rat   RNAtype Exp Organ
#1 ASAP2 Asap2       reg   5 Heart
#2 ASAP2 Asap2 CANONICAL   5 Heart
#3 ASAP2 Asap2       reg   6 Heart
#4 ASAP2 Asap2       reg   8  Lung

A data.table solution

library(data.table)
setDT(df)[, indx := length(unique(Organ)) >= 2, by = Rat][indx == TRUE]

##    Human   Rat   RNAtype Exp Organ indx
## 1: ASAP2 Asap2       reg   5 Heart TRUE
## 2: ASAP2 Asap2 CANONICAL   5 Heart TRUE
## 3: ASAP2 Asap2       reg   6 Heart TRUE
## 4: ASAP2 Asap2       reg   8  Lung TRUE

With the courtesy of @Arun, here are couple even cooler data.table solutions

setDT(df)[, .SD[length(unique(Organ)) >= 2L], by=Rat]

Or

setDT(df)[df[, .I[length(unique(Organ))>=2L], by=Rat]$V1]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM