
Grep lines based on identifier

My data frame looks like this:

hsa-let-7a-3p   45
hsa-let-7a-5p   1148
hsa-let-7b-3p   8
hsa-let-7b-5p   184
hsa-let-7c-3p   1
hsa-let-7c-5p   258
hsa-let-7d-5p   343

I would like to count how many identifiers have both a 3p and a 5p line, how many have only 3p, and how many have only 5p. For instance, hsa-let-7a, hsa-let-7b and hsa-let-7c all have both 3p and 5p, whereas hsa-let-7d has only 5p. I don't care about the numbers in the second column. I would prefer a grep-based solution, but R would also be fine.

Desired output:

Both 3p and 5p: 3
Only 3p: 0
Only 5p: 1

My attempt in R:

> head(Meister_Ago1,20)


             V1   V2
1  hsa-let-7a-2-3p    1
2    hsa-let-7a-3p   45
3    hsa-let-7a-5p 1148
4    hsa-let-7b-3p    8
5    hsa-let-7b-5p  184
6    hsa-let-7c-3p    1
7    hsa-let-7c-5p  258
8    hsa-let-7d-3p   22
9    hsa-let-7d-5p  142
10   hsa-let-7e-3p    1
11   hsa-let-7e-5p  114
12 hsa-let-7f-1-3p    1
13 hsa-let-7f-2-3p   10
14   hsa-let-7f-5p  794
15   hsa-let-7g-3p    2
16   hsa-let-7g-5p   94
17   hsa-let-7i-3p    2
18   hsa-let-7i-5p   97
19    hsa-miR-1-3p    4
20    hsa-miR-1-5p    0

Maybe:

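# Strip the trailing "-3p"/"-5p" to get the identifier
grp <- sub('-..$', '', df$Col1)
# Keep only the last two characters ("3p" or "5p")
val <- sub('.*(..)$', '\\1', df$Col1)
# Cross-tabulate identifier against arm
tbl <- table(grp, val)
# Identifiers that have both arms
sum(rowSums(tbl)==2)
#[1] 3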

Or

sum(tbl[,1] & tbl[,2])          # both 3p and 5p
#[1] 3
sum(tbl[,1]==0 & tbl[,2]!=0)    # only 5p
#[1] 1
sum(tbl[,1]!=0 & tbl[,2]==0)    # only 3p
#[1] 0

Update

Based on the updated data "Meister_Ago1"

  grp <- sub('-..$', '', Meister_Ago1$V1)
  val <- sub('.*(..)$', '\\1', Meister_Ago1$V1)
  tbl <- table(grp, val)

  sum(tbl[,1] & tbl[,2])          # both 3p and 5p
  #[1] 8
  sum(tbl[,1]==0 & tbl[,2]!=0)    # only 5p
  #[1] 1
  sum(tbl[,1]!=0 & tbl[,2]==0)    # only 3p
  #[1] 3

data

df <- structure(list(Col1 = c("hsa-let-7a-3p", "hsa-let-7a-5p",
"hsa-let-7b-3p", "hsa-let-7b-5p", "hsa-let-7c-3p", "hsa-let-7c-5p", 
"hsa-let-7d-5p"), Col2 = c(45L, 1148L, 8L, 184L, 1L, 258L, 343L)), 
.Names = c("Col1", "Col2"), class = "data.frame", row.names = c(NA, 
-7L))


Meister_Ago1 <- structure(list(V1 = c("hsa-let-7a-2-3p", "hsa-let-7a-3p", 
 "hsa-let-7a-5p", "hsa-let-7b-3p", "hsa-let-7b-5p", "hsa-let-7c-3p", 
 "hsa-let-7c-5p", "hsa-let-7d-3p", "hsa-let-7d-5p", "hsa-let-7e-3p", 
 "hsa-let-7e-5p", 
 "hsa-let-7f-1-3p", "hsa-let-7f-2-3p", "hsa-let-7f-5p", "hsa-let-7g-3p", 
 "hsa-let-7g-5p", "hsa-let-7i-3p", "hsa-let-7i-5p", "hsa-miR-1-3p", 
 "hsa-miR-1-5p"), V2 = c(1L, 45L, 1148L, 8L, 184L, 1L, 258L, 22L, 
  142L, 1L, 114L, 1L, 10L, 794L, 2L, 94L, 2L, 97L, 4L, 0L)), 
 .Names = c("V1", "V2"), class = "data.frame", row.names = 
 c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", 
 "13", "14", "15", "16", "17", "18", "19", "20"))

This awk code should do it:

 awk '{s=h=$1;sub(/-.p$/,"",h);all[h]}
        s~/-3p$/{a[h]} s~/-5p$/{b[h]}
        END{ for(x in all)
                if( x in b && x in a){
                        ca++;
                        delete b[x]
                        delete a[x]
                }
        printf "Both 3p and 5p:%d\n", ca
        printf "Only 3p :%d\n", length(a)
        printf "Only 5p :%d\n", length(b)
}' file

output:

Both 3p and 5p:3
Only 3p :0
Only 5p :1
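
Since the question asks for a grep-based solution, here is a minimal sketch of one, assuming the input is whitespace-separated with the identifier in the first column. It uses grep to select the names ending in -3p or -5p, sed to strip the suffix, and comm to compare the two sorted identifier lists. The temporary file names and the variables both, only3 and only5 are introduced here for illustration and are not part of the original answers.

```shell
#!/bin/sh
# Sketch only: sample rows from the question, written to a temporary file.
export LC_ALL=C                      # same collation for sort and comm
file=$(mktemp); ids3=$(mktemp); ids5=$(mktemp)
cat > "$file" <<'EOF'
hsa-let-7a-3p   45
hsa-let-7a-5p   1148
hsa-let-7b-3p   8
hsa-let-7b-5p   184
hsa-let-7c-3p   1
hsa-let-7c-5p   258
hsa-let-7d-5p   343
EOF

# Take the first column, keep names ending in -3p (or -5p),
# strip that suffix, and store the sorted unique identifiers.
awk '{print $1}' "$file" | grep -e '-3p$' | sed 's/-3p$//' | sort -u > "$ids3"
awk '{print $1}' "$file" | grep -e '-5p$' | sed 's/-5p$//' | sort -u > "$ids5"

# comm compares two sorted lists: -12 keeps the common lines,
# -23 keeps lines unique to the first list, -13 those unique to the second.
both=$(comm -12 "$ids3" "$ids5" | wc -l | tr -d ' ')
only3=$(comm -23 "$ids3" "$ids5" | wc -l | tr -d ' ')
only5=$(comm -13 "$ids3" "$ids5" | wc -l | tr -d ' ')

printf 'Both 3p and 5p: %s\n' "$both"
printf 'Only 3p: %s\n' "$only3"
printf 'Only 5p: %s\n' "$only5"

rm -f "$file" "$ids3" "$ids5"
```

For the seven sample rows this prints 3, 0 and 1, matching the desired output. To run it on real data, drop the here-document and point file at the actual input.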
