Grep lines based on identifier

My data frame looks like this:

hsa-let-7a-3p   45
hsa-let-7a-5p   1148
hsa-let-7b-3p   8
hsa-let-7b-5p   184
hsa-let-7c-3p   1
hsa-let-7c-5p   258
hsa-let-7d-5p   343

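I want to count how many identifiers have both a 3p and a 5p line, how many have only 3p, and how many have only 5p. For example, hsa-let-7a, hsa-let-7b and hsa-let-7c all have both 3p and 5p, whereas hsa-let-7d has only 5p. I don't care about the numbers in the second column. I would prefer a grep-based solution, but R is fine too.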

Desired output:

Both 3p and 5p: 3
Only 3p: 0
Only 5p: 1

My attempt in R:

> head(Meister_Ago1,20)


             V1   V2
1  hsa-let-7a-2-3p    1
2    hsa-let-7a-3p   45
3    hsa-let-7a-5p 1148
4    hsa-let-7b-3p    8
5    hsa-let-7b-5p  184
6    hsa-let-7c-3p    1
7    hsa-let-7c-5p  258
8    hsa-let-7d-3p   22
9    hsa-let-7d-5p  142
10   hsa-let-7e-3p    1
11   hsa-let-7e-5p  114
12 hsa-let-7f-1-3p    1
13 hsa-let-7f-2-3p   10
14   hsa-let-7f-5p  794
15   hsa-let-7g-3p    2
16   hsa-let-7g-5p   94
17   hsa-let-7i-3p    2
18   hsa-let-7i-5p   97
19    hsa-miR-1-3p    4
20    hsa-miR-1-5p    0

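Maybe

# Strip the trailing "-3p"/"-5p" to get the identifier
grp <- sub('-..$', '', df$Col1)
# Keep the last two characters ("3p" or "5p")
val <- sub('.*(..)$', '\\1', df$Col1)
# Cross-tabulate identifier by arm
tbl <- table(grp, val)
# Identifiers with both arms (assumes each name occurs only once, so the row sum is 2)
sum(rowSums(tbl)==2)
#[1] 3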

Or

# Column 1 of tbl is "3p", column 2 is "5p"
# Both 3p and 5p
sum(tbl[,1] & tbl[,2])
#[1] 3
# Only 5p
sum(tbl[,1]==0 & tbl[,2]!=0)
#[1] 1
# Only 3p
sum(tbl[,1]!=0 & tbl[,2]==0)
#[1] 0

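Update

Based on the updated data "Meister_Ago1":

grp <- sub('-..$', '', Meister_Ago1$V1)
val <- sub('.*(..)$', '\\1', Meister_Ago1$V1)
tbl <- table(grp, val)

# Both 3p and 5p
sum(tbl[,1] & tbl[,2])
#[1] 8
# Only 5p
sum(tbl[,1]==0 & tbl[,2]!=0)
#[1] 1
# Only 3p (names such as hsa-let-7a-2 count as their own identifier)
sum(tbl[,1]!=0 & tbl[,2]==0)
#[1] 3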
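
To see which identifiers end up in each of the three counts above, it can help to list them with a label. The following is only a sketch under a few assumptions: `classify_arms` is a hypothetical helper name, the input is a whitespace-separated file with the miRNA name in the first column (without the row-number column that R prints), and the identifier is everything before the final `-3p` or `-5p`.

```shell
# Hypothetical helper: print each identifier with the label both / only3p / only5p.
# Assumes the miRNA name is the first whitespace-separated field of every line.
classify_arms() {
  awk '
    $1 !~ /-[35]p$/ { next }                 # skip lines without a 3p/5p suffix
    {
      id  = $1
      arm = substr(id, length(id) - 1)       # last two characters: "3p" or "5p"
      sub(/-[35]p$/, "", id)                 # identifier without the arm suffix
      if (!(id in seen)) { seen[id] = 1; order[++n] = id }
      if (arm == "3p") has3[id] = 1
      else             has5[id] = 1
    }
    END {
      # Report identifiers in order of first appearance
      for (i = 1; i <= n; i++) {
        id = order[i]
        if ((id in has3) && (id in has5)) label = "both"
        else if (id in has3)              label = "only3p"
        else                              label = "only5p"
        print id, label
      }
    }
  ' "$@"
}
```

Running `classify_arms file | grep -c ' only3p$'` on the 20 names of Meister_Ago1 should count hsa-let-7a-2, hsa-let-7f-1 and hsa-let-7f-2, because stripping only the last suffix leaves them as separate identifiers from hsa-let-7a and hsa-let-7f.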

Data

df <- structure(list(Col1 = c("hsa-let-7a-3p", "hsa-let-7a-5p",
"hsa-let-7b-3p", "hsa-let-7b-5p", "hsa-let-7c-3p", "hsa-let-7c-5p", 
"hsa-let-7d-5p"), Col2 = c(45L, 1148L, 8L, 184L, 1L, 258L, 343L)), 
.Names = c("Col1", "Col2"), class = "data.frame", row.names = c(NA, 
-7L))


Meister_Ago1 <- structure(list(V1 = c("hsa-let-7a-2-3p", "hsa-let-7a-3p", 
 "hsa-let-7a-5p", "hsa-let-7b-3p", "hsa-let-7b-5p", "hsa-let-7c-3p", 
 "hsa-let-7c-5p", "hsa-let-7d-3p", "hsa-let-7d-5p", "hsa-let-7e-3p", 
 "hsa-let-7e-5p", 
 "hsa-let-7f-1-3p", "hsa-let-7f-2-3p", "hsa-let-7f-5p", "hsa-let-7g-3p", 
 "hsa-let-7g-5p", "hsa-let-7i-3p", "hsa-let-7i-5p", "hsa-miR-1-3p", 
 "hsa-miR-1-5p"), V2 = c(1L, 45L, 1148L, 8L, 184L, 1L, 258L, 22L, 
  142L, 1L, 114L, 1L, 10L, 794L, 2L, 94L, 2L, 97L, 4L, 0L)), 
 .Names = c("V1", "V2"), class = "data.frame", row.names = 
 c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", 
 "13", "14", "15", "16", "17", "18", "19", "20"))

This awk code should do it:

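awk '{ s = h = $1; sub(/-.p$/, "", h); all[h] }   # h: identifier without the -3p/-5p suffix
     s ~ /-3p$/ { a[h] }                          # identifiers seen with 3p
     s ~ /-5p$/ { b[h] }                          # identifiers seen with 5p
     END {
         for (x in all)
             if (x in b && x in a) {              # both arms present
                 ca++
                 delete b[x]
                 delete a[x]
             }
         # whatever is left in a and b is 3p-only and 5p-only
         printf "Both 3p and 5p:%d\n", ca
         printf "Only 3p :%d\n", length(a)
         printf "Only 5p :%d\n", length(b)
     }' file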

Output:

Both 3p and 5p:3
Only 3p :0
Only 5p :1
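
Since the question asks for something grep-based, here is a sketch of the same counting done with grep, sed, sort and comm. It is not part of the answer above: `count_arms` is a hypothetical helper name, and the sketch assumes a grep that supports `-o` (GNU and BSD grep do) and a whitespace-separated file with the name in the first column.

```shell
# Hypothetical helper: count identifiers with both arms, only 3p, or only 5p.
# Assumes grep -o is available and the miRNA name is the first field of each line.
count_arms() {
  file=$1
  ids3=$(mktemp)
  ids5=$(mktemp)

  # Identifiers that have a 3p line (suffix stripped, sorted, de-duplicated)
  grep -oE '^[^[:space:]]+' "$file" | grep -- '-3p$' | sed 's/-3p$//' \
    | LC_ALL=C sort -u > "$ids3"
  # Identifiers that have a 5p line
  grep -oE '^[^[:space:]]+' "$file" | grep -- '-5p$' | sed 's/-5p$//' \
    | LC_ALL=C sort -u > "$ids5"

  # comm -12: lines in both files; -23: only in the first; -13: only in the second
  printf 'Both 3p and 5p: %s\n' "$(LC_ALL=C comm -12 "$ids3" "$ids5" | wc -l | tr -d ' ')"
  printf 'Only 3p: %s\n'        "$(LC_ALL=C comm -23 "$ids3" "$ids5" | wc -l | tr -d ' ')"
  printf 'Only 5p: %s\n'        "$(LC_ALL=C comm -13 "$ids3" "$ids5" | wc -l | tr -d ' ')"

  rm -f "$ids3" "$ids5"
}
```

Called as `count_arms file` on the seven-line sample from the question, this prints the three lines of the desired output (3, 0 and 1). The two sorted lists are compared with comm, which needs both inputs sorted in the same collation order, hence the `LC_ALL=C` on sort and comm.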
