Grep lines based on identifier
My data frame looks like this:
hsa-let-7a-3p 45
hsa-let-7a-5p 1148
hsa-let-7b-3p 8
hsa-let-7b-5p 184
hsa-let-7c-3p 1
hsa-let-7c-5p 258
hsa-let-7d-5p 343
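I want to count how many identifiers have both a 3p and a 5p line, how many have only 3p, and how many have only 5p. For example, hsa-let-7a, hsa-let-7b and hsa-let-7c all have both 3p and 5p, whereas hsa-let-7d has only 5p. I don't care about the numbers that follow. I would prefer a grep-based solution, but R is fine too.
Desired output: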
Both 3p and 5p: 3
Only 3p: 0
Only 5p: 1
My attempt in R:
> head(Meister_Ago1,20)
V1 V2
1 hsa-let-7a-2-3p 1
2 hsa-let-7a-3p 45
3 hsa-let-7a-5p 1148
4 hsa-let-7b-3p 8
5 hsa-let-7b-5p 184
6 hsa-let-7c-3p 1
7 hsa-let-7c-5p 258
8 hsa-let-7d-3p 22
9 hsa-let-7d-5p 142
10 hsa-let-7e-3p 1
11 hsa-let-7e-5p 114
12 hsa-let-7f-1-3p 1
13 hsa-let-7f-2-3p 10
14 hsa-let-7f-5p 794
15 hsa-let-7g-3p 2
16 hsa-let-7g-5p 94
17 hsa-let-7i-3p 2
18 hsa-let-7i-5p 97
19 hsa-miR-1-3p 4
20 hsa-miR-1-5p 0
Perhaps:
grp <- sub('-..$', '', df$Col1)
val <- sub('.*(..)$', '\\1', df$Col1)
tbl <- table(grp, val)
sum(rowSums(tbl)==2)
#[1] 3
Or, counting each category separately (column 1 of tbl is 3p, column 2 is 5p):
sum(tbl[,1] &tbl[,2])
#[1] 3
sum(tbl[,1]==0 & tbl[,2]!=0)
#[1] 1
sum(tbl[,1]!=0 & tbl[,2]==0)
#[1] 0
Based on the updated data "Meister_Ago1":
grp <- sub('-..$', '', Meister_Ago1$V1)
val <- sub('.*(..)$', '\\1', Meister_Ago1$V1)
tbl <- table(grp, val)
sum(tbl[,1] & tbl[,2])
#[1] 8
sum(tbl[,1]==0 & tbl[,2]!=0)
#[1] 1
sum(tbl[,1]!=0 & tbl[,2]==0)
#[1] 3
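The counts above do not show which identifiers fall into each category. As a rough cross-check, the following command-line sketch labels every identifier in the 20-row sample. It assumes the first-column names have been saved to a text file; the file name meister_ids.txt and the labels both / only-3p / only-5p are made up for this illustration and are not part of the original answer.

```shell
#!/bin/sh
# First-column identifiers from the 20-row Meister_Ago1 sample shown in the question.
# The file name is hypothetical.
cat > meister_ids.txt <<'EOF'
hsa-let-7a-2-3p
hsa-let-7a-3p
hsa-let-7a-5p
hsa-let-7b-3p
hsa-let-7b-5p
hsa-let-7c-3p
hsa-let-7c-5p
hsa-let-7d-3p
hsa-let-7d-5p
hsa-let-7e-3p
hsa-let-7e-5p
hsa-let-7f-1-3p
hsa-let-7f-2-3p
hsa-let-7f-5p
hsa-let-7g-3p
hsa-let-7g-5p
hsa-let-7i-3p
hsa-let-7i-5p
hsa-miR-1-3p
hsa-miR-1-5p
EOF

# Strip the trailing -3p/-5p, remember which arms each identifier has,
# then print one labelled line per identifier.
listing=$(awk '{
    id = $1
    arm = substr(id, length(id) - 1)    # last two characters: 3p or 5p
    sub(/-[35]p$/, "", id)              # identifier without the arm suffix
    seen[id] = seen[id] arm
}
END {
    for (id in seen) {
        has3 = (index(seen[id], "3p") > 0)
        has5 = (index(seen[id], "5p") > 0)
        label = (has3 && has5) ? "both" : (has3 ? "only-3p" : "only-5p")
        print label, id
    }
}' meister_ids.txt | sort)

printf '%s\n' "$listing"
```

On this sample the listing shows eight identifiers labelled both, hsa-let-7f as only-5p, and hsa-let-7a-2, hsa-let-7f-1 and hsa-let-7f-2 as only-3p. Note that stripping only the final suffix treats hsa-let-7a-2 as a different identifier from hsa-let-7a, which is why it lands in the only-3p group.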
Data:
df <- structure(list(Col1 = c("hsa-let-7a-3p", "hsa-let-7a-5p",
"hsa-let-7b-3p", "hsa-let-7b-5p", "hsa-let-7c-3p", "hsa-let-7c-5p",
"hsa-let-7d-5p"), Col2 = c(45L, 1148L, 8L, 184L, 1L, 258L, 343L)),
.Names = c("Col1", "Col2"), class = "data.frame", row.names = c(NA,
-7L))
Meister_Ago1 <- structure(list(V1 = c("hsa-let-7a-2-3p", "hsa-let-7a-3p",
"hsa-let-7a-5p", "hsa-let-7b-3p", "hsa-let-7b-5p", "hsa-let-7c-3p",
"hsa-let-7c-5p", "hsa-let-7d-3p", "hsa-let-7d-5p", "hsa-let-7e-3p",
"hsa-let-7e-5p",
"hsa-let-7f-1-3p", "hsa-let-7f-2-3p", "hsa-let-7f-5p", "hsa-let-7g-3p",
"hsa-let-7g-5p", "hsa-let-7i-3p", "hsa-let-7i-5p", "hsa-miR-1-3p",
"hsa-miR-1-5p"), V2 = c(1L, 45L, 1148L, 8L, 184L, 1L, 258L, 22L,
142L, 1L, 114L, 1L, 10L, 794L, 2L, 94L, 2L, 97L, 4L, 0L)),
.Names = c("V1", "V2"), class = "data.frame", row.names =
c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12",
"13", "14", "15", "16", "17", "18", "19", "20"))
This awk code should do it:
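awk '{ s = h = $1; sub(/-.p$/, "", h); all[h] }   # h: identifier without the -3p/-5p suffix
s ~ /-3p$/ { a[h] }                               # identifiers seen with a 3p line
s ~ /-5p$/ { b[h] }                               # identifiers seen with a 5p line
END {
    for (x in all)
        if (x in b && x in a) {   # both present: count once, drop from both sets
            ca++
            delete b[x]
            delete a[x]
        }
    printf "Both 3p and 5p:%d\n", ca
    printf "Only 3p :%d\n", length(a)
    printf "Only 5p :%d\n", length(b)
}' file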
Output:
Both 3p and 5p:3
Only 3p :0
Only 5p :1
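Since the question asked for something grep-based, here is a minimal sketch along those lines. It is not from either answer above. It assumes whitespace-separated input with the identifier in the first column, and the file names mirna.txt, ids_3p.txt and ids_5p.txt are invented for the example. grep picks out the names for each arm, sed strips the suffix, and comm compares the two sorted lists.

```shell
#!/bin/sh
# Use one collation order for both sort and comm.
export LC_ALL=C

# The seven sample rows from the question (file name is hypothetical).
cat > mirna.txt <<'EOF'
hsa-let-7a-3p 45
hsa-let-7a-5p 1148
hsa-let-7b-3p 8
hsa-let-7b-5p 184
hsa-let-7c-3p 1
hsa-let-7c-5p 258
hsa-let-7d-5p 343
EOF

# Unique identifiers that have a 3p line, with the -3p suffix removed.
awk '{ print $1 }' mirna.txt | grep -e '-3p$' | sed 's/-3p$//' | sort -u > ids_3p.txt
# Unique identifiers that have a 5p line, with the -5p suffix removed.
awk '{ print $1 }' mirna.txt | grep -e '-5p$' | sed 's/-5p$//' | sort -u > ids_5p.txt

# comm on two sorted lists: -12 keeps lines common to both,
# -23 keeps lines only in the first, -13 keeps lines only in the second.
both=$(comm -12 ids_3p.txt ids_5p.txt | wc -l)
only3=$(comm -23 ids_3p.txt ids_5p.txt | wc -l)
only5=$(comm -13 ids_3p.txt ids_5p.txt | wc -l)

printf 'Both 3p and 5p: %d\n' "$both"
printf 'Only 3p: %d\n' "$only3"
printf 'Only 5p: %d\n' "$only5"
```

For the seven sample rows this prints 3, 0 and 1, matching the output requested in the question. Dropping the wc -l from any of the comm lines lists the identifiers themselves instead of counting them.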