Subsetting data.frame in R based on names in column
My data looks like this:
BLOCK1 0 1 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 513 C A 0/1:23,12:35:99:262,0,691 19,10:-40.6,-28.8,-78.7:-11.9:6.0
2 3
BLOCK1 0 1 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 1095 G A 0/1:35,12:47:99:328,0,1157 30,11:-61.1,-63.4,-134.7:2.2:12.0
3 4
BLOCK1 0 1 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 1217 G A 0/1:22,12:34:99:314,0,730 20,10:-68.4,-54.2,-109.0:-14.2:6.0
4 5
BLOCK1 0 1 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 1219 A C 0/1:22,12:34:99:308,0,715 20,10:-69.9,-54.2,-107.7:-15.7:6.0
5 6
BLOCK1 0 1 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 1721 G C 0/1:15,6:21:99:141,0,464 7,5:-21.8,-18.5,-30.1:-3.3:4.0
6 8
BLOCK2 0 1 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 2171 G C 0/1:14,13:27:99:388,0,369 9,5:-35.3,-26.5,-46.7:-8.7:3.0
7 10
BLOCK3 1 0 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 3661 G A 0/1:148,55:203:99:1070,0,4008 107,39:-163.0,-160.9,-438.4:-2.1:33.0
8 11
BLOCK3 1 0 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 3700 C T 0/1:124,124:249:99:3271,0,3667 117,107:-510.2,-163.3,-565.9:-346.9:4.0
9 12
BLOCK3 1 0 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 3754 T C 0/1:140,107:248:99:2786,0,3946 133,101:-436.9,-85.9,-558.8:-351.0:2.0
10
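All I want is an R command that lets me compute various properties of columns 2 and 3 (the columns of 0s and 1s), and do so FOR EACH block (column 1). So, for example, in the data above BLOCK1 has 5 rows, BLOCK2 has 1 row, and so on. A basic question I would like answered: for each block, how many zeros are in column 2, and how many zeros are in column 3?
Can anyone help? I tried various forms of aggregate(), but the problem is that the FUN argument does not let me do the above. Or maybe it does, but I don't know how...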
You can use aggregate from base R:
aggregate(!df[,c("Col2", "Col3")], list(Col1=df[,"Col1"]), FUN=sum)
# Col1 Col2 Col3
#1 BLOCK1 5 0
#2 BLOCK2 1 0
#3 BLOCK3 0 3
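On the concern that FUN is too limited: FUN can be any function, including one that returns several values per group. The following is a minimal sketch, assuming a data frame df with the Col1, Col2, Col3 layout used in these answers (the data is rebuilt here so the snippet runs on its own; the names zeros and ones are illustrative).

```r
# Example data with the same layout as df in this thread
df <- data.frame(
  Col1 = c(rep("BLOCK1", 5), "BLOCK2", rep("BLOCK3", 3)),
  Col2 = c(0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L),
  Col3 = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L),
  stringsAsFactors = FALSE
)

# FUN returns a named vector, so each aggregated column becomes a
# small matrix with one sub-column per statistic
stats <- aggregate(
  cbind(Col2, Col3) ~ Col1, data = df,
  FUN = function(x) c(zeros = sum(x == 0), ones = sum(x == 1))
)

stats$Col2[, "zeros"]  # zeros in Col2 per block: 5 1 0
stats$Col3[, "zeros"]  # zeros in Col3 per block: 0 0 3
```

Comparing with `== 0` explicitly, rather than negating with `!`, also keeps the count correct if a column ever holds values other than 0 and 1.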
Or use data.table:
library(data.table)
setDT(df)[, lapply(.SD, function(x) sum(!x)), by=Col1]
# Col1 Col2 Col3
#1: BLOCK1 5 0
#2: BLOCK2 1 0
#3: BLOCK3 0 3
For the combinations, perhaps you need:
as.data.frame.matrix(table(df[,1],as.character(interaction(df[,-1]))))
# 0.1 1.0
#BLOCK1 5 0
#BLOCK2 1 0
#BLOCK3 0 3
If you want to keep only the cases in which the blocks have col2=0 AND col3=1 OR col2=1 AND col3=0, for ALL entries of a given block:
Changing the example dataset to show some variation (with the current dataset, the condition would select every row):
df$Col3[4] <- 0
df$Col2[8]<-0
df$Col3[8]<-1
df[with(df, ave(Col2==0 & Col3==1|Col2==1 & Col3==0, Col1, FUN=all)),]
# Col1 Col2 Col3
#6 BLOCK2 0 1
#7 BLOCK3 1 0
#8 BLOCK3 0 1
#9 BLOCK3 1 0
df <- structure(list(Col1 = c("BLOCK1", "BLOCK1", "BLOCK1", "BLOCK1",
"BLOCK1", "BLOCK2", "BLOCK3", "BLOCK3", "BLOCK3"), Col2 = c(0L,
0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L), Col3 = c(1L, 1L, 1L, 1L, 1L,
1L, 0L, 0L, 0L)), .Names = c("Col1", "Col2", "Col3"), class = "data.frame", row.names = c(NA,
-9L))
Using dplyr:
require(dplyr)
#dummy data
d <- read.table(text="
BLOCK1 0 1 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 513 C A 0/1:23,12:35:99:262,0,691 19,10:-40.6,-28.8,-78.7:-11.9:6.0
2 3
BLOCK1 0 1 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 1095 G A 0/1:35,12:47:99:328,0,1157 30,11:-61.1,-63.4,-134.7:2.2:12.0
3 4
BLOCK1 0 1 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 1217 G A 0/1:22,12:34:99:314,0,730 20,10:-68.4,-54.2,-109.0:-14.2:6.0
4 5
BLOCK1 0 1 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 1219 A C 0/1:22,12:34:99:308,0,715 20,10:-69.9,-54.2,-107.7:-15.7:6.0
5 6
BLOCK1 0 1 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 1721 G C 0/1:15,6:21:99:141,0,464 7,5:-21.8,-18.5,-30.1:-3.3:4.0
6 8
BLOCK2 0 1 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 2171 G C 0/1:14,13:27:99:388,0,369 9,5:-35.3,-26.5,-46.7:-8.7:3.0
7 10
BLOCK3 1 0 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 3661 G A 0/1:148,55:203:99:1070,0,4008 107,39:-163.0,-160.9,-438.4:-2.1:33.0
8 11
BLOCK3 1 0 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 3700 C T 0/1:124,124:249:99:3271,0,3667 117,107:-510.2,-163.3,-565.9:-346.9:4.0
9 12
BLOCK3 1 0 Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491 3754 T C 0/1:140,107:248:99:2786,0,3946 133,101:-436.9,-85.9,-558.8:-351.0:2.0
10 ",fill = TRUE)
#keep only rows with BLOCK names and count zeros in column 2
d %>% filter(grepl("BLOCK",V1)) %>%
group_by(BLOCK=V1) %>%
summarise(ZeroCountInCol2=sum(V2==0))
# BLOCK ZeroCountInCol2
# 1 BLOCK1 5
# 2 BLOCK2 1
# 3 BLOCK3 0
If your data.frame is named dataframe:
sapply(unique(dataframe[,1]),function(block){list(nb0_col1=sum(dataframe[dataframe[,1]==block,2]==0,na.rm=T),nb0_col2=sum(dataframe[dataframe[,1]==block,3]==0,na.rm=T))})
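The same per-block zero counts can also be obtained as a plain matrix with rowsum(), which may be easier to read than the list-matrix that sapply returns. This is only a sketch, assuming the first column of dataframe holds the block names and columns 2 and 3 hold the 0/1 values; the example data is rebuilt here for illustration.

```r
# Illustrative data: block name in column 1, 0/1 values in columns 2 and 3
dataframe <- data.frame(
  Col1 = c(rep("BLOCK1", 5), "BLOCK2", rep("BLOCK3", 3)),
  Col2 = c(0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L),
  Col3 = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L),
  stringsAsFactors = FALSE
)

# Logical matrix marking the zeros; adding 0L turns TRUE/FALSE into 1/0
# because rowsum() needs numeric input
is_zero <- (dataframe[, 2:3] == 0) + 0L

# Sum the indicator rows within each block; the result has one row per block
counts <- rowsum(is_zero, group = dataframe[, 1])
counts
```

Here counts["BLOCK1", "Col2"] is the number of zeros in column 2 for BLOCK1, and so on for the other blocks and column.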
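So the solutions above did not really give me what I wanted, but I figured out how to do it in bash. I want to parse the file and keep the blocks in which, for a given block, all col2 = 0 and all col3 = 1, or all col2 = 1 and all col3 = 0. I also want to count the number of such blocks. The following command works: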
sed ':a;N;$!ba;s/\nBLOCK/\n\nBLOCK/g;s/$/\n/' input.file | awk 'BEGIN{fg=0; num=0; block=""; type="";} {if(/^$/ && fg==1){print block; num++; fg=0; block="";} else if(/^$/){block=""; type=""; fg=0;} else if(/^BLOCK/){block=block"\n"$_; total++;} else if(fg==0){type=$2" "$3; fg=1; block=block"\n"$_;} else if(fg==1 && type==$2" "$3){block=block"\n"$_;} else if(fg==1 && type!=$2" "$3){fg=2; block=""}} END{print "perfect:\t",num,"\tTotal:\t",total}' > output.file
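For comparison, the filter described above (every row of a block is 0/1, or every row is 1/0) can be sketched in base R. This is not a translation of the awk script. It assumes the data is already in a data frame df with the Col1, Col2, Col3 columns used in the answers, and the names is_uniform, keep and perfect are made up for illustration. Unlike the ave() answer, a block that mixes 0/1 rows with 1/0 rows is rejected here.

```r
# Modified example data: BLOCK1 has one 0/0 row, BLOCK3 mixes 1/0 and 0/1
df <- data.frame(
  Col1 = c(rep("BLOCK1", 5), "BLOCK2", rep("BLOCK3", 3)),
  Col2 = c(0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L),
  Col3 = c(1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L),
  stringsAsFactors = FALSE
)

# TRUE when the whole block is 0/1, or the whole block is 1/0
is_uniform <- function(a, b) {
  (all(a == 0) && all(b == 1)) || (all(a == 1) && all(b == 0))
}

# One logical per block, named by block
keep <- vapply(
  split(df[, c("Col2", "Col3")], df$Col1),
  function(g) is_uniform(g$Col2, g$Col3),
  logical(1)
)

perfect <- names(keep)[keep]           # blocks that pass the filter
result  <- df[df$Col1 %in% perfect, ]  # rows belonging to those blocks

cat("perfect:", length(perfect), "Total:", length(keep), "\n")
```

With this modified data only BLOCK2 passes, so the sketch reports 1 perfect block out of 3.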