Subsetting a data.frame in R by the names in a column

Question

My data looks like this:

BLOCK1  0   1   Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491    513 C   A   0/1:23,12:35:99:262,0,691   19,10:-40.6,-28.8,-78.7:-11.9:6.0
2   3   

BLOCK1  0   1   Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491    1095    G   A   0/1:35,12:47:99:328,0,1157  30,11:-61.1,-63.4,-134.7:2.2:12.0
3   4   

BLOCK1  0   1   Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491    1217    G   A   0/1:22,12:34:99:314,0,730   20,10:-68.4,-54.2,-109.0:-14.2:6.0
4   5   

BLOCK1  0   1   Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491    1219    A   C   0/1:22,12:34:99:308,0,715   20,10:-69.9,-54.2,-107.7:-15.7:6.0
5   6   

BLOCK1  0   1   Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491    1721    G   C   0/1:15,6:21:99:141,0,464    7,5:-21.8,-18.5,-30.1:-3.3:4.0
6   8   

BLOCK2  0   1   Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491    2171    G   C   0/1:14,13:27:99:388,0,369   9,5:-35.3,-26.5,-46.7:-8.7:3.0
7   10  

BLOCK3  1   0   Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491    3661    G   A   0/1:148,55:203:99:1070,0,4008   107,39:-163.0,-160.9,-438.4:-2.1:33.0
8   11  

BLOCK3  1   0   Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491    3700    C   T   0/1:124,124:249:99:3271,0,3667  117,107:-510.2,-163.3,-565.9:-346.9:4.0
9   12  

BLOCK3  1   0   Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491    3754    T   C   0/1:140,107:248:99:2786,0,3946  133,101:-436.9,-85.9,-558.8:-351.0:2.0
10

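What I want is an R command that lets me compute various properties of columns 2 and 3 (the columns of 0s and 1s), and does so FOR EACH block (column 1). For example, BLOCK1 has 5 rows, BLOCK2 has 1 row, and so on. A basic question I want answered: for each block, how many zeros are in column 2, and how many zeros are in column 3?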

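Can anyone help? I tried various forms of aggregate(), but the problem is that the FUN argument does not seem to let me do the above. Or maybe it does, and I just don't know how...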

Answer 1

You can use aggregate from base R:

 aggregate(!df[,c("Col2", "Col3")], list(Col1=df[,"Col1"]), FUN=sum)
 #     Col1 Col2 Col3
 #1  BLOCK1    5    0
 #2  BLOCK2    1    0
 #3  BLOCK3    0    3

Or use data.table:

  library(data.table)
  setDT(df)[, lapply(.SD, function(x) sum(!x)), by=Col1]
  #    Col1 Col2 Col3
  #1: BLOCK1    5    0
  #2: BLOCK2    1    0
  #3: BLOCK3    0    3
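
Both versions rely on the same step: negating a 0/1 column turns each 0 into TRUE, and sum() then counts the TRUE values. A minimal base R sketch of that step, using a hypothetical vector x that stands in for one block's Col2 values (it is not part of the original data):

```r
# Hypothetical 0/1 vector standing in for one block's Col2 values
x <- c(0L, 0L, 1L, 0L)

# !x coerces to logical and negates: 0 -> TRUE, non-zero -> FALSE
is_zero <- !x

# sum() treats TRUE as 1, so this is the number of zeros
n_zero <- sum(is_zero)

# Equivalent, more explicit spelling of the same count
n_zero_explicit <- sum(x == 0)
```

If the columns can contain NA, both spellings return NA unless na.rm = TRUE is passed to sum().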

Update

For the combinations, perhaps you need:

   as.data.frame.matrix(table(df[,1],as.character(interaction(df[,-1]))))
   #       0.1 1.0
   #BLOCK1   5   0
   #BLOCK2   1   0
   #BLOCK3   0   3

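Update 2

If you want to keep only cases in which the blocks have col2=0 AND col3=1, OR col2=1 AND col3=0, for ALL entries of a given block:

First change the example dataset to show some variation (in the current dataset, the condition would select all rows):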

  df$Col3[4] <- 0
  df$Col2[8]<-0
  df$Col3[8]<-1
  df[with(df, ave(Col2==0 & Col3==1|Col2==1 & Col3==0, Col1, FUN=all)),]
  #   Col1 Col2 Col3
  #6 BLOCK2    0    1
  #7 BLOCK3    1    0
  #8 BLOCK3    0    1
  #9 BLOCK3    1    0

Data

df <-    structure(list(Col1 = c("BLOCK1", "BLOCK1", "BLOCK1", "BLOCK1", 
"BLOCK1", "BLOCK2", "BLOCK3", "BLOCK3", "BLOCK3"), Col2 = c(0L, 
0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L), Col3 = c(1L, 1L, 1L, 1L, 1L, 
1L, 0L, 0L, 0L)), .Names = c("Col1", "Col2", "Col3"), class = "data.frame", row.names = c(NA, 
-9L))

Answer 2

Using dplyr:

require(dplyr)

#dummy data
d <- read.table(text="
                BLOCK1  0   1   Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491    513 C   A   0/1:23,12:35:99:262,0,691   19,10:-40.6,-28.8,-78.7:-11.9:6.0
2   3   

BLOCK1  0   1   Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491    1095    G   A   0/1:35,12:47:99:328,0,1157  30,11:-61.1,-63.4,-134.7:2.2:12.0
3   4   

BLOCK1  0   1   Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491    1217    G   A   0/1:22,12:34:99:314,0,730   20,10:-68.4,-54.2,-109.0:-14.2:6.0
4   5   

BLOCK1  0   1   Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491    1219    A   C   0/1:22,12:34:99:308,0,715   20,10:-69.9,-54.2,-107.7:-15.7:6.0
5   6   

BLOCK1  0   1   Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491    1721    G   C   0/1:15,6:21:99:141,0,464    7,5:-21.8,-18.5,-30.1:-3.3:4.0
6   8   

BLOCK2  0   1   Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491    2171    G   C   0/1:14,13:27:99:388,0,369   9,5:-35.3,-26.5,-46.7:-8.7:3.0
7   10  

BLOCK3  1   0   Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491    3661    G   A   0/1:148,55:203:99:1070,0,4008   107,39:-163.0,-160.9,-438.4:-2.1:33.0
8   11  

BLOCK3  1   0   Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491    3700    C   T   0/1:124,124:249:99:3271,0,3667  117,107:-510.2,-163.3,-565.9:-346.9:4.0
9   12  

BLOCK3  1   0   Locus_540_Transcript_32_Length_8324_genewise_newlength_8215__CDS__3870__6491    3754    T   C   0/1:140,107:248:99:2786,0,3946  133,101:-436.9,-85.9,-558.8:-351.0:2.0
10  ",fill = TRUE)

#keep only rows with BLOCK names and count zeros in column 2
d %>% filter(grepl("BLOCK",V1)) %>%
  group_by(BLOCK=V1) %>%
  summarise(ZeroCountInCol2=sum(V2==0))

# BLOCK ZeroCountInCol2
# 1 BLOCK1               5
# 2 BLOCK2               1
# 3 BLOCK3               0
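
The question also asks for the zeros in column 3. A possible extension of the same pipeline, sketched on a small stand-in data frame (the values below are made up for illustration; V1 to V3 mirror read.table's default column names), adds a second expression to summarise():

```r
library(dplyr)

# Small stand-in for the parsed data; values are illustrative only
d <- data.frame(
  V1 = c("BLOCK1", "BLOCK1", "BLOCK2", "BLOCK3", "BLOCK3"),
  V2 = c(0L, 0L, 0L, 1L, 1L),
  V3 = c(1L, 1L, 1L, 0L, 0L)
)

# One row per block, with the number of zeros in each of the two columns
zero_counts <- d %>%
  filter(grepl("BLOCK", V1)) %>%
  group_by(BLOCK = V1) %>%
  summarise(
    ZeroCountInCol2 = sum(V2 == 0),
    ZeroCountInCol3 = sum(V3 == 0)
  )
```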

Answer 3

If the data.frame is named dataframe:

sapply(unique(dataframe[,1]), function(block) {
  # rows belonging to the current block
  rows <- dataframe[,1] == block
  list(nb0_col2 = sum(dataframe[rows, 2] == 0, na.rm = TRUE),
       nb0_col3 = sum(dataframe[rows, 3] == 0, na.rm = TRUE))
})

Answer 4

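So the solutions above did not really give me what I wanted, but I figured out how to do it in bash. I wanted to parse the file and keep the blocks for which, within a given block, either all col2 = 0 and all col3 = 1, or all col2 = 1 and all col3 = 0. I also wanted to count the number of such blocks. The following command works: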

sed ':a;N;$!ba;s/\nBLOCK/\n\nBLOCK/g;s/$/\n/' input.file | awk 'BEGIN{fg=0; num=0; block=""; type="";} {if(/^$/ && fg==1){print block; num++; fg=0; block="";} else if(/^$/){block=""; type=""; fg=0;} else if(/^BLOCK/){block=block"\n"$0; total++;} else if(fg==0){type=$2""$3; fg=1; block=block"\n"$0;} else if(fg==1 && type==$2""$3){block=block"\n"$0;} else if(fg==1 && type!=$2""$3){fg=2; block=""}} END{print "perfect:\t",num,"\tTotal:\t",total}' > output.file
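
The same condition can probably be expressed in R as well. The sketch below is one possible base R version, not the method used above; it assumes a data frame with the Col1/Col2/Col3 layout from Answer 1, and the helper name is_perfect and the example values are hypothetical. It is stricter than the ave() filter in Answer 1's Update 2, because every row of a block must share the same pattern.

```r
# Example data: BLOCK1 is uniformly (0,1), BLOCK2 mixes patterns, BLOCK3 is uniformly (1,0)
df <- data.frame(
  Col1 = c("BLOCK1", "BLOCK1", "BLOCK2", "BLOCK2", "BLOCK3"),
  Col2 = c(0L, 0L, 0L, 1L, 1L),
  Col3 = c(1L, 1L, 1L, 0L, 0L)
)

# TRUE when every row of a block shares one pattern, either (0,1) or (1,0)
is_perfect <- function(col2, col3) {
  all(col2 == 0 & col3 == 1) || all(col2 == 1 & col3 == 0)
}

# Evaluate the condition once per block; the result is a named logical vector
perfect <- sapply(split(df, df$Col1), function(b) is_perfect(b$Col2, b$Col3))

# Rows of the blocks that pass, plus the number of passing blocks and of all blocks
kept <- df[df$Col1 %in% names(perfect)[perfect], ]
n_perfect <- sum(perfect)
n_total <- length(perfect)
```

Here kept holds the rows of BLOCK1 and BLOCK3, while BLOCK2 is dropped because its two rows follow different patterns.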

4 answers

Answer 1
2 2014-10-16 12:28:59

Answer 2
1 2014-10-16 12:35:42

Answer 3
0 2014-10-16 11:50:53

Answer 4
0 2014-10-21 15:55:48