R: Splitting up a ":"-delimited VCF file, using a for-loop (iterating over several columns) to create multiple matrices

Why am I asking this?

Many people seem to have trouble both with splitting up VCF files and with iterating over columns in a for-loop, but I haven't come across any question that tackles the two together in a way that is relevant to a VCF file containing many samples, as explained below.

Here is an example of the data structure:

Loci    Sample1
[1]     0/1:15:55:54:49:5:9.26%:2.8371E-2:37:36:49:0:5:0
[2]     0/1:42:55:53:40:13:24.53%:5.2873E-5:34:37:40:0:13:0
[3]     0/1:15:54:54:49:5:9.26%:2.8371E-2:35:33:49:0:5:0

The question is: how do I create an eye-friendly table over many loci (rows) and multiple samples (columns), where each cell holds many output statistics separated by ":"?

I have managed to solve half of this problem:

I have developed an R script that takes the information from a single sample column and outputs a matrix with each individual statistic in its own column. The code is as follows:

data <- vcf.small

# First, create a list representing each row (locus) and separate the
# statistics; second, breakdown the list's structure but maintain data order.
split1 <-strsplit(as.character(data$Sample1),":")
split2 <- unlist(split1)

# Create a matrix: here, there are 14 values by 3 loci.
mtx1a <- matrix(split2, ncol=14, nrow=3, dimnames=list(NULL,c("GT","GQ","SDP","DP","RD","AD","FREQ","PVAL","RBQ","ABQ","RDF","RDR","ADF","ADR")), byrow=TRUE)

# Create some additional variables (columns) to add to the matrix.
sample <- matrix(rep(1,3), ncol=1, nrow=3, dimnames=list(NULL,c("SAMPLE")))
locus <- matrix(1:3, ncol=1, nrow=3, dimnames=list(NULL,c("LOCUS")))

# Add them to the matrix.
mtx1b <- cbind(mtx1a,sample)
mtx1b <- cbind(mtx1b,locus)

Voilà, the output:

     GT    GQ   SDP  DP   RD   AD   FREQ     PVAL        RBQ  ABQ  RDF  RDR ADF  ADR SAMPLE LOCUS
[1,] "0/1" "15" "55" "54" "49" "5"  "9.26%"  "2.8371E-2" "37" "36" "49" "0" "5"  "0" "1"    "1"  
[2,] "0/1" "42" "55" "53" "40" "13" "24.53%" "5.2873E-5" "34" "37" "40" "0" "13" "0" "1"    "2"  
[3,] "0/1" "15" "54" "54" "49" "5"  "9.26%"  "2.8371E-2" "35" "33" "49" "0" "5"  "0" "1"    "3" 

The 'for-loop' problem:

The output is perfect, but now I can't for the life of me figure out how to write a for-loop around the above code that creates a separate matrix for each sample. I reasoned:

for(i in names(data){
    split[i] <-strsplit(as.character(data$[i]),":")
    split[i] <- unlist(split[i])
    mtx[i]a <- matrix(split2, ncol=14, nrow=3,  
[etc etc..]
}       

The problem is that I need to create a customized variable for each sample (i.e. each column) to hold that sample's matrix. However, R will not take [i] as a placeholder, where i is the sample (column) name.

Ideally, each sample-specific variable would be named "splitSample1", "splitSample2", "splitSample3", etc. This is mainly so that the for-loop can process all the columns without my having to rewrite the code for each column name. I guess what I am trying to do is recreate the "$i" syntax from the Linux shell, but obviously that doesn't work here.

Resolving this issue would make working with very large data sets much more manageable, and I have really tried searching for work-arounds. Any help is much appreciated!

I think it is better to store the results in a data.frame or data.table, because the split columns have different classes. A matrix can store only a single class: if any one column is character, every column in the matrix becomes character.
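As a small illustration of that point, here is a base-R sketch (the two-row values are made up for the example and are not the full VCF fields):

```r
# A matrix holds a single type, so mixing text and numbers
# coerces every cell to character.
m <- cbind(GT = c("0/1", "0/1"), GQ = c(15, 42))
typeof(m)

# A data.frame keeps a separate class for each column.
df <- data.frame(GT = c("0/1", "0/1"), GQ = c(15, 42),
                 stringsAsFactors = FALSE)
sapply(df, class)
```

In the matrix, GQ can no longer be used in arithmetic without converting it back, whereas df$GQ stays numeric.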

Using the devel version of data.table, we can use tstrsplit to split the string into columns and, with type.convert=TRUE, convert each column to an appropriate class at the same time. The devel version can be installed from here.

library(data.table)#v1.9.5+
nm1 <- c('GT', 'GQ', 'SDP', 'DP', 'RD', 'AD', 'FREQ', 'PVAL', 'RBQ',
   'ABQ', 'RDF', 'RDR', 'ADF', 'ADR')

setDT(data)[, (nm1):=tstrsplit(Sample1, ':', type.convert=TRUE)][,
         Sample1:=NULL][, c('sample', 'locus'):= list(1, 1:3)][]
#    GT GQ SDP DP RD AD   FREQ       PVAL RBQ ABQ RDF RDR ADF ADR sample locus
#1: 0/1 15  55 54 49  5  9.26% 2.8371e-02  37  36  49   0   5   0      1     1
#2: 0/1 42  55 53 40 13 24.53% 5.2873e-05  34  37  40   0  13   0      1     2
#3: 0/1 15  54 54 49  5  9.26% 2.8371e-02  35  33  49   0   5   0      1     3

If there are multiple 'Sample' columns in the dataset, we can use lapply to loop over the columns and collect the split datasets in a list ('lst').

nm2 <- paste0('splitSample', 1:ncol(data2))
lst <- setNames(
       lapply(seq_len(ncol(data2)), function(i)
          setDT(list(data2[,i]))[, (nm1) := tstrsplit(V1, ":", 
             type.convert=TRUE)][, V1:=NULL][,
               c('sample', 'locus'):= list(i, 1:.N)]), 
                 nm2)
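
For readers who want an explicit for-loop, as in the question, the same idea can be sketched in base R. This is only an illustration under stated assumptions: the object names vcf, fields and split_list are hypothetical, and the input is shortened to three fields per sample. The key points are that a column is selected by name with vcf[[i]] rather than data$[i], and that the results go into a named list instead of separately named variables.

```r
# Hypothetical two-sample input, shortened to 3 fields per cell.
vcf <- data.frame(
  Sample1 = c("0/1:15:55", "0/1:42:55", "0/1:15:54"),
  Sample2 = c("0/2:15:55", "0/2:52:55", "0/2:17:54"),
  stringsAsFactors = FALSE
)
fields <- c("GT", "GQ", "SDP")

split_list <- list()                      # one element per sample
for (i in names(vcf)) {
  # vcf[[i]] selects the column whose name is stored in i
  parts <- strsplit(as.character(vcf[[i]]), ":", fixed = TRUE)
  mtx <- matrix(unlist(parts), ncol = length(fields), byrow = TRUE,
                dimnames = list(NULL, fields))
  df <- as.data.frame(mtx, stringsAsFactors = FALSE)
  # convert columns that look numeric; text such as "0/1" stays character
  df[] <- lapply(df, type.convert, as.is = TRUE)
  df$SAMPLE <- i
  df$LOCUS <- seq_len(nrow(df))
  split_list[[paste0("split", i)]] <- df  # e.g. "splitSample1"
}
names(split_list)
```

Each result is then available as split_list$splitSample1, split_list$splitSample2, and so on, without hard-coding any column name or row count.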

It is easier to keep working with the list, but if we need separate dataset objects in the global environment (not recommended), we can use list2env.

list2env(lst, envir=.GlobalEnv)
splitSample1
#    GT GQ SDP DP RD AD   FREQ      PVAL RBQ ABQ RDF RDR ADF ADR sample locus
#1: 0/1 15  55 54 49  5  9.26% 2.8371E-2  37  36  49   0   5   0      1     1
#2: 0/1 42  55 53 40 13 24.53% 5.2873E-5  34  37  40   0  13   0      1     2
#3: 0/1 15  54 54 49  5  9.26% 2.8371E-2  35  33  49   0   5   0      1     3

splitSample2
#    GT GQ SDP DP RD AD   FREQ      PVAL RBQ ABQ RDF RDR ADF ADR sample locus
#1: 0/2 15  55 55 49  5 10.26%  2.971E-2  37  32  49   0   5   0      2     1
#2: 0/2 52  55 53 40 13 22.53% 1.2873E-5  34  37  12   0  13   0      2     2
#3: 0/2 17  54 54 49 18  9.29% 3.8371E-2  42  33  49   0   5   0      2     3
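
To show why keeping the list is usually more convenient than separate global objects, here is a base-R sketch. The object lst_demo is a hypothetical stand-in for 'lst' with only two columns of made-up values per sample, so that the block runs on its own.

```r
# Hypothetical stand-in for 'lst': a named list of per-sample tables.
lst_demo <- list(
  splitSample1 = data.frame(GT = c("0/1", "0/1"), GQ = c(15L, 42L),
                            sample = 1L, locus = 1:2,
                            stringsAsFactors = FALSE),
  splitSample2 = data.frame(GT = c("0/2", "0/2"), GQ = c(15L, 52L),
                            sample = 2L, locus = 1:2,
                            stringsAsFactors = FALSE)
)

# Access one sample by name, no global variable needed.
lst_demo[["splitSample2"]]$GQ

# Apply the same summary to every sample in one call.
sapply(lst_demo, function(d) mean(d$GQ))

# Stack all samples into one long table; the 'sample' column
# records which sample each row came from.
combined <- do.call(rbind, lst_demo)
rownames(combined) <- NULL
combined
```

With data.table objects, rbindlist(lst) would be the usual way to do the stacking step.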

NOTE: Here, the input dataset is a data.frame.

data

data <- structure(list(Sample1 =
   c("0/1:15:55:54:49:5:9.26%:2.8371E-2:37:36:49:0:5:0", 
 "0/1:42:55:53:40:13:24.53%:5.2873E-5:34:37:40:0:13:0",
  "0/1:15:54:54:49:5:9.26%:2.8371E-2:35:33:49:0:5:0"
 )), .Names = "Sample1", class = "data.frame", row.names = c(NA, -3L))


 data2 <- structure(list(Sample1 =
   c("0/1:15:55:54:49:5:9.26%:2.8371E-2:37:36:49:0:5:0", 
  "0/1:42:55:53:40:13:24.53%:5.2873E-5:34:37:40:0:13:0",
  "0/1:15:54:54:49:5:9.26%:2.8371E-2:35:33:49:0:5:0"
 ), Sample2 = c("0/2:15:55:55:49:5:10.26%:2.971E-2:37:32:49:0:5:0", 
 "0/2:52:55:53:40:13:22.53%:1.2873E-5:34:37:12:0:13:0",
 "0/2:17:54:54:49:18:9.29%:3.8371E-2:42:33:49:0:5:0")),
.Names = c("Sample1", "Sample2"), class = "data.frame",
row.names = c(NA, -3L))
