
Design matrix for genotypes in R

I'm looking for an efficient way to create a "parametrized" design matrix for genotypes in R. I have a large file (about 3 GB) with animals and their genotypes. Sample data looks like this:

snp id a1 a2 code
snp1 an1 A A 0
snp1 an2 A B 1
snp1 an3 B B -1
snp2 an1 A B 1
snp2 an2 A A 0
snp2 an3 B B -1

snp is the name of the SNP (every animal has every SNP), id is the animal's ID (each animal has a unique ID), a1 is allele 1, a2 is allele 2, and code denotes the genotype based on the alleles: if the animal has two A's its code is 0, if it has AB its code is 1, and if it has BB its code is -1.
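The coding rule can be written down as a small function. This is only a sketch: `genotype_code` is a hypothetical helper name, and it assumes the alleles are always the strings "A" or "B" and that BA is treated the same as AB, which the sample data does not show.

```r
# Hypothetical helper: derive the genotype code from the two allele columns,
# following the sample data (AA = 0, AB = 1, BB = -1).
genotype_code <- function(a1, a2) {
  n_b <- (a1 == "B") + (a2 == "B")  # number of B alleles: 0, 1 or 2
  c(0L, 1L, -1L)[n_b + 1L]          # look up the code for 0, 1, 2 B alleles
}

genotype_code(c("A", "A", "B"), c("A", "B", "B"))
#> [1]  0  1 -1
```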

Based on that data I need to create a design matrix with animals in the rows (the id column), SNPs in the columns (the snp column), and in each cell (the intersection of a row and a column) the value from the code column. In the end it should look like this:

an1 0 1
an2 1 0
an3 -1 -1

I know that R has limitations when it comes to efficiency and speed, but I still need the fastest solution I can get in R.

The data.table package usually performs well in this type of case. Example below:

library(data.table)
#> Warning: package 'data.table' was built under R version 4.1.1

df <- fread(text = "snp id a1 a2 code
snp1 an1 A A 0
snp1 an2 A B 1
snp1 an3 B B -1
snp2 an1 A B 1
snp2 an2 A A 0
snp2 an3 B B -1")

dcast(df, id ~ snp, value.var = "code")
#>     id snp1 snp2
#> 1: an1    0    1
#> 2: an2    1    0
#> 3: an3   -1   -1

Created on 2021-10-13 by the reprex package (v2.0.1)
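The example above reads from inline text. For the real file, something like the following might work. It is a sketch that assumes a whitespace-delimited file with the same header as the sample; the temporary file only stands in for the actual path, and `geno` and `wide` are illustrative names. The `select` argument of `fread` reads only the three columns the reshape needs, which should reduce memory use on a 3 GB input.

```r
library(data.table)

# Stand-in for the real genotype file; the path and layout are assumptions.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "snp id a1 a2 code",
  "snp1 an1 A A 0",
  "snp1 an2 A B 1",
  "snp1 an3 B B -1",
  "snp2 an1 A B 1",
  "snp2 an2 A A 0",
  "snp2 an3 B B -1"
), path)

# Read only the columns needed for the reshape; a1 and a2 are skipped.
geno <- fread(path, select = c("snp", "id", "code"))

# Long to wide: one row per animal, one column per SNP.
wide <- dcast(geno, id ~ snp, value.var = "code")
```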

If you need the output as a matrix, you could use:

cast <- dcast(df, id ~ snp, value.var = "code")
mat <- as.matrix(cast[, -"id"])
rownames(mat) <- cast$id
mat
#>     snp1 snp2
#> an1    0    1
#> an2    1    0
#> an3   -1   -1
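A shorter variant of the same conversion, assuming a reasonably recent data.table version in which `as.matrix()` for data.tables accepts a `rownames` argument: the named column becomes the row names and is dropped from the matrix in one step.

```r
library(data.table)

df <- fread(text = "snp id a1 a2 code
snp1 an1 A A 0
snp1 an2 A B 1
snp1 an3 B B -1
snp2 an1 A B 1
snp2 an2 A A 0
snp2 an3 B B -1")

cast <- dcast(df, id ~ snp, value.var = "code")

# Use the id column as row names instead of setting them afterwards.
mat <- as.matrix(cast, rownames = "id")
```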

For roughly 3 GB of data in memory, you might expect the reshape to take about 10 seconds:

library(data.table)
#> Warning: package 'data.table' was built under R version 4.1.1

# Setting up larger data
df <- expand.grid(
  snp = paste0("snp", 1:10000),
  id  = paste0("an", 1:10000)
)
df$a1 <- sample(c("A", "B"), nrow(df), replace = TRUE)
df$a2 <- sample(c("A", "B"), nrow(df), replace = TRUE)
df$code <- with(df, dplyr::case_when(
  a1 == "A" & a2 == "A" ~ 0,
  a1 == "B" & a2 == "B" ~ -1,
  TRUE ~ 1
))
setDT(df)

# How big is this data?
format(object.size(df), "Gb")
#> [1] "3 Gb"

# How fast does the function run?
bench::mark(
  dcast(df, id ~ snp, value.var = "code")
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 x 6
#>   expression                                   min   median `itr/sec` mem_alloc
#>   <bch:expr>                              <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 dcast(df, id ~ snp, value.var = "code")    9.32s    9.32s     0.107    6.71GB
#> # ... with 1 more variable: gc/sec <dbl>

Created on 2021-10-13 by the reprex package (v2.0.1)
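The benchmark reports about 6.7 GB allocated by dcast. If memory turns out to be the tighter constraint, one alternative is to preallocate an integer matrix and fill it by index. This is a different technique from the dcast approach above and has not been benchmarked here. The sketch assumes every (id, snp) pair occurs at most once; `ids`, `snps` and `mat` are illustrative names.

```r
library(data.table)

df <- fread(text = "snp id a1 a2 code
snp1 an1 A A 0
snp1 an2 A B 1
snp1 an3 B B -1
snp2 an1 A B 1
snp2 an2 A A 0
snp2 an3 B B -1")

ids  <- unique(df$id)   # row labels, in order of first appearance
snps <- unique(df$snp)  # column labels, in order of first appearance

# Preallocate the result; NA marks (animal, SNP) pairs missing from the data.
mat <- matrix(NA_integer_, nrow = length(ids), ncol = length(snps),
              dimnames = list(ids, snps))

# A two-column index matrix assigns each code to its (row, column) cell.
mat[cbind(match(df$id, ids), match(df$snp, snps))] <- df$code
```

Any cell still NA afterwards points to an animal without a record for that SNP, so `anyNA(mat)` doubles as a quick completeness check.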
