How to populate bigstatsr::FBM with sqlite database for later consumption?

I'm a newbie to the bigstatsr package. I have a sqlite database which I want to convert to an FBM matrix of 40k rows (genes) × 60k columns (samples) for later consumption. I found examples of how to populate the matrix with random values, but I'm not sure what the best way would be to populate it with values from my sqlite database.

Currently I do it sequentially; here is some mock code:

library(bigstatsr)
library(RSQLite)
library(dplyr)

number_genes <- 50e3
number_samples <- 70e3

large_genomic_matrix <- bigstatsr::FBM(nrow = number_genes, 
                                       ncol = number_samples, 
                                       type = "double", 
                                       backingfile = "fbm_large_genomic_matrix")

# Code to get a single df at the time
database_connection <- dbConnect(RSQLite::SQLite(), "database.sqlite")


sample_index_counter <- 1

for(current_sample in vector_with_sample_names){
  
  sqlite_df <- dplyr::tbl(database_connection, "genomic_data") %>%
    dplyr::filter(sample == current_sample) %>% 
    dplyr::collect()
  
  large_genomic_matrix[, sample_index_counter] <- sqlite_df$value
  sample_index_counter <- sample_index_counter + 1
  
}

big_write(large_genomic_matrix, "large_genomic_matrix.out", every_nrow = 1000, progress = interactive())

I have two questions:

  1. Is there a way to populate the matrix more efficiently? I'm not sure whether big_apply could be used here, or perhaps foreach.
  2. Do I always have to use big_write in order to load my matrix later? If so, why can't I just use the .bk file?

Thanks in advance.

That is a very good first try on your own.

  1. What is inefficient here is running dplyr::filter(sample == current_sample) for every single sample. I would first use match() to get the indices. Then, populating each column individually is also a bit inefficient; as you said, you could use big_apply() to do this by blocks.

  2. big_write() is for writing the FBM to some text file (e.g. csv). What you want here is to use FBM()$save() (second line of the example in the README), and then use big_attach() on the .rds file (next line of the README).
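To make point 1 concrete, here is a minimal sketch of filling the FBM by blocks of columns with big_apply(). It assumes a table named genomic_data with columns sample, gene, and value (the names from your mock code); the function name fill_fbm_from_sqlite and the gene_names vector are hypothetical, introduced only for illustration. Each block issues one query for all of its samples instead of one query per sample, and match() converts gene/sample names to indices once per block instead of filtering repeatedly.

```r
library(bigstatsr)
library(RSQLite)

# Fill an FBM block of columns at a time from a SQLite table.
# Assumes a table `genomic_data` with columns `sample`, `gene`, `value`.
fill_fbm_from_sqlite <- function(X, con, sample_names, gene_names,
                                 block.size = 1000) {
  big_apply(X, a.FUN = function(X, ind) {
    block_samples <- sample_names[ind]
    # One query for the whole block of samples instead of one per sample
    query <- sprintf(
      "SELECT sample, gene, value FROM genomic_data WHERE sample IN (%s)",
      paste(sprintf("'%s'", block_samples), collapse = ", "))
    df <- DBI::dbGetQuery(con, query)
    # match() turns names into row/column indices once per block
    block <- matrix(0, nrow(X), length(ind))
    block[cbind(match(df$gene, gene_names),
                match(df$sample, block_samples))] <- df$value
    X[, ind] <- block
    NULL
  }, a.combine = 'c', ind = seq_along(sample_names),
     block.size = block.size)
  invisible(X)
}
```

You would call it with your large_genomic_matrix, database_connection, vector_with_sample_names, and a vector of gene names in the row order you want; tune block.size to how many columns fit comfortably in RAM at once.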
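And for point 2, a short sketch of the save/attach workflow, reusing the names from the question. The .bk file alone stores only the raw data; the .rds written by $save() carries the metadata (dimensions, type, backingfile path) that big_attach() needs.

```r
library(bigstatsr)

# Create the FBM and immediately save its .rds metadata next to the .bk file
X <- FBM(nrow = number_genes, ncol = number_samples, type = "double",
         backingfile = "fbm_large_genomic_matrix")$save()

# ... fill X as above ...

# In a later R session, reattach without rebuilding anything:
X2 <- big_attach("fbm_large_genomic_matrix.rds")
```

So there is no need for big_write() at all unless you actually want a text export of the matrix.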
