Saving a text file as RData and importing it in R
SNP filtering for missing and redundant markers
I split a large text file (30 GB in size) into 40 small files on the cluster and saved them as RData. I am then importing these small RData files into R to filter them for missing and redundant SNP markers, but this gives an error.

I want to split the large file into small files, save them as RData, import them into R, and filter for missing and redundant markers.
An RData file is a binary file, while your split files are tab-separated text files, so they cannot be loaded with load(). I would suggest removing all header information and column names before you split the vcf file, and then loading the files into R with read.table(sep='\t'). Hope the above helps.
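A minimal sketch of the suggestion above: read each tab-separated chunk with read.table() and re-save it as a genuine binary RData file that load() can read back. The file-name pattern `chunk_*.txt` is hypothetical, assuming that is how the split files were named.

```r
# Hypothetical file names: chunk_01.txt ... chunk_40.txt in the working directory
for (f in list.files(pattern = "^chunk_.*\\.txt$")) {
  # Read one tab-separated chunk (no header row after the vcf header was stripped)
  geno <- read.table(f, sep = "\t", header = FALSE, stringsAsFactors = FALSE)
  # Save as a binary RData file; load() will restore the object named `geno`
  save(geno, file = sub("\\.txt$", ".RData", f))
}
```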
I don't have any idea how to save a text file as an RData file outside the R environment. But if you are concerned about the large size of your vcf file and want to filter the data before loading, I suggest using read_tsv_chunked, which loads the data one chunk at a time (controlled by chunk_size), applies some processing to each chunk, and finally makes the result available in the R environment. Also beware of skip, the number of header lines before the column names. The following is the script.
library(tidyverse)

# Callback applied to each chunk: drop duplicated (redundant) rows, then
# drop rows with more than 5% missing values
filterFun <- function(df, pos) {
  df <- unique(df)
  count.nas <- rowSums(is.na(df))
  # Logical indexing avoids the empty-index bug of df[-which(...), ]:
  # when no row exceeds the threshold, which() returns integer(0) and
  # df[-integer(0), ] would silently drop every row
  df[count.nas <= 0.05 * ncol(df), , drop = FALSE]
}

data <- read_tsv_chunked("xxx.vcf", skip = 22, chunk_size = 10000,
                         callback = DataFrameCallback$new(filterFun))
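One caveat about the script above: unique() inside the callback only removes duplicates within a single chunk, so rows duplicated across chunk boundaries survive. A final pass over the combined result (which DataFrameCallback assembles by row-binding the chunks) closes that gap:

```r
# Deduplicate across chunk boundaries after all chunks are combined
data <- unique(data)
```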