Saving a text file as RData and importing it into R

SNP filtering for missing and redundant markers

I split a large text file (30 GB) into 40 smaller files on the cluster and saved them as RData. I am then importing these small RData files into R to filter them for missing and redundant SNP markers, but this gives an error.

I want to split a large file into small files, save them as RData, import them into R, and filter for missing and redundant markers.

An RData file is a binary file, whereas your split files are tab-separated text files, so they cannot be loaded with load(). I suggest removing all header information and column names before you split the VCF file, then reading each piece into R with read.table(sep = '\t'). Hope this helps.
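A minimal sketch of that workflow, assuming the split pieces are headerless tab-separated text. The file names, the four-column layout, and the toy rows are made up for illustration. It reads a text piece with read.table(), writes a genuine RData file with save(), and restores it with load():

```r
# Toy stand-in for one split piece: tab-separated, no header (hypothetical data).
chunk_txt <- tempfile(fileext = ".txt")
writeLines(c("chr1\t101\tA\tG", "chr1\t202\tC\tT"), chunk_txt)

# Read the text piece; sep = "\t" is a single tab character.
# Explicit colClasses keep allele columns such as "T" from being guessed as logical.
chunk <- read.table(chunk_txt, sep = "\t", header = FALSE,
                    colClasses = c("character", "integer",
                                   "character", "character"))

# Only files written by save() are real RData files that load() can read.
chunk_rdata <- tempfile(fileext = ".RData")
save(chunk, file = chunk_rdata)

# load() recreates the object under its original name and returns that name.
rm(chunk)
restored <- load(chunk_rdata)
print(restored)
str(chunk)
```

If the pieces must be saved as RData on the cluster, the conversion has to run inside R there (for example via Rscript); renaming a text file to .RData does not make it loadable.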

I don't know how to convert a text file to an RData file outside of R. But if your concern is the size of the VCF file and you want to filter the data as it is loaded, I suggest read_tsv_chunked() from readr, which reads the file chunk by chunk (chunk_size rows at a time), applies a function to each chunk, and returns the combined result to your R session. Remember to set skip to the number of header lines that precede the column names. Here is the script:

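library(readr)  # read_tsv_chunked() and DataFrameCallback come from readr

# Called once per chunk; pos is the position of the chunk's first row.
filterFun <- function(df, pos) {
  df <- unique(df)                  # drops duplicates within this chunk only
  count.nas <- rowSums(is.na(df))   # number of missing values per marker
  # Logical mask instead of -which(): still correct when no row exceeds the limit.
  df[count.nas <= 0.05 * ncol(df), ]
}

# skip = 22 must match the number of header lines before the column names in your file.
data <- read_tsv_chunked("xxx.vcf", skip = 22, chunk_size = 10000,
                         callback = DataFrameCallback$new(filterFun))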
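The filtering step can be tried on a small in-memory data frame before running it on the real file. The sketch below uses made-up marker data and a hypothetical helper name, keep_markers. It selects rows with a logical mask, which stays correct even when no row exceeds the threshold; negative indexing with an empty which() result would drop every row instead.

```r
# Hypothetical helper: drop duplicate rows, then rows in which more than
# max_missing (a fraction) of the columns are missing.
keep_markers <- function(df, max_missing = 0.05) {
  df <- unique(df)
  na_count <- rowSums(is.na(df))
  df[na_count <= max_missing * ncol(df), , drop = FALSE]
}

# Made-up genotype table: row 2 duplicates row 1, row 3 has a missing call.
toy <- data.frame(
  id = c("snp1", "snp1", "snp2", "snp3"),
  s1 = c(0L, 0L, NA, 2L),
  s2 = c(1L, 1L, 1L, 0L),
  stringsAsFactors = FALSE
)

filtered <- keep_markers(toy)
print(filtered)

# Pitfall: with nothing to drop, which() is empty and x[-which(...), ] keeps no rows.
clean <- toy[c(1, 4), ]
dropped_all <- clean[-which(rowSums(is.na(clean)) > 0), ]
print(nrow(dropped_all))
```

Note that this only treats values already read as NA as missing; how missing genotypes are encoded in a given VCF file is an assumption to check before relying on the filter.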
