Saving a text file as RData and importing it in R
SNP filtering for missing and redundant markers
I split a large text file (30 GB in size) into 40 small files on the cluster and saved them as RData. I am then importing these small RData files into R to filter them for missing and redundant SNP markers, but this gives an error.

I want to split the large file into small files, save them as RData, import them into R, and filter for missing and redundant markers.
An RData file is a binary file, while your split files are tab-separated text files, so they cannot be loaded with load(). I would suggest removing all header information and column names before you split the vcf file, and then loading the files into R with read.table(sep='\t'). Hope the above helps.
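A minimal sketch of the suggestion above: read each tab-separated chunk with read.table() and re-save it as a genuine binary RData file that load() can read back. The file-name pattern `chunk_*.txt` is hypothetical, assuming that is how the split files were named.

```r
# Hypothetical file names: chunk_01.txt ... chunk_40.txt in the working directory
for (f in list.files(pattern = "^chunk_.*\\.txt$")) {
  # Read one tab-separated chunk (no header row after the vcf header was stripped)
  geno <- read.table(f, sep = "\t", header = FALSE, stringsAsFactors = FALSE)
  # Save as a binary RData file; load() will restore the object named `geno`
  save(geno, file = sub("\\.txt$", ".RData", f))
}
```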
I don't have any idea how to save a text file as an RData file outside the R environment. But if you are concerned about the large size of your vcf file and want to filter the data before loading, I suggest using read_tsv_chunked, which loads the data one chunk at a time (controlled by chunk_size), applies some processing to each chunk, and finally makes the result available in the R environment. Also beware of skip, the number of header lines before the column names. The following is the script.
library(tidyverse)

# Callback applied to each chunk: drop duplicated (redundant) rows, then
# drop rows with more than 5% missing values
filterFun <- function(df, pos) {
  df <- unique(df)
  count.nas <- rowSums(is.na(df))
  # Logical indexing avoids the empty-index bug of df[-which(...), ]:
  # when no row exceeds the threshold, which() returns integer(0) and
  # df[-integer(0), ] would silently drop every row
  df[count.nas <= 0.05 * ncol(df), , drop = FALSE]
}

data <- read_tsv_chunked("xxx.vcf", skip = 22, chunk_size = 10000,
                         callback = DataFrameCallback$new(filterFun))
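One caveat about the script above: unique() inside the callback only removes duplicates within a single chunk, so rows duplicated across chunk boundaries survive. A final pass over the combined result (which DataFrameCallback assembles by row-binding the chunks) closes that gap:

```r
# Deduplicate across chunk boundaries after all chunks are combined
data <- unique(data)
```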