Is there any Python/Unix command specifically to delete a line after reading a file? (I'm dealing with a 64.2 GB file)

I'm a bioinformatician and I'm dealing with a very large text file. The size of the file is 64.2 GB. I ran a word count on the file and, after quite some time, got this result:

1052454251 1052456168 64199706147 GRCh38.fa

My problem is that I want to delete only a few lines from this file. I searched Google for Python commands to delete specific lines from a text file, but almost all of the results suggest rewriting the complete file while skipping only the part to be deleted.

I had followed the same approach even before searching Google, but initially I ran into memory issues (I used readlines() and tried to read the entire file in one go, so my system hung). Then I read the file one line at a time and kept only the lines I need. However, this approach is very time-consuming since my input file is 64 GB.

Can anyone suggest a specific Python/Unix command that deletes only specific lines from a file?

For more information about my problem: my input file contains a complete human genome sequence in FASTA format. It looks like this:

>chr1 and some description
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
>chr2 and some description
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
>chr3 and some description
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
.
.
.
>chr22 and some description
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
>chrM and some description
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
>chrX and some description
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
>chrY and some description
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....

Here, a line starting with ">" is called a header. The rest of the lines contain the actual DNA sequence.

Immediately after ">" comes the chromosome name (chr1, chr2, chr3, ..., chrM, chrX, or chrY). I just want to delete the chrM header line and the DNA sequence lines below it. So my output file should look like this:

>chr1 and some description
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
>chr2 and some description
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
>chr3 and some description
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
.
.
.
.
>chr22 and some description
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
>chrX and some description
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
>chrY and some description
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....
ATTAGATCGGCTGATG....

Here is the code that I wrote.

from memory_profiler import profile

@profile  # This is just to check how much memory is used.
def memcheck():
    to_write = True
    # Both files are closed automatically when the with block exits.
    with open("GRCh38.fa") as f, open("chrMremoved.fa", "w") as g:
        for line in f:
            if line[0] == ">":
                # Header line: stop writing at chrM, resume at any other record.
                to_write = line[0:5] != ">chrM"
            if to_write:
                g.write(line)

if __name__ == "__main__":
    memcheck()
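
One possible variation on the loop above, as a minimal sketch: open both files in binary mode so Python skips text decoding, and compare the sequence ID (the token between ">" and the first whitespace) exactly rather than by prefix. The function name `drop_record`, its parameters, and the 1 MiB buffer size are assumptions made for illustration, and whether this is actually faster on a 64 GB file has not been measured here.

```python
def drop_record(src_path, dst_path, record_id, buffer_size=1024 * 1024):
    """Copy a FASTA file, skipping the record whose ID equals record_id.

    Hypothetical helper: the names and the buffer size are illustrative only.
    Returns the number of lines that were skipped.
    """
    target = record_id.encode("ascii")
    skipped = 0
    keep = True
    with open(src_path, "rb", buffering=buffer_size) as src, \
         open(dst_path, "wb", buffering=buffer_size) as dst:
        for line in src:
            if line.startswith(b">"):
                # The ID is the first whitespace-delimited token after ">".
                fields = line[1:].split(None, 1)
                keep = not fields or fields[0] != target
            if keep:
                dst.write(line)
            else:
                skipped += 1
    return skipped
```

It could be called as `drop_record("GRCh38.fa", "chrMremoved.fa", "chrM")`. The whole file is still read and rewritten once, so the run time stays bound by disk throughput.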

Also, since most of my work revolves around analysing huge data sets, it would be helpful if someone could suggest some tips on how to deal with large data sets in Python (like writing memory-efficient and time-efficient code).

Sometimes my Python code gets killed; I searched Google and found that this happens because the machine runs out of RAM. Please guide me on what I can do in such situations.
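
On the memory side, a common remedy is to avoid materialising the whole file (as readlines() does) and to keep only running totals. A minimal sketch, assuming a hypothetical helper `record_lengths` that yields one (ID, sequence length) pair per FASTA record while holding a single line in memory at a time:

```python
def record_lengths(path):
    """Yield (record_id, sequence_length) for each record in a FASTA file.

    Hypothetical helper for illustration; only one line is held at a time.
    """
    current_id = None
    length = 0
    with open(path) as handle:
        for line in handle:  # the file object yields lines lazily
            if line.startswith(">"):
                if current_id is not None:
                    yield current_id, length
                fields = line[1:].split()
                current_id = fields[0] if fields else ""
                length = 0
            elif current_id is not None:
                # Count sequence characters, ignoring the trailing newline.
                length += len(line.strip())
    # Emit the final record, which has no following header to trigger it.
    if current_id is not None:
        yield current_id, length
```

Iterating with `for name, n in record_lengths("GRCh38.fa"): print(name, n)` keeps memory use roughly constant regardless of file size, because nothing but the current line and two counters is retained.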

Use a dedicated tool like seqkit to filter your sequences by header pattern.

seqkit grep -v -p chrM GRCh38.fa -o chrMremoved.fa
