
Editing lines and removing lines from file

I have a file of accession numbers and 16S rRNA sequences. I'm trying to remove all the RNA sequence lines and keep only the lines with the accession numbers and species names (removing all the junk in between). My input file looks like this (each accession number is preceded by >):

> D50541 1 1409 1409bp rna Abiotrophia defectiva Aerococcaceae
CUGGCGGCGU GCCUAAUACA UGCAAGUCGA ACCGAAGCAU CUUCGGAUGC UUAGUGGCGA ACGGGUGAGU AACACGUAGA UAACCUACCC UAGACUCGAG GAUAACUCCG GGAAACUGGA GCUAAUACUG GAUAGGAUAU AGAGAUAAUU UCUUUAUAUU (... and many more lines)

> AY538167 1 1411 1411bp rna Acholeplasma hippikon Acholeplasmataceae
CUGGCGGCGU GCCUAAUACA UGCAAGUCGA ACGCUCUAUA GCAAUAUAGG GAGUGGCGAA CGGGUGAGUA ACACGUAGAU AACCUACCCU UACUUCGAGG AUAACUUCGG GAAACUGGAG CUAAUACUGG AUAGGACAUA UUGAGGCAUC UUAAUAUGUU ...

I want my output to look like this:

>D50541 Abiotrophia defectiva Aerococcaceae

>AY538167 Acholeplasma hippikon Acholeplasmataceae

The code I wrote does what I want... for most of the lines. It looks like this:

    #!/usr/bin/env python

    # take LTPs111.compressed fasta and reduce to accession numbers with names.
    import re
    infilename = 'LTPs111.compressed.fasta'
    outfilename = 'acs.fasta'

    regex = re.compile(r'(>)\s(\w+).+[rna]\s+([A-Z].+)')    

    #remove extra letters and spaces
    with open(infilename, 'r') as infile, open(outfilename, 'w') as outfile:
        for line in infile:
            x = regex.sub(r'\1\2 \3', line)
    #remove rna sequences
        for line in x:
            if '>' in line:
                outfile.write(x)

Sometimes the code seems to skip over some of the names. For example, for the first accession number above, I only got back:

>D50541 Aerococcaceae

Why might my code be doing this? The input for each accession number looks identical, and the spacing between 'rna' and the first name is the same on every line (5 spaces).

Thank you to anyone who might have some ideas!

I still haven't been able to run your code to get the claimed results, but I think I know what the problem is:

>>> import re
>>> line = '> AY538167 1 1411 1411bp rna Acholeplasma hippikon Acholeplasmataceae'
>>> regex = re.compile(r'(>)\s(\w+).+[rna]\s+([A-Z].+)')
>>> regex.findall(line)
[('>', 'AY538167', 'Acholeplasmataceae')]

The problem is that [rna]\s+ matches any one of the characters r, n, or a followed by whitespace, so it can match at the end of any word that ends in one of those letters. And, because all of the quantifiers are greedy, with no lookahead or anything else to prevent it, the .+ runs as far right as it can, which means [rna] matches the n at the end of hippikon.

The simple solution is to remove the brackets, so it matches the string rna:

>>> regex = re.compile(r'(>)\s(\w+).+rna\s+([A-Z].+)')

That won't work if any of your species or genera can end with that string. Are there any such names? If so, you need to come up with a better way to describe the cutoff between the 1409bp part and the rna part. The simplest may be to just look for rna surrounded by spaces:

>>> regex = re.compile(r'(>)\s(\w+).+\s+rna\s+([A-Z].+)')

Whether this is actually correct or not, I can't say without knowing more about the format, but hopefully you understand what I'm doing well enough to verify that it's correct (or at least to ask smarter questions than I can ask).
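Putting the fixed pattern to work, here is a minimal sketch of the whole clean-up, assuming header lines start with > and sequence lines never do. The helper name reduce_headers and the in-memory sample are illustrative only and do not come from the question.

```python
import re

# Pattern from the answer above: the literal word "rna" surrounded by whitespace.
HEADER = re.compile(r'(>)\s(\w+).+\s+rna\s+([A-Z].+)')

def reduce_headers(lines):
    """Yield rewritten header lines; sequence lines and blanks are skipped."""
    for line in lines:
        if line.startswith('>'):
            yield HEADER.sub(r'\1\2 \3', line.rstrip('\n'))

# Hypothetical sample mirroring the question's format (5 spaces after rna).
sample = [
    '> D50541 1 1409 1409bp rna     Abiotrophia defectiva Aerococcaceae\n',
    'CUGGCGGCGU GCCUAAUACA UGCAAGUCGA\n',
    '> AY538167 1 1411 1411bp rna     Acholeplasma hippikon Acholeplasmataceae\n',
]
for header in reduce_headers(sample):
    print(header)
```

To run it on the real data, pass the open input file instead of sample and write each result plus a newline to the output file. Filtering and rewriting in a single loop also avoids the second loop in the question, which iterates over the characters of the last rewritten line rather than over the lines of the file.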


It may help with debugging to add capture groups. For example, instead of this:

(>)\s(\w+).+[rna]\s+([A-Z].+)

… search for this:

(>)(\s)(\w+)(.+[rna]\s+)([A-Z].+)

Obviously your desired capture groups are now \1\3 \5 instead of \1\2 \3 … but the big thing is that you can see what got matched in \4:

[('>', ' ', 'AY538167', ' 1 1411 1411bp rna Acholeplasma hippikon ', 'Acholeplasmataceae')]

So, now the question is "Why did .+[rna]\s+ match ' 1 1411 1411bp rna Acholeplasma hippikon '?" Sometimes the context matters, but in this case, it doesn't. You don't want that group to match that string in any context, and yet it will always match it, so that's the part you have to debug.
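The same inspection can be scripted. This is a small sketch that prints every group with its number, using the sample line from earlier in this answer; the variable names are illustrative.

```python
import re

line = '> AY538167 1 1411 1411bp rna Acholeplasma hippikon Acholeplasmataceae'

# Debugging variant of the pattern: every piece gets its own capture group.
debug_regex = re.compile(r'(>)(\s)(\w+)(.+[rna]\s+)([A-Z].+)')

match = debug_regex.search(line)
if match:
    # Print each group with its index so the over-eager one stands out.
    for index, text in enumerate(match.groups(), start=1):
        print(f'group {index}: {text!r}')
else:
    print('no match')
```

Using repr (the !r in the format string) makes leading and trailing whitespace in each group visible.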


Also, a visual regexp explorer often helps a lot. The best ones can color parts of the expression and the matched text, etc., to show you how and why the regexp is doing what it does.

Of course you're limited to those that run on your platform or online and that work with Python syntax. If you're careful and/or only use simple features (as in your example), perl/PCRE syntax is very close to Python, and JavaScript/ActionScript is also pretty close (the one big difference to keep in mind is that replace/sub uses $1 instead of \1).

I don't have a good online one to strongly recommend, but from a quick glance Debuggex looks pretty cool.
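On the replacement-syntax difference mentioned above, here is a short sketch of the Python side only. Python's re.sub accepts both \1 and the more explicit \g<1> in the replacement string; the sample line is the one from the question.

```python
import re

line = '> D50541 1 1409 1409bp rna Abiotrophia defectiva Aerococcaceae'
pattern = re.compile(r'(>)\s(\w+).+\s+rna\s+([A-Z].+)')

# Two equivalent ways to refer to groups in a Python replacement string.
short_form = pattern.sub(r'\1\2 \3', line)
explicit_form = pattern.sub(r'\g<1>\g<2> \g<3>', line)

print(short_form)
print(short_form == explicit_form)
```

The \g<1> form is mainly useful when a group reference is followed directly by a literal digit, where \1 would be ambiguous.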

Items between brackets are character classes, so by setting your regex to look for "[rna]" you are asking for a single character that is either r, n, or a, not for the three-letter sequence rna.

Further, if the lines you want all have the pattern "bp rna", I'd use that to yank those lines. Reading the file line by line, the following worked for me as a quick and dirty line-yanker, for instance:
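A quick way to see the difference, as a sketch with a made-up test string:

```python
import re

text = 'hippikon rna'

# Character class: each match is ONE character out of r, n, a.
class_hits = re.findall(r'[rna]', text)

# Literal: each match is the full three-letter sequence.
literal_hits = re.findall(r'rna', text)

print(class_hits)
print(literal_hits)
```

The character class finds the n at the end of hippikon as well as each letter of rna, while the literal pattern matches only the word itself.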

regex = re.compile(r'^>[\w\s]+bp rna .*$')

But, again, if it's as simple as finding lines with "bp rna" in them, you could read the file line by line and forgo regex entirely:

for line in file:
    if "bp rna" in line:
        print(line)

EDIT: I blew it by not reading the request carefully enough. Maybe a capture-and-replace regex would help?

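import re

with open('LTPs111.compressed.fasta') as file:
    for line in file:
        if "bp rna" in line:
            # allow optional whitespace after '>' and any run of whitespace around rna
            subreg = re.sub(r'^>\s*(\w+)\s[\d\s]+bp\s+rna\s+([\w\s]+$)', r">\1 \2", line)
            print(subreg)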

OUTPUT:

>AY538166 Acholeplasma granularum Acholeplasmataceae

>AY538167 Acholeplasma hippikon Acholeplasmataceae

This should match any whitespace (tabs or spaces) between the things you want.
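If the regex keeps causing trouble, whitespace splitting is a different technique that gets the same result. This is only a sketch, assuming every header has a field that is exactly rna and that the accession number is the first field after the >. The function name reduce_header is illustrative.

```python
def reduce_header(line):
    """Return '>ACCESSION name ...' for a header line, or None for any other line."""
    if not line.startswith('>'):
        return None
    fields = line[1:].split()      # drop '>' and split on any run of whitespace
    if 'rna' not in fields:
        return None
    cut = fields.index('rna')      # first field that is exactly 'rna'
    return '>' + ' '.join([fields[0]] + fields[cut + 1:])

print(reduce_header('> D50541 1 1409 1409bp rna     Abiotrophia defectiva Aerococcaceae\n'))
print(reduce_header('CUGGCGGCGU GCCUAAUACA UGCAAGUCGA\n'))
```

Because the comparison is against whole fields, a name that merely ends in rna cannot be mistaken for the marker.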
