How to remove a specific part of a text document without using range() in Python

Basically, I am given a text document that contains a header (about 8 lines) followed by many lines of DNA sequence. I can work out everything I need for my particular problem except how to separate the DNA sequences from the header. I was able to put everything in a list using

dna = open("dna_sequence.txt").read().split('\n')

That successfully puts every individual line into a list, which is what I want. However, the first 8 items in the list are essentially garbage, and I need to remove them from the rest of the list without using something like .pop(), a slice, or building the list from a range.

The only module we are allowed to use for this assignment is pandas, but we haven't gone over it much yet and I am not familiar with it. I know it can be done without that module.

Okay, here is more information from the comments. Sorry I didn't include it, I didn't know it would be important :D

LOCUS: SCU49845
ACCESSION: U49845
ORGANISM: Saccharomyces cerevisiae (baker's yeast)          
AUTHORS: Roemer,T., Madden,K., Chang,J. and Snyder,M.
TITLE: Selection of axial growth sites in yeast requires Axl2p, a novel plasma membrane glycoprotein
JOURNAL: Genes Dev. 10 (7), 777-793 (1996)
PUBMED: 8846915
SOURCE: https://www.ncbi.nlm.nih.gov/nuccore/U49845.1?report=genbank&to=5028
GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAACCATTGCCGACATGAG
ACAGTTAGGTATCGTCGAGAGTTACAAGCTAAAACGAGCAGTAGTCAGCTCTGCATCTGAAGCCGCTGAA
GTTCTACTAAGGGTGGATAACATCATCCGTGCAAGACCAAGAACCGCCAATAGACAACATATGTAACATA
TTTAGGATATACCTCGAAAATAATAAACCGCCACACTGTCATTATTATAATTAGAAACAGAACGCAAAAA

So the .txt document I am going to be using looks like this, with many more lines of sequence. I need to remove the parts above the DNA sequence so I am left with a list of just the DNA sequences. It doesn't really matter how long the strings are, since I will just use a for loop for the next part of the assignment.

The assignment is to take the DNA sequences and create a single string that contains the complements, which I can easily do with a for loop since there are only 4 nucleotides and each has exactly one complement.

He specifically said we can do it with pandas, but since we haven't gone over it much he doesn't expect us to know exactly how, and we can do it with just Python.

If I could just do dna.slice(1,9) that would be simple, but he said we cannot do that, so I am lost.

If you are just looking to get the DNA sequence, you could use a regex to go through the file:

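import re

somefile = "dna_sequence.txt"  # file name taken from the question

with open(somefile) as fh:
    # Strip the trailing newline, then keep only lines made up entirely of A, G, C, T
    mydna = [line.strip() for line in fh if re.match(r'^[AGCT]+$', line.strip())]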

mydna
# ['GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAACCATTGCCGACATGAG',
# 'ACAGTTAGGTATCGTCGAGAGTTACAAGCTAAAACGAGCAGTAGTCAGCTCTGCATCTGAAGCCGCTGAA',
# 'GTTCTACTAAGGGTGGATAACATCATCCGTGCAAGACCAAGAACCGCCAATAGACAACATATGTAACATA',
# 'TTTAGGATATACCTCGAAAATAATAAACCGCCACACTGTCATTATTATAATTAGAAACAGAACGCAAAAA']

That way you aren't skipping an arbitrary number of lines, though this isn't a pandas-specific answer.
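If `re` turns out to be off-limits too, a plain-Python sketch of the same idea could check each line against the set of valid bases. The `sample` string and the helper name `is_dna` are made up for illustration; `sample` stands in for the file contents so the snippet runs on its own:

```python
# Hypothetical sample standing in for the contents of the .txt file
sample = """LOCUS: SCU49845
ACCESSION: U49845
GATCCTCCATATACAACGG
ACAGTTAGGTATCGTCGAG
"""

BASES = set("ACGT")

def is_dna(line):
    # True only for non-empty lines made up entirely of A, C, G, T
    return bool(line) and set(line) <= BASES

mydna = [line for line in sample.splitlines() if is_dna(line)]
print(mydna)
```

With a real file you would loop over the file handle and strip the newline from each line before calling `is_dna`.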

To build the complement DNA strings, you could use a dictionary to map bases to their complements and iterate over each string like so:

mapping = {'A': 'T', 'T': 'A', 'C': 'G', 'G':'C'}

# .get(base, ' ') returns the mapped value, or a single space if base is not a key
# mapping.get('A', ' ') will return 'T' whereas mapping.get('U', ' ') will
# return ' '
complements = [''.join(mapping.get(base, ' ') for base in dna) for dna in mydna]
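Since the assignment asks for a single string, another option (a sketch, not the only way) is `str.translate` with a table from `str.maketrans`, then joining the results. The `mydna` list here is a shortened, made-up stand-in for the real list of sequence lines:

```python
# Shortened stand-in for the list of sequence lines
mydna = ["GATC", "ACAG"]

# Translation table: A<->T, C<->G
table = str.maketrans("ATCG", "TAGC")

# Complement each line, then join everything into one string
complement = "".join(seq.translate(table) for seq in mydna)
print(complement)  # CTAGTGTC
```

Note that `translate` leaves characters it doesn't know (such as 'U') unchanged, whereas the `.get(base, ' ')` version replaces them with a space.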

Pandas answer:

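import pandas as pd

# One row per line (file name assumed from the question). The lines are read
# with open() because recent pandas versions reject sep="\n" in read_csv.
with open("dna_sequence.txt") as fh:
    df = pd.DataFrame({'code': fh.read().splitlines()})

regex = "[^ATCG]+\\b"     # Matches a run of characters that are not DNA bases.
not_dna = df['code'].str.contains(regex)
df = df[~not_dna]         # Keep only the DNA lines.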

Okay, just to clarify and show the answer I ended up with, in case someone else has the same question.

I was allowed to use re; I checked with my professor.

But the assignment was: "Define a function, so that, provided an input from a text file (*.txt) would find all DNA sequences and provide the complements"

Thanks to you guys, and some YouTubing/reading, this is what I came up with (I am sure it can be cleaned up, but it's not due until Monday):

import re
def dnaMatching(t):
    with open(t) as n:
        # Keep only lines made up entirely of A, G, C, T
        dna = [line for line in n if re.match(r'^[AGCT]+$', line)]
    complement = ""
    for i in dna:
        for x in i:
            if x == 'A':
                complement += 'T'
            elif x == 'G':
                complement += 'C'
            elif x == 'C':
                complement += 'G'
            elif x == 'T':
                complement += 'A'
    return complement
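One possible cleanup, keeping the same behaviour but swapping the if/elif chain for the dictionary lookup suggested in the answer above. The function name `dna_complement` and the throwaway temporary file are only for illustration:

```python
import os
import re
import tempfile

COMPLEMENT = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

def dna_complement(path):
    """Return one string holding the complement of every DNA line in the file."""
    with open(path) as fh:
        # Keep only lines made up entirely of A, G, C, T (newline stripped)
        dna = [line.strip() for line in fh if re.match(r'^[AGCT]+$', line.strip())]
    # Look up each base and join once instead of growing a string with +=
    return ''.join(COMPLEMENT[base] for seq in dna for base in seq)

# Try it on a small throwaway file
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("LOCUS: SCU49845\nGATC\nACAG\n")
result = dna_complement(tmp.name)
os.remove(tmp.name)
print(result)  # CTAGTGTC
```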

Thank you guys so much for your help!
