
How to join two consecutive lines of a file if they meet a certain condition?

** New to Python, sorry **

I'm trying to take a given example file and add only the lines containing "A" or "T" or "G" or "C" (DNA strands) to a list, using a function.

Example file:

gene1
ATGATGATGGCG
gene2
GGCATATC
CGGATACC
gene3
TAGCTAGCCCGC

Under gene2 there are two separate lines I need to concatenate using my function.

Here's what I have completed for my function:

def create(filename):
    """
    Purpose: Creates and returns a data structure (list) to store data.
    :param filename: The given file
    Post-conditions: (none)
    :return: List of data.
    """
    new_list = []
    f = open(filename, 'r')
    for i in f:
        if not('A' or 'T' or 'G' or 'C') in i:
            new_list = new_list  #Added this so nothing happens but loop cont.
        else:
            new_list.append(i.strip())
    f.close()
    return new_list

I need to somehow find parts of the file where there are two consecutive lines of DNA ("GTCA") and join them before adding them to my list.

If done correctly the output when printed should read:

['ATGATGATGGCG', 'GGCATATCCGGATACC', 'TAGCTAGCCCGC']

Thanks in advance!

You can use sets to check whether a line is a DNA line, i.e. consists of the letters ACGT only:

with open(filename) as f:
    new_list = []
    concat = False
    for line in f:
        # Subset test: every character must be one of ACGT (blank lines are skipped)
        if line.strip() and set(line.strip()) <= {'A', 'C', 'G', 'T'}:
            if concat:
                new_list[-1] += line.strip()
            else:
                new_list.append(line.strip())
            concat = True
        else:
            concat = False

# ['ATGATGATGGCG', 'GGCATATCCGGATACC', 'TAGCTAGCCCGC']

Regexes to the rescue!

import re

def create(filename):
    dna_regex = re.compile(r'[ATGC]+')
    with open(filename, 'r') as f:
        return dna_regex.findall(f.read().replace('\n', ''))

new_list = []
new_list += create("gene_file.txt")

Note that this implementation can produce false positives if the gene-name lines contain an uppercase A, T, G, or C.

It reads the whole file, removes the newlines, and then returns every run of characters that consists only of A, T, G, or C.
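To make the false-positive caveat concrete, here is a minimal sketch that compares the flatten-then-search approach with a per-line check using `re.fullmatch`. The helper name `join_dna_lines` and the sample header names (`GeneA`, `TagC`) are made up for illustration; they are not from the question.

```python
import re

DNA_LINE = re.compile(r'[ACGT]+')

def join_dna_lines(lines):
    """Join consecutive DNA lines; any other line ends the current run."""
    result = []
    in_run = False
    for raw in lines:
        line = raw.strip()
        # fullmatch succeeds only if the whole line is made of A, C, G, T
        if DNA_LINE.fullmatch(line):
            if in_run:
                result[-1] += line
            else:
                result.append(line)
            in_run = True
        else:
            in_run = False
    return result

# Hypothetical input whose header lines contain uppercase DNA letters
sample = ['GeneA', 'ATGATG', 'CCGG', 'TagC', 'TTAA']

# Flattening first lets header characters leak into the matches
flattened = re.findall(r'[ATGC]+', ''.join(sample))
# ['G', 'AATGATGCCGGT', 'CTTAA']

# Checking line by line keeps headers out
per_line = join_dna_lines(sample)
# ['ATGATGCCGG', 'TTAA']
```

The per-line version treats a line as DNA only when every character matches, so uppercase letters in a header can no longer bleed into a neighbouring sequence.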

If we can assume that each DNA section is preceded by exactly one header line, we can use itertools.takewhile to group the DNA lines. takewhile also consumes the first line that fails the test, which is how each header gets skipped:

from itertools import takewhile

DNA_CHARS = ('A', 'T', 'G', 'C')
lines = ['gene1', 'ATGATGATGGCG', 'gene2', 'GGCATATC', 'CGGATACC', 'gene3', 'TAGCTAGCCCGC']

input_lines = iter(lines[1:])
dna_lines = []

while True:
    dna_line = ''.join(takewhile(lambda l: any(dna_char in l for dna_char in DNA_CHARS),
                                  input_lines))
    if not dna_line:
        break
    dna_lines.append(dna_line)
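If the one-header-per-section assumption does not hold, `itertools.groupby` is a possible alternative: it groups consecutive lines by whether they look like DNA, so there is no need to skip the first header, and two header lines in a row do not end the loop early. This is only a sketch; the helper names `is_dna` and `group_dna` are made up for illustration.

```python
from itertools import groupby

DNA_CHARS = set('ACGT')

def is_dna(line):
    """True if the line is non-empty and contains only A, C, G, T."""
    return bool(line) and set(line) <= DNA_CHARS

def group_dna(lines):
    """Concatenate each run of consecutive DNA lines into one string."""
    stripped = (line.strip() for line in lines)
    return [''.join(group)
            for matches, group in groupby(stripped, key=is_dna)
            if matches]

lines = ['gene1', 'ATGATGATGGCG', 'gene2', 'GGCATATC', 'CGGATACC', 'gene3', 'TAGCTAGCCCGC']
dna_lines = group_dna(lines)
# ['ATGATGATGGCG', 'GGCATATCCGGATACC', 'TAGCTAGCCCGC']
```

Because `group_dna` accepts any iterable of strings, an open file object can be passed in directly instead of a list.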
