
Unable to remove line breaks in a text file in Python

At the risk of losing reputation, I did not know what else to do. My file is not showing any hidden characters, and I have tried every .replace and .strip I can think of. My file is UTF-8 encoded and I am using Python 3.6.1. I have a file with the format:

 >header1
 AAAAAAAA
 TTTTTTTT
 CCCCCCCC
 GGGGGGGG

 >header2
 CCCCCC
 TTTTTT
 GGGGGG
 AAAAAA

I am trying to remove the line breaks from the end of each line so that the sequence under each header becomes one continuous string. (The file is actually thousands of lines long.) My code is redundant in the sense that I typed in everything I could think of to remove line breaks:

 fref = open(ref)
 for line in fref:
     sequence = 0
     header = 0
     if line.startswith('>'):
          header = ''.join(line.splitlines())
          print(header)
     else:
          sequence = line.strip("\n").strip("\r")
          sequence = line.replace('\n', ' ').replace('\r', '').replace(' ', '').replace('\t', '')
          print(len(sequence))

output is:

 >header1
 8
 8
 8
 8
 >header2
 6
 6
 6
 6

But if I manually go in and delete the line endings to join the lines, the result is treated as one contiguous string.

Expected output:

 >header1
 32
 >header2
 24

Thanks in advance for any help, Dennis

There are several approaches to parsing this kind of input. In all cases, I would recommend isolating the open and print side-effects outside of a function that you can unit test to convince yourself of the proper behavior.

You could iterate over each line and handle the case of empty lines and end-of-file separately. Here, I use yield statements to return the values:

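def parse(infile):
    line = ""   # stays empty if the input has no lines at all
    total = 0
    for line in infile:
        if line.startswith(">"):
            # A header starts a new record: reset the running total.
            total = 0
            yield line.strip()
        elif not line.strip():
            # A blank line closes the current record.
            yield total
        else:
            total += len(line.strip())
    # The last record has no trailing blank line, so emit its total here.
    if line.strip():
        yield total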

def test_parse():
    with open("input.txt") as infile:
        assert list(parse(infile)) == [
            ">header1",
            32,
            ">header2",
            24,
        ]

Or, you could handle both empty lines and end-of-file at the same time. Here, I use an output list to which I append headers and totals:

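def parse(infile):
    output = []
    header = None   # no record is open yet
    total = 0
    while True:
        line = infile.readline()   # returns "" at end-of-file
        if line.startswith(">"):
            total = 0
            header = line.strip()
        elif line and line.strip():
            total += len(line.strip())
        else:
            # Blank line or end-of-file: close the open record, if any.
            if header is not None:
                output.append(header)
                output.append(total)
                header = None
            if not line:
                break

    return output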

def test_parse():
    with open("input.txt") as infile:
        assert parse(infile) == [
            ">header1",
            32,
            ">header2",
            24,
        ]

Or, you could also split the whole input file into empty-line-separated blocks and parse them independently. Here, I use an output stream to which I write the output; in production, you could pass the sys.stdout stream for example:

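import re

def parse(infile, outfile):
    content = infile.read()
    for block in re.split(r"\r?\n\r?\n", content):
        if not block.strip():
            continue  # skip empty blocks, e.g. after a trailing blank line
        # The first token is the header; the rest are sequence lines.
        header, *lines = block.split()
        total = sum(len(line) for line in lines)
        outfile.write("{header}\n{total}\n".format(
            header=header,
            total=total,
        ))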

from io import StringIO

def test_parse():
    with open("input.txt") as infile:
        outfile = StringIO()
        parse(infile, outfile)
        outfile.seek(0)
        assert outfile.readlines() == [
            ">header1\n",
            "32\n",
            ">header2\n",
            "24\n",
        ]

Note that my tests use open("input.txt") for brevity but I would actually recommend passing a StringIO(...) instance instead to see the input being tested more easily, to avoid hitting the filesystem and to make the tests faster.
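As a rough sketch of what such a StringIO-based test could look like: the SAMPLE constant below is just the example input from the question, and this parse is a slightly different variant of the generator approach (my own assumption, not one of the versions above) that emits a total whenever the next header starts or the input ends, so it does not depend on blank separator lines.

```python
from io import StringIO

# Example input from the question, inlined so the test never touches the filesystem.
SAMPLE = (
    ">header1\n"
    "AAAAAAAA\n"
    "TTTTTTTT\n"
    "CCCCCCCC\n"
    "GGGGGGGG\n"
    "\n"
    ">header2\n"
    "CCCCCC\n"
    "TTTTTT\n"
    "GGGGGG\n"
    "AAAAAA\n"
)


def parse(infile):
    """Yield each header, followed by the total length of its sequence lines."""
    total = None  # None until the first header has been seen
    for line in infile:
        text = line.strip()
        if text.startswith(">"):
            if total is not None:
                yield total  # close the previous record
            yield text
            total = 0
        elif text and total is not None:
            total += len(text)
    if total is not None:
        yield total  # close the last record at end-of-input


def test_parse():
    assert list(parse(StringIO(SAMPLE))) == [">header1", 32, ">header2", 24]


test_parse()
```

Because parse only iterates over its argument, the same function works unchanged on a real file object in production.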

From my understanding of your question, you would like something like the following. Note how the sequence is built up over multiple iterations of the loop, since you want to combine multiple lines:

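with open(ref) as f:
    sequence = ""  # sequence accumulated for the current header
    header = None
    for line in f:
        if line.startswith('>'):
            if header:
                print(header)         # print previous header
                print(len(sequence))  # print length of previous sequence
            sequence = ""            # reset sequence
            header = line.rstrip()   # store header without its line break
        else:
            sequence += line.rstrip()  # append line to sequence
    if header:
        print(header)         # print the final record, which no later header flushes
        print(len(sequence))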
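The reason the loop in the question reported 8 for every line is that iterating over a file hands you one line at a time, so stripping the newline only affects that single line; nothing joins it to the next one. A small illustration, using StringIO as a stand-in for the real file (an assumption made only to keep the example self-contained):

```python
from io import StringIO

# Stand-in for the sequence lines under one header.
lines = StringIO("AAAAAAAA\nTTTTTTTT\nCCCCCCCC\nGGGGGGGG\n")

per_line = []   # what the loop in the question measured: one length per line
sequence = ""   # accumulated across iterations instead
for line in lines:
    per_line.append(len(line.strip()))
    sequence += line.strip()

print(per_line)       # [8, 8, 8, 8]
print(len(sequence))  # 32
```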
