
How to delete repeating lines starting with a specific word in Python

I have an input file of the following form. All tests start with the word "Test" and all errors start with the word "Error":

Test1
Error1
Error1 
Error2
Test1
Error3

Test2
Error1
Error4 

Test2
Error5
Error1

Test3
Error1

I want it in the format:
Test1
Error1
Error1
Error2
Error3 // Removed test1 

Test2
Error1
Error4
Error5
Error1

Test3
Error1 

Basically, while going through the file, it should delete repeated test names and write everything else to an output file in the same order. Here is my code:

import os
import sys
import optparse

def delete_duplicate(inputfile,outputfile): 
    output = open(outputfile, "w")
    from collections import OrderedDict
    input = open(inputfile, "r")
    lines = (line.strip() for line in input)
    unique_lines = OrderedDict.fromkeys((line for line in lines if line))
    for unique_line in unique_lines:
        output.write(unique_line)
        output.write("\n") 

My code removes every duplicate line and gives the result below:
Test1
Error1
Error2
Error3 

Test2
Error4
Error5

Test3 

It works fine for test names but not for errors. Can anybody help?

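All you need to do is keep the lines that start with Test in a set; if a test name is already in the set, don't write it to the output file:

def delete_duplicate(inputfile, outputfile):
    seen = set()  # test names already written
    with open(inputfile, "r") as infile, open(outputfile, "w") as output:
        for line in infile:
            line = line.strip()
            if not line:
                continue  # blank separators are re-created below
            if line.startswith('Test'):
                if line in seen:
                    continue  # repeated test name: drop it
                if seen:
                    output.write('\n')  # blank line between test groups
                seen.add(line)
            output.write(line + '\n')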

The advantage of a set is that membership checks and insertions take O(1) time on average.
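As a way to try the idea without touching any files, here is a minimal sketch that works on a list of lines. The name dedupe_test_names and the sample list are made up for illustration and are not part of the original answer; the sketch only skips repeated test names and leaves every other line untouched.

```python
def dedupe_test_names(lines):
    """Yield lines, skipping any 'Test...' line that was already yielded."""
    seen = set()  # test names emitted so far
    for line in lines:
        name = line.strip()
        if name.startswith("Test"):
            if name in seen:
                continue  # repeated test name: drop it
            seen.add(name)
        yield line


# Hypothetical sample, shortened from the input in the question
sample = ["Test1", "Error1", "Error1", "Test1", "Error3", "Test2", "Error1"]
print(list(dedupe_test_names(sample)))
# ['Test1', 'Error1', 'Error1', 'Error3', 'Test2', 'Error1']
```

Note that Error1 is kept every time it appears, because only lines starting with Test are ever added to the set.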

At the moment, your code simply inserts each line into the dictionary if it hasn't come across it before, so an error that appears under more than one test is kept only once. It seems you want to track the errors independently for each test. You could do this with an OrderedDict that would look a bit like this:

output_dict = {
    'Test1' : ['Error1','Error1','Error2','Error3'],
    'Test2' : ['Error1','Error4','Error5','Error1'],
    'Test3' : ['Error1']
}

The code to handle this would look like the following.

from collections import OrderedDict


def delete_duplicate(inputfile, outputfile):
    output_dict = OrderedDict()  # Maps each test name to its list of errors
    current_test = None  # Used to keep track of which test we are working with

    # Read the input and group the errors under their test
    with open(inputfile, "r") as infile:
        for line in infile:
            line = line.strip()
            if line.startswith('Test'):  # A new or repeated test is starting
                current_test = line
                if current_test not in output_dict:
                    output_dict[current_test] = []
            elif line.startswith('Error') and current_test is not None:
                output_dict[current_test].append(line)  # Add the error to the current test

    # Write each test once, followed by all of its errors
    with open(outputfile, "w") as outfile:
        for test, errors in output_dict.items():
            outfile.write(test + '\n')  # Write the test name
            for error in errors:
                outfile.write(error + '\n')  # Write the errors for that test
            outfile.write('\n')  # Blank line after each test group
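One consequence of grouping by test name is that a test which reappears later in the file, after other tests, still has its errors collected under its first occurrence. The sketch below illustrates this in memory; group_errors and the sample list are hypothetical names for illustration, and a plain dict is used on the assumption of Python 3.7 or later, where dicts preserve insertion order.

```python
def group_errors(lines):
    """Map each test name to the errors that follow it, in input order."""
    groups = {}     # plain dicts keep insertion order in Python 3.7+
    current = None  # test currently being filled; None before the first test
    for line in lines:
        line = line.strip()
        if line.startswith("Test"):
            current = line
            groups.setdefault(current, [])
        elif line.startswith("Error") and current is not None:
            groups[current].append(line)
    return groups


# Hypothetical sample: Test1 reappears after Test2
sample = ["Test1", "Error1", "Test2", "Error4", "", "Test1", "Error3"]
print(group_errors(sample))
# {'Test1': ['Error1', 'Error3'], 'Test2': ['Error4']}
```

The trade-off is that the whole file is held in memory before anything is written, which is fine for small log files but worth keeping in mind for very large ones.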
