Removing duplicates in a text file before converting to a CSV

I am looking to identify duplicates in a raw text file I have; once a duplicate has been identified, I want to ignore it when creating a new CSV file.

import csv

raw_file_reader = csv.DictReader(open(raw_file), delimiter='|')

Keep in mind the raw file is a simple .txt file.

with open('file') as f:
    seen = set()
    for line in f:
        line_lower = line.lower()
        if line_lower in seen:
            print(line)
        else:
            seen.add(line_lower)

I can find the duplicates using sets, as above. For each row I also look up the symbol:

for row in raw_file_reader:
    if 'Symbol' in row:
        symbol = row['Symbol']
    elif 'SYMBOL' in row:
        symbol = row['SYMBOL']
    else:
        raise Exception('no Symbol column found')

    if symbol not in symbol_lookup:
        continue

I am just not sure how to actually ignore the duplicates before converting to the CSV file.
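To make the goal concrete, the overall flow I'm after is something like this (a sketch, assuming Python 3; out.csv is just a hypothetical output name, and the marked line is the part I'm missing):

import csv

with open(raw_file, newline='') as f_in, open('out.csv', 'w', newline='') as f_out:
    reader = csv.DictReader(f_in, delimiter='|')
    writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # TODO: skip this row if it duplicates one already written
        writer.writerow(row)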

I'd use the csv library to do this. Additionally, there is a built-in way to enumerate items, so let's use that.

import csv

with open("in.txt", "r") as fi, open("out.csv", "w") as fo:
    writer = csv.writer(fo, lineterminator='\n')
    # split on the '|' delimiter, deduplicate via a set,
    # then write each unique entry as a (number, value) row
    writer.writerows(enumerate(set(fi.read().split("|"))))
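For illustration, duplicates collapse inside the set, and enumerate then numbers each surviving entry (sorted here only to make the output deterministic, since set order is arbitrary):

>>> sorted({"TECH", "TECD", "TECH"})
['TECD', 'TECH']
>>> list(enumerate(['TECD', 'TECH']))
[(0, 'TECD'), (1, 'TECH')]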

You could remove duplicates by storing each row in a set as you go, as follows:

import csv

seen = set()
output = []

source_file = "file.csv"

# binary mode ('rb'/'wb') is the correct way to open CSV files on Python 2
with open(source_file, 'rb') as f_input:
    csv_input = csv.reader(f_input, delimiter='|')

    for row in csv_input:
        if tuple(row) not in seen:   # tuples are hashable, lists are not
            output.append(row)
            seen.add(tuple(row))

# rewrite the same file, now deduplicated and comma-delimited
with open(source_file, 'wb') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerows(output)

Giving you an output file:

20100830,TECD,1500,4300,N
20100830,TECH,100,100,N
20100830,TECUA,100,391,N
20100830,TEF,1300,1300,N
20100830,TEG,900,1900,N

This works by converting each whole row into a tuple, which can then be stored in a set. That makes testing for duplicate rows straightforward.
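The tuple conversion matters because lists are unhashable and cannot be added to a set:

>>> seen = set()
>>> seen.add(['20100830', 'TECD'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
>>> seen.add(('20100830', 'TECD'))  # the tuple version works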

Tested on Python 2.7.12
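On Python 3 the same approach works, but csv files should be opened in text mode with newline='' instead of 'rb'/'wb'; a minimal sketch of just the two open() calls:

with open(source_file, newline='') as f_input:
    csv_input = csv.reader(f_input, delimiter='|')

with open(source_file, 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)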

You can simply create a custom iterator that returns the original file's lines while removing duplicates:

class Dedup:
    def __init__(self, fd):
        self.fd = fd       # store the original file object
        self.seen = set()  # initialize an empty set of lines already seen
    def __next__(self):        # the iterator method
        while True:
            line = next(self.fd)
            if line not in self.seen:
                self.seen.add(line)
                return line
            # print("DUP>", line.strip(), "<") # uncomment for tests
    def __iter__(self):        # iterators must also be iterable
        return self
    def next(self):            # Python 2 compatibility alias for __next__
        return self.__next__()
    def __enter__(self):       # make it a context manager supporting with
        return self
    def __exit__(self, typ, value, traceback):
        self.fd.close()        # cleanup

You can then create your DictReader simply:

with Dedup(open(raw_file)) as fd:
    reader = csv.DictReader(fd, delimiter='|')
    for row in reader:
        pass  # process each now-unique row...

But beware! This keeps every distinct line in the set, so in the worst case the whole file must fit in memory.
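If that becomes a problem, one variation (my own sketch, assuming Python 3; not part of the class above) is to store a fixed-size digest of each line rather than the line itself:

import hashlib

class DedupHashed(Dedup):
    # identical iteration logic, but self.seen holds 20-byte
    # SHA-1 digests instead of the full lines
    def __next__(self):
        while True:
            line = next(self.fd)
            key = hashlib.sha1(line.encode('utf-8')).digest()
            if key not in self.seen:
                self.seen.add(key)
                return line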
