Removing duplicates in a text file before converting to a CSV

I am looking to identify duplicates in a raw text file I have; once a duplicate has been identified, I want to ignore it when creating a new CSV file.

import csv

raw_file_reader = csv.DictReader(open(raw_file), delimiter='|')

Keep in mind the raw file is a simple .txt file.

with open('file') as f:
    seen = set()
    for line in f:
        line_lower = line.lower()
        if line_lower in seen:
            print(line)
        else:
            seen.add(line_lower)

I can find the duplicates using sets, as above. For each row I also look up the symbol:

for row in raw_file_reader:
    if 'Symbol' in row:
        symbol = row['Symbol']
    elif 'SYMBOL' in row:
        symbol = row['SYMBOL']
    else:
        raise Exception('no Symbol column found')

    if symbol not in symbol_lookup:
        continue

I am just not sure how to actually ignore the duplicates before converting to the CSV file.
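To make the goal concrete, the overall flow I'm after is something like this (a sketch, assuming Python 3; out.csv is just a hypothetical output name, and the marked line is the part I'm missing):

import csv

with open(raw_file, newline='') as f_in, open('out.csv', 'w', newline='') as f_out:
    reader = csv.DictReader(f_in, delimiter='|')
    writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # TODO: skip this row if it duplicates one already written
        writer.writerow(row)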

I'd use the csv library to do this. Additionally, there is a built-in way to enumerate items, so let's use that.

import csv

with open("in.txt", "r") as fi, open("out.csv", "w") as fo:
    writer = csv.writer(fo, lineterminator='\n')
    # split on the '|' delimiter, deduplicate via a set,
    # then write each unique entry as a (number, value) row
    writer.writerows(enumerate(set(fi.read().split("|"))))
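For illustration, duplicates collapse inside the set, and enumerate then numbers each surviving entry (sorted here only to make the output deterministic, since set order is arbitrary):

>>> sorted({"TECH", "TECD", "TECH"})
['TECD', 'TECH']
>>> list(enumerate(['TECD', 'TECH']))
[(0, 'TECD'), (1, 'TECH')]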

You could remove duplicates by storing each row in a set as you go, as follows:

import csv

seen = set()
output = []

source_file = "file.csv"

# binary mode ('rb'/'wb') is the correct way to open CSV files on Python 2
with open(source_file, 'rb') as f_input:
    csv_input = csv.reader(f_input, delimiter='|')

    for row in csv_input:
        if tuple(row) not in seen:   # tuples are hashable, lists are not
            output.append(row)
            seen.add(tuple(row))

# rewrite the same file, now deduplicated and comma-delimited
with open(source_file, 'wb') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerows(output)

Giving you an output file:

20100830,TECD,1500,4300,N
20100830,TECH,100,100,N
20100830,TECUA,100,391,N
20100830,TEF,1300,1300,N
20100830,TEG,900,1900,N

This works by converting each whole row into a tuple, which can then be stored in a set. That makes testing for duplicate rows straightforward.
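The tuple conversion matters because lists are unhashable and cannot be added to a set:

>>> seen = set()
>>> seen.add(['20100830', 'TECD'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
>>> seen.add(('20100830', 'TECD'))  # the tuple version works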

Tested on Python 2.7.12
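On Python 3 the same approach works, but csv files should be opened in text mode with newline='' instead of 'rb'/'wb'; a minimal sketch of just the two open() calls:

with open(source_file, newline='') as f_input:
    csv_input = csv.reader(f_input, delimiter='|')

with open(source_file, 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)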

You can simply create a custom iterator that returns the original file's lines while removing duplicates:

class Dedup:
    def __init__(self, fd):
        self.fd = fd       # store the original file object
        self.seen = set()  # initialize an empty set of lines already seen
    def __next__(self):        # the iterator method
        while True:
            line = next(self.fd)
            if line not in self.seen:
                self.seen.add(line)
                return line
            # print("DUP>", line.strip(), "<") # uncomment for tests
    def __iter__(self):        # iterators must also be iterable
        return self
    def next(self):            # Python 2 compatibility alias for __next__
        return self.__next__()
    def __enter__(self):       # make it a context manager supporting with
        return self
    def __exit__(self, typ, value, traceback):
        self.fd.close()        # cleanup

You can then create your DictReader simply:

with Dedup(open(raw_file)) as fd:
    reader = csv.DictReader(fd, delimiter='|')
    for row in reader:
        pass  # process each now-unique row...

But beware! This keeps every distinct line in the set, so in the worst case the whole file must fit in memory.
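If that becomes a problem, one variation (my own sketch, assuming Python 3; not part of the class above) is to store a fixed-size digest of each line rather than the line itself:

import hashlib

class DedupHashed(Dedup):
    # identical iteration logic, but self.seen holds 20-byte
    # SHA-1 digests instead of the full lines
    def __next__(self):
        while True:
            line = next(self.fd)
            key = hashlib.sha1(line.encode('utf-8')).digest()
            if key not in self.seen:
                self.seen.add(key)
                return line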
