I am looking to identify duplicates in a raw text file I have; once a duplicate has been identified, I want to ignore it when creating a new CSV file.
raw_file_reader = csv.DictReader(open(raw_file), delimiter='|')

Keep in mind the raw file is a simple .txt file.
with open('file') as f:
    seen = set()
    for line in f:
        line_lower = line.lower()
        if line_lower in seen:
            print(line)
        else:
            seen.add(line_lower)
I can find the duplicates using sets
for row in raw_file_reader:
    if 'Symbol' in row:
        symbol = row['Symbol']
    elif 'SYMBOL' in row:
        symbol = row['SYMBOL']
    else:
        raise Exception('Symbol column not found')
    if symbol not in symbol_lookup:
        continue
I am just not sure how to actually ignore the duplicates before converting to the csv file.
I'd use the csv library to do this. Additionally, there is a built-in way to enumerate items, so let's use that.
import csv

with open("in.txt", "r") as fi, open("out.csv", "w") as fo:
    writer = csv.writer(fo, lineterminator='\n')
    writer.writerows(enumerate(set(fi.read().split("|"))))
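Note that a `set` does not preserve input order. If the order of first occurrence matters, a sketch using `dict.fromkeys` (dicts preserve insertion order since Python 3.7) does the same deduplication while keeping order; the file names here are hypothetical:

```python
import csv

# Create a small sample '|'-separated input for demonstration.
with open("in.txt", "w") as f:
    f.write("AAPL|MSFT|AAPL|TECD")

with open("in.txt") as fi, open("out.csv", "w", newline="") as fo:
    writer = csv.writer(fo)
    # dict.fromkeys drops repeated fields but keeps first-seen order.
    unique = dict.fromkeys(fi.read().split("|"))
    writer.writerows(enumerate(unique))
```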
You could remove duplicates by storing all entries in a set as you go along as follows:
import csv

seen = set()
output = []

source_file = "file.csv"

with open(source_file, 'rb') as f_input:
    csv_input = csv.reader(f_input, delimiter='|')
    for row in csv_input:
        if tuple(row) not in seen:
            output.append(row)
            seen.add(tuple(row))

with open(source_file, 'wb') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerows(output)
Giving you an output file:
20100830,TECD,1500,4300,N
20100830,TECH,100,100,N
20100830,TECUA,100,391,N
20100830,TEF,1300,1300,N
20100830,TEG,900,1900,N
This works by converting each whole row into a tuple, which can then be stored in a set (lists are not hashable, tuples are). This makes testing for duplicate lines straightforward.
Tested on Python 2.7.12
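On Python 3, the same tuple-in-a-set technique works with text-mode files opened with `newline=''` instead of `'rb'`/`'wb'`. The sketch below (file names are hypothetical, and a tiny sample input is created first) also writes each unique row as it is read, instead of buffering the whole output in a list:

```python
import csv

source_file = "file.csv"       # hypothetical '|'-delimited input
output_file = "deduped.csv"    # hypothetical output

# Create a small sample input for demonstration.
with open(source_file, 'w', newline='') as f:
    f.write("20100830|TECD|1500|4300|N\n")
    f.write("20100830|TECD|1500|4300|N\n")   # duplicate row
    f.write("20100830|TECH|100|100|N\n")

seen = set()
with open(source_file, newline='') as f_input, \
     open(output_file, 'w', newline='') as f_output:
    csv_input = csv.reader(f_input, delimiter='|')
    csv_output = csv.writer(f_output)
    for row in csv_input:
        key = tuple(row)       # lists are unhashable; tuples can go in a set
        if key not in seen:
            seen.add(key)
            csv_output.writerow(row)
```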
You can simply create a custom iterator that returns the original file's lines, skipping duplicates:
class Dedup:
    def __init__(self, fd):
        self.fd = fd               # store the original file object
        self.seen = set()          # initialize an empty set of seen lines

    def __next__(self):            # the iterator method
        while True:
            line = next(self.fd)
            if line not in self.seen:
                self.seen.add(line)
                return line
            # print("DUP>", line.strip(), "<")  # uncomment for tests

    def __iter__(self):            # standard iterator protocol
        return self

    def next(self):                # Python 2 compatibility
        return self.__next__()

    def __enter__(self):           # make it a context manager supporting with
        return self

    def __exit__(self, typ, value, traceback):
        self.fd.close()            # cleanup
You can then create your DictReader simply:
with Dedup(open(raw_file)) as fd:
    reader = csv.DictReader(fd, delimiter='|')
    for row in reader:
        # process each now-unique row...
But beware! This requires storing every unique line in the set, meaning the file's unique content must fit in memory.
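If memory becomes a concern, one mitigation (a sketch, not part of the original answer) is to store a fixed-size digest of each line rather than the line itself; this bounds the per-line memory cost at the price of a negligible hash-collision risk:

```python
import hashlib

class DedupHashed:
    """Variant of the Dedup iterator above that stores 16-byte MD5
    digests instead of full lines, so long lines cost fixed memory."""

    def __init__(self, fd):
        self.fd = fd
        self.seen = set()

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            line = next(self.fd)   # StopIteration ends the loop naturally
            digest = hashlib.md5(line.encode('utf-8')).digest()
            if digest not in self.seen:
                self.seen.add(digest)
                return line
```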