
Python: Removing dupes from large text file

I need my code to remove duplicate lines from a file. At the moment it just reproduces the same file as output. Can anyone see how to fix this? The for loop is not behaving as I expected.

#!usr/bin/python
import os
import sys

#Reading Input file
f = open(sys.argv[1]).readlines()

#printing no of lines in the input file
print "Total lines in the input file",len(f)

#temporary dictionary to store the unique records/rows
temp = {}

#counter to count unique items
count = 0

for i in range(0,9057,1):
    if i not in temp: #if row is not there in dictionary i.e it is unique so store it into a dictionary
        temp[f[i]] = 1;
        count += 1
    else:   #if exact row is there then print duplicate record and dont store that
        print "Duplicate Records",f[i]
        continue;

#once all the records are read print how many unique records are there
#u can print all unique records by printing temp
print "Unique records",count,len(temp)

#f = open("C://Python27//Vendor Heat Map Test 31072015.csv", 'w')
#print f
#f.close()
nf = open("C://Python34//Unique_Data.csv", "w")
for data in temp.keys():
        nf.write(data)
nf.close()


# Written by Gary O'Neill
# Date 03-08-15

This is a much better way to do what you want:

infile_path = 'infile.csv'
outfile_path = 'outfile.csv'

written_lines = set()

with open(infile_path, 'r') as infile, open(outfile_path, 'w') as outfile:
    for line in infile:
        if line not in written_lines:
            outfile.write(line)
            written_lines.add(line)
        else:
            print "Duplicate record: {}".format(line)

print "{} unique records".format(len(written_lines))

This reads one line at a time, so the whole file never has to be loaded into memory at once. If most of the lines are unique, written_lines will still end up being large, but this avoids first reading the entire file into a list with readlines().
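If the lines are long, one way to shrink the set further is to store a fixed-size digest of each line instead of the line itself. The following is a minimal Python 3 sketch of that variation, not part of the answer above. The function name dedupe_lines is made up for illustration, and the sketch assumes that a SHA-1 collision between two different lines is unlikely enough to ignore:

```python
import hashlib


def dedupe_lines(infile_path, outfile_path):
    """Copy infile to outfile, skipping lines that were already written.

    Returns a tuple (unique_count, duplicate_count).
    """
    seen_digests = set()  # 20-byte digests rather than the full lines
    duplicates = 0
    # Binary mode so the bytes can be hashed and written back unchanged.
    with open(infile_path, "rb") as infile, open(outfile_path, "wb") as outfile:
        for line in infile:  # the file object yields one line at a time
            digest = hashlib.sha1(line).digest()
            if digest in seen_digests:
                duplicates += 1
                continue
            seen_digests.add(digest)
            outfile.write(line)
    return len(seen_digests), duplicates
```

Calling dedupe_lines('infile.csv', 'outfile.csv') would return the number of unique and duplicate lines. This only saves memory when the typical line is longer than the digest; for short lines the plain set of lines is simpler and just as good.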

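You should test whether f[i] is in temp, not i. Here i is an integer index, while the keys of temp are the line strings, so "i not in temp" is always true and the else branch never runs. Change the line: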

 if i not in temp:

with

 if f[i] not in temp:
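For reference, the corrected check can be sketched in Python 3 as follows. This is an illustrative rewrite rather than the asker's exact script: the function name count_unique is hypothetical, and it iterates over the rows directly instead of using the hard-coded range(0, 9057, 1):

```python
def count_unique(lines):
    """Return (temp, duplicates): a dict keyed by unique rows, and the repeated rows."""
    temp = {}
    duplicates = []
    for row in lines:
        if row not in temp:  # compare the row itself, not its index
            temp[row] = 1
        else:
            duplicates.append(row)
    return temp, duplicates


if __name__ == "__main__":
    temp, duplicates = count_unique(["a\n", "b\n", "a\n"])
    print("Unique records", len(temp))
    print("Duplicate records", len(duplicates))
```

Since dictionaries keep insertion order in Python 3.7 and later, writing out the keys of temp would also preserve the order in which the rows first appeared.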

