
How to prevent duplication of a particular combination of entries in an array in Python?

I am writing a piece of Python code that works with arrays. I load data row by row from a CSV file into my array. The data looks something like this:

aaa,bbb,ccc,ddd,eee,fff,ggg,hhh
111,222,333,444,555,666,777,888
abb,acc,add,ddd,vvv,bxc,nyc,hhh

The first and third rows do not match exactly, but my columns of interest are column 4 and column 8. If two rows have the same data in these columns, as in the example, they should be treated as duplicate entries. My array should therefore contain only the first and second rows, not the third.

result=[]
for file in input_file:
    f=open(file,'r')
    reader = csv.reader(f, quotechar='"')#read csv 
    for row in reader:
        if row:
            #do some operations on the elements of row
                if(row[3] and row[7] not in result):#
                    result.append(row)#load result in array
                else:
                    continue

I expect the result array to be like this

aaa,bbb,ccc,ddd,eee,fff,ggg,hhh
111,222,333,444,555,666,777,888

whereas the output is

aaa,bbb,ccc,ddd,eee,fff,ggg,hhh
111,222,333,444,555,666,777,888
abb,acc,add,ddd,vvv,bxc,nyc,hhh

1. Load your CSV using pandas.
2. Pick the columns you are interested in as the subset to compare.
3. Use DataFrame.drop_duplicates() on that subset.

Reference: https://www.geeksforgeeks.org/python-pandas-dataframe-drop_duplicates/

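import pandas as pd
df = pd.read_csv("YOUR_FILE_NAME")
# keep='first' retains the first row of each duplicate group;
# keep=False would drop every row that has a duplicate, including the first.
df.drop_duplicates(subset=['first_interested_column', 'second_interested_column'],
                   keep='first', inplace=True)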
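
The sample data in the question has no header row, so there are no column names to pass as the subset. A minimal sketch of the same idea, assuming the file is read with header=None so that pandas labels the columns 0 to 7 (the in-memory csv_text stands in for a real file):

```python
import io
import pandas as pd

# Sample rows from the question; note that there is no header row.
csv_text = """aaa,bbb,ccc,ddd,eee,fff,ggg,hhh
111,222,333,444,555,666,777,888
abb,acc,add,ddd,vvv,bxc,nyc,hhh
"""

# header=None labels the columns 0..7 instead of using the first
# data row as column names; dtype=str keeps "111" etc. as text.
df = pd.read_csv(io.StringIO(csv_text), header=None, dtype=str)

# Keep only the first row for each (column 3, column 7) pair.
deduped = df.drop_duplicates(subset=[3, 7], keep="first")

rows = deduped.values.tolist()
print(rows)
```

With the three sample rows, the third row shares its column 3 and column 7 values with the first one and is dropped.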

The data you want to examine for duplicates is a pair of values (columns 3 and 7, using zero-based numbering). A set named seen is often used for that purpose. The basic idea is:

seen = set()
for row in reader:
    data = (row[3], row[7])
    if data in seen:
        continue
    seen.add(data)
    # process row
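
As a self-contained illustration of that idea, here is a sketch that wraps the loop in a generator. The function name unique_rows and its key_columns parameter are made up for this example, and the StringIO object stands in for an opened file:

```python
import csv
from io import StringIO


def unique_rows(lines, key_columns=(3, 7)):
    """Yield CSV rows whose values in key_columns have not been seen yet."""
    seen = set()
    for row in csv.reader(lines, quotechar='"'):
        if not row:
            continue  # skip blank lines
        key = tuple(row[i] for i in key_columns)
        if key in seen:
            continue  # same (column 3, column 7) pair as an earlier row
        seen.add(key)
        yield row


# Sample rows from the question.
sample = StringIO(
    "aaa,bbb,ccc,ddd,eee,fff,ggg,hhh\n"
    "111,222,333,444,555,666,777,888\n"
    "abb,acc,add,ddd,vvv,bxc,nyc,hhh\n"
)

result = list(unique_rows(sample))
for row in result:
    print(",".join(row))
```

Membership tests on a set take roughly constant time, so this stays fast as the number of rows grows. In this sketch the seen set is local to one call, so de-duplicating across several files would require sharing one set between them.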

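The problem with your code is that the test for duplicates is incorrect. The expression row[3] and row[7] not in result is evaluated as row[3] and (row[7] not in result), and result holds whole rows rather than individual values, so the test is true for every row with non-empty fields.
Here's a version I think does it correctly: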

import csv
from io import StringIO
from pprint import pprint, pformat

input_file = ['''
aaa,bbb,ccc,ddd,eee,fff,ggg,hhh
111,222,333,444,555,666,777,888
abb,acc,add,ddd,vvv,bxc,nyc,hhh
''',]

result=[]
for file in input_file:
#    f=open(file,'r')
    f = StringIO(file)
    reader = csv.reader(f, quotechar='"')  # read csv
    for row in reader:
        if row and not any((row[3] == r[3] and row[7] == r[7]) for r in result):
            result.append(row)  # load result in array

pprint(result)

Output:

[['aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff', 'ggg', 'hhh'],
 ['111', '222', '333', '444', '555', '666', '777', '888']]
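
To see the difference between the two tests directly, the following sketch evaluates both against the third sample row once the first row is already stored. The variable names original_test, string_in_rows and is_duplicate are only for illustration:

```python
# State after the first sample row has been stored.
result = [['aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff', 'ggg', 'hhh']]
# The third sample row, which should count as a duplicate.
row = ['abb', 'acc', 'add', 'ddd', 'vvv', 'bxc', 'nyc', 'hhh']

# "not in" binds more tightly than "and", so the test from the
# question is parsed like this:
original_test = row[3] and (row[7] not in result)

# result contains lists, so a plain string is never one of its elements.
string_in_rows = row[7] in result

# Compare the (column 3, column 7) pair with the same pair of each stored row.
is_duplicate = any((row[3], row[7]) == (r[3], r[7]) for r in result)

print(original_test, string_in_rows, is_duplicate)
```

Here original_test is truthy, which is why the third row was appended in the question's code, while is_duplicate correctly reports the match.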
