I am writing a piece of code in Python where I am working with arrays. I am loading data row-wise from a CSV file into my array. The data looks something like this:
aaa,bbb,ccc,ddd,eee,fff,ggg,hhh
111,222,333,444,555,666,777,888
abb,acc,add,ddd,vvv,bxc,nyc,hhh
In the first and third rows, even though the rows do not match exactly, my columns of interest are column 4 and column 8. That is, if two rows have the same data in these columns, as shown in the example, they should be treated as duplicate entries: my array should contain only the first and second rows and not the third.
result = []
for file in input_file:
    f = open(file, 'r')
    reader = csv.reader(f, quotechar='"')  # read csv
    for row in reader:
        if row:
            # do some operations on the elements of row
            if(row[3] and row[7] not in result):
                result.append(row)  # load result in array
            else:
                continue
I expect the result array to look like this:
aaa,bbb,ccc,ddd,eee,fff,ggg,hhh
111,222,333,444,555,666,777,888
whereas the actual output is:
aaa,bbb,ccc,ddd,eee,fff,ggg,hhh
111,222,333,444,555,666,777,888
abb,acc,add,ddd,vvv,bxc,nyc,hhh
1: Load your CSV using pandas.
2: Restrict the duplicate check to the columns of interest.
3: Use DataFrame.drop_duplicates().
For reference, see https://www.geeksforgeeks.org/python-pandas-dataframe-drop_duplicates/
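import pandas as pd
# The sample data has no header row, so columns are numbered from 0
df = pd.read_csv("YOUR_FILE_NAME", header=None)
# keep='first' retains the first occurrence; keep=False would drop every duplicated row
df.drop_duplicates(subset=[3, 7], keep='first', inplace=True)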
The data you want to examine for duplicates is a pair of values (columns 3 and 7, using zero-based numbering). A set named seen is often used for that purpose. The basic idea is:
seen = set()
for row in reader:
    data = (row[3], row[7])
    if data in seen:  # this pair was already encountered
        continue
    seen.add(data)
    # process row
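The idea above can be fleshed out into a small runnable sketch. The helper name dedupe_rows and its key_columns parameter are made up for illustration, and a StringIO object stands in for the opened file so the example is self-contained.

```python
import csv
from io import StringIO

# Sample data from the question; StringIO stands in for an opened file.
sample = StringIO(
    "aaa,bbb,ccc,ddd,eee,fff,ggg,hhh\n"
    "111,222,333,444,555,666,777,888\n"
    "abb,acc,add,ddd,vvv,bxc,nyc,hhh\n"
)


def dedupe_rows(rows, key_columns=(3, 7)):
    """Return the rows whose values in key_columns have not appeared before.

    Hypothetical helper: the name and signature are illustrative only.
    """
    seen = set()
    unique = []
    for row in rows:
        if not row:  # skip blank lines
            continue
        key = tuple(row[i] for i in key_columns)
        if key in seen:  # same key columns as an earlier row
            continue
        seen.add(key)
        unique.append(row)
    return unique


result = dedupe_rows(csv.reader(sample, quotechar='"'))
for row in result:
    print(",".join(row))
```

Because membership tests on a set take roughly constant time, this approach should stay fast even when the file has many rows.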
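The problem with your code is that the test for duplicates is incorrect: row[3] and row[7] not in result is evaluated as row[3] and (row[7] not in result), and since result holds whole rows (lists), a single string is never found in it.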
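To see why the original condition never filters anything, here is a short sketch using the first and third sample rows. The variable names original_test, string_in_result and is_duplicate are introduced only for this illustration.

```python
# State of the list after the first row has been appended.
result = [["aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg", "hhh"]]
# The third row, which should be rejected as a duplicate.
row = ["abb", "acc", "add", "ddd", "vvv", "bxc", "nyc", "hhh"]

# The original condition parses as: row[3] and (row[7] not in result)
original_test = bool(row[3] and row[7] not in result)

# result contains lists, so the string "hhh" is never one of its elements.
string_in_result = row[7] in result

# Comparing the pair of key columns against each stored row does detect it.
is_duplicate = any((row[3], row[7]) == (r[3], r[7]) for r in result)

print(original_test, string_in_result, is_duplicate)
```

The original test comes out true for the third row, so the row gets appended even though both key columns match the first row.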
Here's a version I think does it correctly:
import csv
from io import StringIO
from pprint import pprint

input_file = ['''
aaa,bbb,ccc,ddd,eee,fff,ggg,hhh
111,222,333,444,555,666,777,888
abb,acc,add,ddd,vvv,bxc,nyc,hhh
''',]

result = []
for file in input_file:
    # f=open(file,'r')
    f = StringIO(file)
    reader = csv.reader(f, quotechar='"')  # read csv
    for row in reader:
        if row and not any((row[3] == r[3] and row[7] == r[7]) for r in result):
            result.append(row)  # load result in array

pprint(result)
Output:
[['aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff', 'ggg', 'hhh'],
['111', '222', '333', '444', '555', '666', '777', '888']]