I have two columns ("Name" and "Value") in Excel.
There are duplicates (e.g. "xxa", "xxf") in the Value column, and the Python script needs to find which cell values are duplicated and collect the matching names into an array for each of them.
The output should be:
"xxa": ["aaa","bbb","ccc","hhh"]
"xxf": ["fff","jjj"]
How can I improve the current script?
import csv

value_col = []
name_value_col = []

file = open('columnData.csv')
csvreader = csv.reader(file)
next(csvreader)  # skip the header row
for row in csvreader:
    name = row[0]
    value = row[1]
    value_col.append(value)
    name_value_col.append(name+","+value)
file.close()

count={}
names=[]
for item in value_col:
    if value_col.count(item)>1:
        count[item]=value_col.count(item)
# the keys of count are the duplicated values
for name,value in count.items():
    names.append(name)
total=[]
for item in name_value_col:
    item_name=item.split(",")
    if item_name[1] in names:
        total.append(item_name[0])
print(total)
I'd recommend using defaultdict, and while you're at it, using csv.DictReader makes for more legible code:
import csv
from collections import defaultdict

data = defaultdict(list)
with open('columnData.csv') as f:
    reader = csv.DictReader(f)
    for row in reader:
        data[row['Value']].append(row['Name'])
Then, to find the duplicates, you can either take the destructive approach (pruning the non-duplicates in place):
# Remove non-duplicates here
for key in list(data.keys()):  # copy the keys, since entries are deleted while looping
    if len(data[key]) == 1:  # only one name in the list
        del data[key]
print(dict(data))
# Output: {'xxa': ['aaa', 'bbb', 'ccc', 'hhh'], 'xxf': ['fff', 'jjj']}
Or, if you prefer a non-destructive approach to finding duplicates:
def _filter_duplicates(data):
    for key, value in data.items():
        if len(value) > 1:
            yield key, value

def find_duplicates(data):
    return dict(_filter_duplicates(data))

print(find_duplicates(data))
# Output: {'xxa': ['aaa', 'bbb', 'ccc', 'hhh'], 'xxf': ['fff', 'jjj']}
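To try this without a file on disk, here is a minimal sketch that runs the same grouping on an in-memory CSV. The sample rows are made up to match the expected output in the question, the helper name group_names_by_value is hypothetical, and the filtering is written as a dict comprehension, which should behave the same as the generator version:

```python
import csv
import io
from collections import defaultdict

# Made-up sample data, shaped like the two-column sheet in the question.
SAMPLE_CSV = """Name,Value
aaa,xxa
bbb,xxa
ccc,xxa
ddd,xxb
fff,xxf
hhh,xxa
jjj,xxf
"""


def group_names_by_value(f):
    """Map each entry of the "Value" column to the list of names sharing it."""
    data = defaultdict(list)
    for row in csv.DictReader(f):
        data[row["Value"]].append(row["Name"])
    return data


def find_duplicates(data):
    """Return a new dict with only the values that occur more than once."""
    return {value: names for value, names in data.items() if len(names) > 1}


grouped = group_names_by_value(io.StringIO(SAMPLE_CSV))
duplicates = find_duplicates(grouped)
print(duplicates)
```

Because find_duplicates builds a new dict, grouped still contains the non-duplicated "xxb" entry afterwards, which is handy if the full grouping is needed later.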