[英]Python unique values per column in csv file row
Crunching on this for a long time. 在很长一段时间内都在处理。 Is there an easy way using Numpy or Pandas or fixing my code to get the unique values for the column in a row separated by "|"
有没有一种简单的方法可以使用Numpy或Pandas或修复我的代码以获取由“ |”分隔的行中的唯一值
Ie the data: 即数据:
"id","fname","lname","education","gradyear","attributes"
"1","john","smith","mit|harvard|ft|ft|ft","2003|207|212|212|212","qa|admin,co|master|NULL|NULL"
"2","john","doe","htw","2000","dev"
Output should be: 输出应为:
"id","fname","lname","education","gradyear","attributes"
"1","john","smith","mit|harvard|ft","2003|207|212","qa|admin,co|master|NULL"
"2","john","doe","htw","2000","dev"
My broken code: 我的断码:
import csv
import pprint
your_list = csv.reader(open('out.csv'))
your_list = list(your_list)
#pprint.pprint(your_list)
string = "|"
cols_no=6
for line in your_list:
i=0
for col in line:
if i==cols_no:
print "\n"
i=0
if string in col:
values = col.split("|")
myset = set(values)
items = list()
for item in myset:
items.append(item)
print items
else:
print col+",",
i=i+1
It outputs: 它输出:
id, fname, lname, education, gradyear, attributes, 1, john, smith, ['harvard', 'ft', 'mit']
['2003', '212', '207']
['qa', 'admin,co', 'NULL', 'master']
2, john, doe, htw, 2000, dev,
Thanks in advance! 提前致谢!
numpy
/ pandas
is a bit overkill for what you can achieve with csv.DictReader
and csv.DictWriter
with a collections.OrderedDict
, eg: numpy
/ pandas
对于使用具有collections.OrderedDict
csv.DictReader
和csv.DictWriter
可以实现的功能来说有点过头了,例如:
import csv
from collections import OrderedDict
# If using Python 2.x - use `open('output.csv', 'wb') instead
with open('input.csv') as fin, open('output.csv', 'w') as fout:
csvin = csv.DictReader(fin)
csvout = csv.DictWriter(fout, fieldnames=csvin.fieldnames, quoting=csv.QUOTE_ALL)
csvout.writeheader()
for row in csvin:
for k, v in row.items():
row[k] = '|'.join(OrderedDict.fromkeys(v.split('|')))
csvout.writerow(row)
Gives you: 给你:
"id","fname","lname","education","gradyear","attributes"
"1","john","smith","mit|harvard|ft","2003|207|212","qa|admin,co|master|NULL"
"2","john","doe","htw","2000","dev"
If you don't care about the order when you have many items separated with |
如果您不关心订单,当有许多用
|
分隔的项目时 , this will work: ,这将起作用:
lst = ["id","fname","lname","education","gradyear","attributes",
"1","john","smith","mit|harvard|ft|ft|ft","2003|207|212|212|212","qa|admin,co|master|NULL|NULL",
"2","john","doe","htw","2000","dev"]
def no_duplicate(string):
return "|".join(set(string.split("|")))
result = map(no_duplicate, lst)
print result
result: 结果:
['id', 'fname', 'lname', 'education', 'gradyear', 'attributes', '1', 'john', 'smith', 'ft|harvard|mit', '2003|207|212', 'NULL|admin,co|master|qa', '2', 'john', 'doe', 'htw', '2000', 'dev']
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.