How to manipulate a CSV in Python with the same values in a column and create a new one with unique values
I have a CSV file in which the same value appears more than once in one column (here unique_code). I want to create a new CSV in which each value of that column appears only once, with the differing values from the other column (here alternative_code) joined by spaces.
Here is my CSV. The columns are:
unique_code description alternative_code
33;product1;58
43;product2;95
33;product1;62
68;product3;11
43;product2;99
My desired CSV result:
33;product1;58 62
43;product2;95 99
68;product3;11
Any ideas on how I can produce the new CSV?
You can try something like:
# input_filename and output_filename are paths you supply
vals = {}   # unique_code -> list of alternative codes
names = {}  # unique_code -> description
with open(input_filename, 'r') as file:
    for line in file:
        l = line.rstrip("\n").split(";")
        # dict.has_key() was removed in Python 3; use "in" instead
        if l[0] in vals:
            vals[l[0]].append(l[2])
        else:
            vals[l[0]] = [l[2]]
            names[l[0]] = l[1]
with open(output_filename, 'w') as file:
    for key in vals:
        # join all alternative codes for this key with single spaces
        res = str(key) + ";" + str(names[key]) + ";" + " ".join(vals[key])
        file.write(res + '\n')

Alternatively, with the csv module:
import csv

with open("my_file.csv", 'r') as fd:
    # import csv as a list of lists and remove blank lines
    data = [i for i in csv.reader(fd, delimiter=';') if i]

result = []
for value in data:
    # check if product not in result
    if value[1] not in [r[1] for r in result if r]:
        # add the new product to result with all values for the same product
        result.append([value[0],
                       value[1],
                       ' '.join([line[2] for line in data if line[1] == value[1]])
                       ])
print(result)
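
Both answers above boil down to grouping rows by the first column. As a minimal sketch of the same idea using only the standard library: the function name `merge_rows` and the in-memory sample are made up for illustration, and it assumes every non-blank row has at least three `;`-separated fields and that there is no header row.

```python
import csv
import io

def merge_rows(rows):
    """Group rows by their first field; collect distinct third fields in first-seen order."""
    merged = {}  # unique_code -> (description, list of alternative codes)
    for row in rows:
        if not row:          # skip blank lines
            continue
        code, description, alt = row[0], row[1], row[2]
        _, alts = merged.setdefault(code, (description, []))
        if alt not in alts:  # keep each alternative code only once
            alts.append(alt)
    return [[code, desc, " ".join(alts)] for code, (desc, alts) in merged.items()]

# Sample data from the question, read from a string instead of a file.
sample = "33;product1;58\n43;product2;95\n33;product1;62\n68;product3;11\n43;product2;99\n"
merged_rows = merge_rows(csv.reader(io.StringIO(sample), delimiter=";"))
for row in merged_rows:
    print(";".join(row))
```

Because plain dicts keep insertion order (Python 3.7+), the output follows the order in which each code first appears in the input.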
In the end I arrived at this solution:
# -*- coding: utf-8 -*-
import csv

input_file_1 = "eidi.csv"
output_file = "output.csv"
parsed_dictionary = {}

def concatenate_alter_codes(alter_code_list):
    # join the alter codes with single spaces
    return " ".join(alter_code_list)

# Read input csv file and create a dictionary with a list of all alter codes
with open(input_file_1, 'r') as f:
    # use ; as the delimiter
    input_csv = csv.reader(f, delimiter=';')
    for row in input_csv:
        # if the key exists in the dictionary
        if row[0] in parsed_dictionary:
            parsed_dictionary[row[0]][0].append(row[2])
        else:
            parsed_dictionary[row[0]] = [[row[2]], row[1], row[3], row[4], row[5], row[6]]

# create new csv file with concatenated alter codes
with open(output_file, 'w') as f:
    for key in parsed_dictionary:
        entry = parsed_dictionary[key]
        # key, joined alter codes, then the remaining columns
        f.write(";".join([key, concatenate_alter_codes(entry[0])] + entry[1:]) + "\n")
littletable is a thin CSV wrapper I wrote a number of years ago. Tables in littletable are lists of objects, with some helper methods for filtering, joining, and pivoting, plus easy import/export of CSV, JSON, and fixed-format data. Like pandas, it helps with data import/export, but it doesn't have all the other numeric analytical features that pandas has. It also keeps all the data in memory as a list of Python objects, so it wouldn't handle millions of rows as well as pandas would. But if your needs are modest, littletable might have a shorter learning curve.
Loading your initial raw data into a littletable Table starts with:
import littletable as lt
data = open('raw_data.csv')
tt = lt.Table().csv_import(data, fieldnames="id name altid".split(), delimiter=';')
(If there were a header row in your input file, csv_import would use that and would not require that you specify fieldnames.)
Printing out the rows looks just like iterating over a list:
for row in tt:
    print(row)
prints:
{'name': 'product1', 'altid': '58', 'id': '33'}
{'name': 'product2', 'altid': '95', 'id': '43'}
{'name': 'product1', 'altid': '62', 'id': '33'}
{'name': 'product3', 'altid': '11', 'id': '68'}
{'name': 'product2', 'altid': '99', 'id': '43'}
Because we'll be grouping and joining on the id attribute, we add an index:
tt.create_index("id")
(Unique indexes can be created also, but in this case, there are duplicate values in your raw input with the same id.)
Tables can be grouped by one or more attributes, and then each group of records can be passed to a function to give an aggregate value for that group. In your case, you want all the collected altids for each product id.
def aggregate_altids(rows):
    return ' '.join(set(row.altid for row in rows if row.altid != row.id))

# the aggregate is named alt_ids to match the format string used for output later
grouped_altids = tt.groupby("id", alt_ids=aggregate_altids)
for row in grouped_altids:
    print(row)
Gives:
{'alt_ids': '62 58', 'id': '33'}
{'alt_ids': '99 95', 'id': '43'}
{'alt_ids': '11', 'id': '68'}
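
For readers unfamiliar with group-and-aggregate, the step can be sketched in plain Python. This is not littletable's implementation: `group_and_aggregate` and `join_altids` are hypothetical helpers, and rows are assumed to be plain dicts. A `set` has no defined order, which is why the listing above can show `62 58` rather than `58 62`; this sketch uses `dict.fromkeys` to keep first-seen order instead.

```python
from collections import defaultdict

def group_and_aggregate(rows, key, **aggregates):
    """Group dict rows by `key`, then apply each aggregate function to every group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    result = []
    for key_value, group in groups.items():
        out = {key: key_value}
        for name, func in aggregates.items():
            out[name] = func(group)  # each function receives the whole group
        result.append(out)
    return result

def join_altids(group):
    # dict.fromkeys drops duplicates while keeping first-seen order
    return " ".join(dict.fromkeys(row["altid"] for row in group))

rows = [
    {"id": "33", "name": "product1", "altid": "58"},
    {"id": "43", "name": "product2", "altid": "95"},
    {"id": "33", "name": "product1", "altid": "62"},
    {"id": "68", "name": "product3", "altid": "11"},
    {"id": "43", "name": "product2", "altid": "99"},
]
grouped = group_and_aggregate(rows, "id", alt_ids=join_altids)
for row in grouped:
    print(row)
```

The keyword name (`alt_ids` here) becomes the name of the aggregated field in each output row, which mirrors how the keyword argument is used in the groupby call above.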
Now we'll join this table with the original tt table on id, and collapse out duplicates:
tt2 = (grouped_altids.join_on('id') + tt)().unique("id")
And print out the results:
for row in tt2:
    print("{id};{name};{alt_ids}".format_map(vars(row)))
Giving:
33;product1;58 62
43;product2;95 99
68;product3;11
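
The question asks for a new CSV file rather than printed lines. A minimal sketch of that last step with the standard `csv` module, assuming the merged records are available as (id, name, alt_ids) tuples; the helper name `write_rows` is illustrative, and `io.StringIO` stands in for a real file opened with `open(path, 'w', newline='')`:

```python
import csv
import io

def write_rows(rows, fileobj):
    """Write rows as semicolon-delimited CSV lines."""
    writer = csv.writer(fileobj, delimiter=";", lineterminator="\n")
    writer.writerows(rows)

merged = [
    ("33", "product1", "58 62"),
    ("43", "product2", "95 99"),
    ("68", "product3", "11"),
]
buffer = io.StringIO()
write_rows(merged, buffer)
print(buffer.getvalue(), end="")
```

Using `csv.writer` instead of manual string concatenation means a field that happens to contain a `;` gets quoted rather than silently producing an extra column.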
The complete code, without the debugging output, looks like:
# import
import littletable as lt
with open('raw_data.csv') as data:
    tt = lt.Table().csv_import(data, fieldnames="id name altid".split(), delimiter=';')
tt.create_index("id")

# group
def aggregate_altids(rows):
    return ' '.join(set(row.altid for row in rows if row.altid != row.id))

grouped_altids = tt.groupby("id", alt_ids=aggregate_altids)

# join, dedupe, and sort
tt2 = (grouped_altids.join_on('id') + tt)().unique("id").sort("id")

# output
for row in tt2:
    print("{id};{name};{alt_ids}".format_map(vars(row)))