Python + CSV: Sum up similar values from CSV columns
Input file:
$ cat dummy.csv
OS,A,B,C,D,E
Ubuntu,0,1,0,1,1
Windows,0,0,1,1,1
Mac,1,0,1,0,0
Ubuntu,1,1,1,1,0
Windows,0,0,1,1,0
Mac,1,0,1,1,1
Ubuntu,0,1,0,1,1
Ubuntu,0,0,1,1,1
Ubuntu,1,0,1,0,0
Ubuntu,1,1,1,1,0
Mac,0,0,1,1,0
Mac,1,0,1,1,1
Windows,1,1,1,1,0
Ubuntu,0,0,1,1,0
Windows,1,0,1,1,1
Mac,0,1,0,1,1
Windows,0,0,1,1,1
Mac,1,0,1,0,0
Windows,1,1,1,1,0
Mac,0,0,1,1,0
Expected output:
OS,A,B,C,D,E
Mac,4,1,6,5,3
Ubuntu,3,4,5,6,3
Windows,3,2,6,6,3
I generated the above output with an Excel pivot table.
My code:
import csv
import pprint
from collections import defaultdict

d = defaultdict(dict)

with open('dummy.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        d[row['OS']]['A'] += row['A']
        d[row['OS']]['B'] += row['B']
        d[row['OS']]['C'] += row['C']
        d[row['OS']]['D'] += row['D']
        d[row['OS']]['E'] += row['E']

pprint.pprint(d)
Error:
$ python3 dummy.py
Traceback (most recent call last):
  File "dummy.py", line 10, in <module>
    d[row['OS']]['A'] += row['A']
KeyError: 'A'
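My idea is to accumulate the CSV values in a dictionary and then print it. However, when I try to add the values, I get the error above.
This seems achievable with the built-in csv module. I thought it would be simple :( Any pointers would be a great help.

There are two problems. The nested dictionary initially has no keys set, so d[row['OS']]['A'] raises an error. The other problem is that you need to convert the column values to int before adding them.
You can use Counter as the defaultdict value, since missing keys default to 0: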
import csv
from collections import Counter, defaultdict

d = defaultdict(Counter)

with open('dummy.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        # Remove the OS column; the remaining columns are the values to sum.
        nested = d[row.pop('OS')]
        for k, v in row.items():
            nested[k] += int(v)

print(*d.items(), sep='\n')
Output:
('Ubuntu', Counter({'D': 6, 'C': 5, 'B': 4, 'E': 3, 'A': 3}))
('Windows', Counter({'C': 6, 'D': 6, 'E': 3, 'A': 3, 'B': 2}))
('Mac', Counter({'C': 6, 'D': 5, 'A': 4, 'E': 3, 'B': 1}))
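The two points from this answer can be checked in isolation. The following is a minimal sketch, assuming a made-up two-row `sample` string in place of dummy.csv; it shows that csv yields strings (so `+` concatenates unless you convert with int) and that a Counter reports 0 for a missing key.

```python
import csv
import io
from collections import Counter

# Hypothetical two-row sample standing in for dummy.csv.
sample = "OS,A,B\nUbuntu,0,1\nUbuntu,1,1\n"
rows = list(csv.DictReader(io.StringIO(sample)))

# csv returns every field as a string, so "+" concatenates.
concatenated = rows[0]['A'] + rows[1]['A']
# Converting to int first gives numeric addition.
added = int(rows[0]['A']) + int(rows[1]['A'])

# A Counter returns 0 for a key it has not seen, so "+=" works immediately.
c = Counter()
missing = c['A']
c['A'] += added

print(repr(concatenated), added, missing, c['A'])
```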
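
This doesn't exactly answer your question, since csv can certainly solve the problem, but it is worth mentioning that pandas is very well suited to this kind of task: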
In [1]: import pandas as pd

In [2]: df = pd.read_csv('dummy.csv')

In [3]: df.groupby('OS').sum()
Out[3]:
         A  B  C  D  E
OS
Mac      4  1  6  5  3
Ubuntu   3  4  5  6  3
Windows  3  2  6  6  3
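
Something like this? You can write the DataFrame to a CSV file to get the desired format.
import pandas as pd

# The original answer loaded the data with pd.read_clipboard(sep=',');
# reading dummy.csv from disk is assumed here so the snippet runs as-is.
df0 = pd.read_csv('dummy.csv')
df = df0.copy()
df = df.groupby(by='OS').sum()
print(df)
Output:
         A  B  C  D  E
OS
Mac      4  1  6  5  3
Ubuntu   3  4  5  6  3
Windows  3  2  6  6  3
df.to_csv('file01')
Contents of file01: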
OS,A,B,C,D,E
Mac,4,1,6,5,3
Ubuntu,3,4,5,6,3
Windows,3,2,6,6,3

You get that exception because the first time through, d has no entry for row['OS'], so 'A' does not exist in d[row['OS']]. Try the following to fix it:
import csv
from collections import defaultdict

d = defaultdict(dict)

with open('dummy.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        # Add to the running total if the key already exists, otherwise start it.
        d[row['OS']]['A'] = d[row['OS']]['A'] + int(row['A']) if (row['OS'] in d and 'A' in d[row['OS']]) else int(row['A'])
        d[row['OS']]['B'] = d[row['OS']]['B'] + int(row['B']) if (row['OS'] in d and 'B' in d[row['OS']]) else int(row['B'])
        d[row['OS']]['C'] = d[row['OS']]['C'] + int(row['C']) if (row['OS'] in d and 'C' in d[row['OS']]) else int(row['C'])
        d[row['OS']]['D'] = d[row['OS']]['D'] + int(row['D']) if (row['OS'] in d and 'D' in d[row['OS']]) else int(row['D'])
        d[row['OS']]['E'] = d[row['OS']]['E'] + int(row['E']) if (row['OS'] in d and 'E' in d[row['OS']]) else int(row['E'])
Output:
>>> import pprint
>>>
>>> pprint.pprint(dict(d))
{'Mac': {'A': 4, 'B': 1, 'C': 6, 'D': 5, 'E': 3},
 'Ubuntu': {'A': 3, 'B': 4, 'C': 5, 'D': 6, 'E': 3},
 'Windows': {'A': 3, 'B': 2, 'C': 6, 'D': 6, 'E': 3}}
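
d is a defaultdict(dict), so d[row['OS']] is a valid expression: for a new key it creates and returns an empty dict. However, d[row['OS']]['A'] += ... has to read 'A' from that inner dict before it can add to it. Since no default value is provided for the inner keys, that lookup raises KeyError.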
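What defaultdict(dict) returns for a missing key can be checked directly. The sketch below also shows one possible alternative, a nested defaultdict(int), so that inner keys start at 0. The `totals` name and the sample values are made up for illustration and do not come from the thread.

```python
from collections import defaultdict

# With defaultdict(dict), a missing outer key yields an empty dict.
d = defaultdict(dict)
inner = d['Ubuntu']
inner_is_empty_dict = isinstance(inner, dict) and inner == {}

# "+=" on the inner dict still fails, because 'A' is not there yet.
try:
    d['Ubuntu']['A'] += 1
    raised = False
except KeyError:
    raised = True

# One possible alternative: make the inner mapping a defaultdict(int),
# so missing inner keys start at 0 and "+=" works right away.
totals = defaultdict(lambda: defaultdict(int))
totals['Ubuntu']['A'] += int('1')
totals['Ubuntu']['A'] += int('0')
totals['Mac']['B'] += int('1')

print(inner_is_empty_dict, raised, dict(totals['Ubuntu']), dict(totals['Mac']))
```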

This extends niemmi's solution to format the output the same way as the OP's example:
import csv
from collections import Counter, defaultdict

d = defaultdict(Counter)

with open('dummy.csv') as csv_file:
    reader = csv.DictReader(csv_file)
    field_names = reader.fieldnames
    for row in reader:
        counter = d[row.pop('OS')]
        for key, value in row.items():
            counter[key] += int(value)

# Header first, then one line per OS with its totals in column-name order.
print(','.join(field_names))
for os_name, counter in sorted(d.items()):
    print("%s,%s" % (os_name, ','.join(str(v) for k, v in sorted(counter.items()))))
Output:
OS,A,B,C,D,E
Mac,4,1,6,5,3
Ubuntu,3,4,5,6,3
Windows,3,2,6,6,3
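As a variation on the formatting step above, csv.DictWriter can produce the same layout without joining the strings by hand. This is a minimal sketch, not the answerer's code: the helper name `sum_by_first_column` and the three-row `sample` are made up for illustration, and the function works on an in-memory string rather than on dummy.csv.

```python
import csv
import io
from collections import Counter, defaultdict


def sum_by_first_column(csv_text):
    """Sum the numeric columns of csv_text per value of its first column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    key_field, *value_fields = reader.fieldnames
    totals = defaultdict(Counter)
    for row in reader:
        counter = totals[row[key_field]]
        for name in value_fields:
            counter[name] += int(row[name])

    # Write the header and one row per key, sorted by key.
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator='\n')
    writer.writeheader()
    for key in sorted(totals):
        writer.writerow({key_field: key, **totals[key]})
    return out.getvalue()


# Hypothetical sample in the same shape as dummy.csv.
sample = "OS,A,B\nUbuntu,0,1\nMac,1,0\nUbuntu,1,1\n"
result = sum_by_first_column(sample)
print(result, end='')
```

To write to a real file instead, the same writer calls can be pointed at a file opened with newline=''.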
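
Update: fixed the output.
I assume your input file is named input_file.csv.
You can also use groupby from the itertools module together with two dicts to process the data and get the desired output, as in the following example: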
from itertools import groupby

data = list(k.strip("\n").split(",") for k in open("input_file.csv", 'r'))
a, b = {}, {}

# Collect the value columns of every row under its OS name.
for k, v in groupby(data[1:], lambda x : x[0]):
    try:
        a[k] += [i[1:] for i in list(v)]
    except KeyError:
        a[k] = [i[1:] for i in list(v)]

# Sum each of the five value columns and build a comma-separated string per OS.
for key in a.keys():
    for j in range(5):
        c = 0
        for i in a[key]:
            c += int(i[j])
        try:
            b[key] += ',' + str(c)
        except KeyError:
            b[key] = str(c)
Output:
print(','.join(data[0]))
for k in b.keys():
    print("{0},{1}".format(k, b[k]))
>>> OS,A,B,C,D,E
>>> Ubuntu,3,4,5,6,3
>>> Windows,3,2,6,6,3
>>> Mac,4,1,6,5,3
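Note that itertools.groupby only groups consecutive rows with the same key, which is why the code above has to merge repeated groups with try/except. A minimal sketch of the alternative, sorting by key before grouping, follows; the `rows` list is made-up sample data, not the contents of input_file.csv.

```python
from itertools import groupby

# Hypothetical rows in the same shape as the parsed CSV lines (header removed).
rows = [
    ["Ubuntu", "0", "1"],
    ["Mac", "1", "0"],
    ["Ubuntu", "1", "1"],
    ["Mac", "1", "1"],
]

# Without sorting, groupby starts a new group every time the key changes.
unsorted_keys = [key for key, _ in groupby(rows, key=lambda r: r[0])]

# Sorting by the key first gives exactly one group per OS.
totals = {}
for key, group in groupby(sorted(rows, key=lambda r: r[0]), key=lambda r: r[0]):
    # zip(*...) turns the group's rows into columns, which are then summed.
    columns = zip(*(r[1:] for r in group))
    totals[key] = [sum(int(v) for v in col) for col in columns]

print(unsorted_keys)
print(totals)
```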