Python + CSV: Summing similar values from CSV columns

Question

Input file:

$ cat dummy.csv 
OS,A,B,C,D,E
Ubuntu,0,1,0,1,1
Windows,0,0,1,1,1
Mac,1,0,1,0,0
Ubuntu,1,1,1,1,0
Windows,0,0,1,1,0
Mac,1,0,1,1,1
Ubuntu,0,1,0,1,1
Ubuntu,0,0,1,1,1
Ubuntu,1,0,1,0,0
Ubuntu,1,1,1,1,0
Mac,0,0,1,1,0
Mac,1,0,1,1,1
Windows,1,1,1,1,0
Ubuntu,0,0,1,1,0
Windows,1,0,1,1,1
Mac,0,1,0,1,1
Windows,0,0,1,1,1
Mac,1,0,1,0,0
Windows,1,1,1,1,0
Mac,0,0,1,1,0

Expected output:

OS,A,B,C,D,E
Mac,4,1,6,5,3
Ubuntu,3,4,5,6,3
Windows,3,2,6,6,3

I generated the output above with an Excel pivot table.

My code:

import csv
import pprint
from collections import defaultdict

d = defaultdict(dict)

with open('dummy.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        d[row['OS']]['A'] += row['A']
        d[row['OS']]['B'] += row['B']
        d[row['OS']]['C'] += row['C']
        d[row['OS']]['D'] += row['D']
        d[row['OS']]['E'] += row['E']

pprint.pprint(d)

Error:

$ python3 dummy.py
Traceback (most recent call last):
  File "dummy.py", line 10, in <module>
    d[row['OS']]['A'] += row['A']
KeyError: 'A'

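My idea was to accumulate the CSV values in a dictionary and then print it. However, I get the error above as soon as I try to add the values.

This seems achievable with the built-in csv module, and I thought it would be simple :( Any pointers would be a great help.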

Answer 1

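There are two problems. The nested dictionary initially has no keys set, so d[row['OS']]['A'] raises an error. The other problem is that you need to convert the column values to int before adding them.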
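
Both problems can be reproduced in isolation. The following is a minimal sketch that uses a two-row in-memory sample instead of dummy.csv; the name `sample` and its contents are assumptions made only for illustration:

```python
import csv
import io
from collections import defaultdict

# Two illustrative rows standing in for dummy.csv (not the real file).
sample = "OS,A,B\nUbuntu,0,1\nUbuntu,1,1\n"
rows = list(csv.DictReader(io.StringIO(sample)))

# Problem 1: defaultdict(dict) creates an empty inner dict on first access,
# so the inner key 'A' is still missing and += raises KeyError.
d = defaultdict(dict)
try:
    d[rows[0]['OS']]['A'] += rows[0]['A']
    error = None
except KeyError as exc:
    error = exc

# Problem 2: the csv module yields strings, so + concatenates instead of adding.
concatenated = rows[0]['B'] + rows[1]['B']
added = int(rows[0]['B']) + int(rows[1]['B'])

print(repr(error), repr(concatenated), added)
```

Here `concatenated` is a string while `added` is the numeric sum, which is why the conversion to int is needed even once the KeyError is dealt with.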

You can use Counter as the defaultdict value, since missing keys default to 0:

import csv
from collections import Counter, defaultdict

d = defaultdict(Counter)

with open('dummy.csv') as csvfile:
    reader = csv.DictReader(csvfile)

    for row in reader:
        nested = d[row.pop('OS')]
        for k, v in row.items():
            nested[k] += int(v)

print(*d.items(), sep='\n')

Output:

('Ubuntu', Counter({'D': 6, 'C': 5, 'B': 4, 'E': 3, 'A': 3}))
('Windows', Counter({'C': 6, 'D': 6, 'E': 3, 'A': 3, 'B': 2}))
('Mac', Counter({'C': 6, 'D': 5, 'A': 4, 'E': 3, 'B': 1}))
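
The reason `nested[k] += int(v)` works without any initialization can be checked with a short sketch. The variable names here are illustrative and not part of the answer's code:

```python
from collections import Counter

c = Counter()

missing = c['A']          # reading a missing key returns 0 instead of raising KeyError
has_key = 'A' in c        # the read above did not insert the key

c['A'] += int('1')        # += reads the implicit 0, adds, then stores the result
print(missing, has_key, c)
```

Because a read of a missing key yields 0, the first `+=` for each column behaves like an assignment of the first value.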

Answer 2

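This doesn't exactly answer your question, since the csv module can indeed solve the problem, but it's worth mentioning that pandas is well suited to this kind of task: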

In [1]: import pandas as pd

In [2]: df = pd.read_csv('dummy.csv')

In [3]: df.groupby('OS').sum()
Out[3]:
         A  B  C  D  E
OS
Mac      4  1  6  5  3
Ubuntu   3  4  5  6  3
Windows  3  2  6  6  3

Answer 3

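Something like this? You can write the DataFrame to a CSV file to get the desired format.

import pandas as pd

# Load the data; pd.read_clipboard(sep=',') also works if the CSV text is on the clipboard.
df0 = pd.read_csv('dummy.csv')
df = df0.copy()
df = df.groupby(by='OS').sum()
print(df)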

Output:

         A  B  C  D  E
OS                    
Mac      4  1  6  5  3
Ubuntu   3  4  5  6  3
Windows  3  2  6  6  3

df.to_csv('file01')

Contents of file01:

OS,A,B,C,D,E
Mac,4,1,6,5,3
Ubuntu,3,4,5,6,3
Windows,3,2,6,6,3

Answer 4

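You get that exception because, the first time a given row['OS'] is seen, it does not yet exist in d, so 'A' does not exist in d[row['OS']] either. Try the following to fix it: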

import csv
from collections import defaultdict

d = defaultdict(dict)

with open('dummy.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        d[row['OS']]['A'] = d[row['OS']]['A'] + int(row['A']) if (row['OS'] in d and 'A' in d[row['OS']]) else int(row['A'])
        d[row['OS']]['B'] = d[row['OS']]['B'] + int(row['B']) if (row['OS'] in d and 'B' in d[row['OS']]) else int(row['B'])
        d[row['OS']]['C'] = d[row['OS']]['C'] + int(row['C']) if (row['OS'] in d and 'C' in d[row['OS']]) else int(row['C'])
        d[row['OS']]['D'] = d[row['OS']]['D'] + int(row['D']) if (row['OS'] in d and 'D' in d[row['OS']]) else int(row['D'])
        d[row['OS']]['E'] = d[row['OS']]['E'] + int(row['E']) if (row['OS'] in d and 'E' in d[row['OS']]) else int(row['E'])

Output:

>>> import pprint
>>>
>>> pprint.pprint(dict(d))
{'Mac': {'A': 4, 'B': 1, 'C': 6, 'D': 5, 'E': 3},
 'Ubuntu': {'A': 3, 'B': 4, 'C': 5, 'D': 6, 'E': 3},
 'Windows': {'A': 3, 'B': 2, 'C': 6, 'D': 6, 'E': 3}}
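
The five near-identical lines can also be collapsed into a loop. This is a sketch of the same idea using dict.get with a default of 0, not the answer's original code; it reads a three-row in-memory sample (the name `sample` is an assumption) so that it runs without dummy.csv:

```python
import csv
import io
from collections import defaultdict

# Illustrative subset of the question's data.
sample = """OS,A,B,C,D,E
Ubuntu,0,1,0,1,1
Windows,0,0,1,1,1
Ubuntu,1,1,1,1,0
"""

d = defaultdict(dict)
reader = csv.DictReader(io.StringIO(sample))
for row in reader:
    totals = d[row['OS']]
    # Every column after 'OS' is treated as a numeric column.
    for col in reader.fieldnames[1:]:
        # get() supplies 0 the first time a column is seen for this OS.
        totals[col] = totals.get(col, 0) + int(row[col])

print(dict(d))
```

Driving the loop from `reader.fieldnames` avoids hard-coding the column names, at the cost of assuming that every column other than OS holds integers.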

Answer 5

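d is a dictionary, so d[row['OS']] is a valid expression, but d[row['OS']]['A'] expects that dictionary item to already hold a value for 'A'. Since defaultdict(dict) only supplies an empty dict and provides no default for the inner key, the lookup fails with KeyError.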
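
A quick check of what defaultdict(dict) actually supplies on first access, followed by one possible way to give the inner keys a default as well. The nested defaultdict(int) is an alternative sketch, not something proposed in this answer:

```python
from collections import defaultdict

plain = defaultdict(dict)
inner = plain['Ubuntu']           # an empty dict is created on first access

# Alternative: make the inner mapping default missing keys to 0.
nested = defaultdict(lambda: defaultdict(int))
nested['Ubuntu']['A'] += 1        # no KeyError: int() supplies the starting 0

print(inner, nested['Ubuntu']['A'])
```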

Answer 6

This extends niemmi's solution to format the output the same way as the OP's example:

import csv
from collections import Counter, defaultdict

d = defaultdict(Counter)
with open('dummy.csv') as csv_file:
    reader = csv.DictReader(csv_file)
    field_names = reader.fieldnames
    for row in reader:
        # Pop the OS column so only the numeric columns remain in row.
        counter = d[row.pop('OS')]
        for key, value in row.items():
            counter[key] += int(value)

# Print the header, then one line per OS with totals ordered by column name.
print(','.join(field_names))
for os_name, counter in sorted(d.items()):
    print("%s,%s" % (os_name, ','.join(str(v) for k, v in sorted(counter.items()))))

Output:

OS,A,B,C,D,E
Mac,4,1,6,5,3
Ubuntu,3,4,5,6,3
Windows,3,2,6,6,3

Update: fixed the output.
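
As a variation, the final formatting step could be handed to csv.writer instead of joining strings by hand. The sketch below assumes the totals have already been computed (the values are copied from the expected output) and writes to an in-memory buffer; the names `totals`, `buffer` and `output` are illustrative:

```python
import csv
import io
from collections import Counter

# Totals as produced by the aggregation step (values from the expected output).
totals = {
    'Ubuntu': Counter(A=3, B=4, C=5, D=6, E=3),
    'Windows': Counter(A=3, B=2, C=6, D=6, E=3),
    'Mac': Counter(A=4, B=1, C=6, D=5, E=3),
}
field_names = ['OS', 'A', 'B', 'C', 'D', 'E']

# An in-memory buffer keeps the sketch self-contained; a file opened with
# newline='' could be used in its place.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(field_names)
for os_name in sorted(totals):
    # Look the values up by header name so the columns follow the header order.
    writer.writerow([os_name] + [totals[os_name][col] for col in field_names[1:]])

output = buffer.getvalue()
print(output, end='')
```

Looking values up by header name keeps the columns aligned with the header even if the column names do not happen to sort alphabetically.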

Answer 7

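I'm assuming your input file is named input_file.csv.

You can also use groupby from the itertools module together with two dicts to process the data and get the desired output, as in the following example: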

from itertools import groupby

# Read every line into a list of fields; the with block closes the file afterwards.
with open("input_file.csv", 'r') as f:
    data = [k.strip("\n").split(",") for k in f]

a, b = {}, {}
for k, v in groupby(data[1:], lambda x : x[0]):
    try:
        a[k] += [i[1:] for i in list(v)]
    except KeyError:
        a[k] = [i[1:] for i in list(v)]

for key in a.keys():
    for j in range(5):
        c = 0
        for i in a[key]:
            c += int(i[j])
        try:
            b[key] += ',' + str(c) 
        except KeyError:
            b[key] = str(c)

Output:

print(','.join(data[0]))
for k in b.keys():
    print("{0},{1}".format(k, b[k]))

OS,A,B,C,D,E
Ubuntu,3,4,5,6,3
Windows,3,2,6,6,3
Mac,4,1,6,5,3
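
Note that itertools.groupby only merges adjacent items with the same key, which is why the code above has to accumulate into a dict with try/except. A possible simplification, sketched here on a small in-memory sample (the names `sample` and `result` are assumptions), is to sort by the OS column first so each OS forms a single group:

```python
import csv
import io
from itertools import groupby
from operator import itemgetter

# Illustrative unsorted sample with two numeric columns.
sample = """OS,A,B
Ubuntu,0,1
Mac,1,0
Ubuntu,1,1
Mac,1,0
"""

reader = csv.reader(io.StringIO(sample))
header = next(reader)
# Sorting first guarantees that all rows for one OS are adjacent.
rows = sorted(reader, key=itemgetter(0))

result = {}
for os_name, group in groupby(rows, key=itemgetter(0)):
    # zip(*...) transposes the group's value columns so each can be summed.
    columns = zip(*(row[1:] for row in group))
    result[os_name] = [sum(int(v) for v in col) for col in columns]

print(','.join(header))
for os_name, sums in result.items():
    print(','.join([os_name] + [str(s) for s in sums]))
```

Sorting costs O(n log n), so for large files the dictionary-based accumulation in the other answers may be the better fit.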

7 answers

Answer 1
1 accepted 2016-12-28 13:53:08

Answer 2
1 2016-12-28 13:59:01

Answer 3
1 2016-12-28 13:59:32

Answer 4
1 2016-12-28 14:17:30

Answer 5
0 2016-12-28 13:51:59

Answer 6
0 2016-12-28 15:00:10

Answer 7
0 2016-12-28 21:28:41
