
Python + CSV: Sum up similar values from CSV columns

Input file:

$ cat dummy.csv 
OS,A,B,C,D,E
Ubuntu,0,1,0,1,1
Windows,0,0,1,1,1
Mac,1,0,1,0,0
Ubuntu,1,1,1,1,0
Windows,0,0,1,1,0
Mac,1,0,1,1,1
Ubuntu,0,1,0,1,1
Ubuntu,0,0,1,1,1
Ubuntu,1,0,1,0,0
Ubuntu,1,1,1,1,0
Mac,0,0,1,1,0
Mac,1,0,1,1,1
Windows,1,1,1,1,0
Ubuntu,0,0,1,1,0
Windows,1,0,1,1,1
Mac,0,1,0,1,1
Windows,0,0,1,1,1
Mac,1,0,1,0,0
Windows,1,1,1,1,0
Mac,0,0,1,1,0

Expected output:

OS,A,B,C,D,E
Mac,4,1,6,5,3
Ubuntu,3,4,5,6,3
Windows,3,2,6,6,3

I generated the output above with an Excel pivot table.

My code:

import csv
import pprint
from collections import defaultdict

d = defaultdict(dict)

with open('dummy.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        d[row['OS']]['A'] += row['A']
        d[row['OS']]['B'] += row['B']
        d[row['OS']]['C'] += row['C']
        d[row['OS']]['D'] += row['D']
        d[row['OS']]['E'] += row['E']

pprint.pprint(d)

Error:

$ python3 dummy.py
Traceback (most recent call last):
  File "dummy.py", line 10, in <module>
    d[row['OS']]['A'] += row['A']
KeyError: 'A'

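My idea is to accumulate the CSV values in a dictionary and then print it. However, when I try to add the values, I get the error above.

This seems achievable with the built-in csv module, and I thought it would be simple :( Any pointers would be a great help.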

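There are two problems. The nested dict initially has no keys set, so d[row['OS']]['A'] raises a KeyError. The other problem is that you need to convert the column values to int before adding them.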
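Both problems can be reproduced in isolation. The following is a minimal sketch, not part of the original answer; the names `sample`, `totals`, `as_text`, and `as_number` are illustrative, and the one-row inline sample stands in for dummy.csv.

```python
import csv
import io
from collections import defaultdict

# One header line and one data row, standing in for dummy.csv (illustrative).
sample = "OS,A,B\nUbuntu,0,1\n"
row = next(csv.DictReader(io.StringIO(sample)))

totals = defaultdict(dict)

# Problem 1: the inner dict is created empty, so `+=` has nothing to read.
try:
    totals[row['OS']]['A'] += 1
    first_error = None
except KeyError as exc:
    first_error = exc

# Problem 2: DictReader yields strings, so `+` concatenates instead of adding.
as_text = row['A'] + row['B']              # string concatenation
as_number = int(row['A']) + int(row['B'])  # integer addition

print(type(first_error).__name__, repr(as_text), as_number)
```

Here the first lookup fails with KeyError, the string "sum" is '01', and the integer sum is 1.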

You can use Counter as the defaultdict value, since missing keys in a Counter default to 0:

import csv
from collections import Counter, defaultdict

d = defaultdict(Counter)

with open('dummy.csv') as csvfile:
    reader = csv.DictReader(csvfile)

    for row in reader:
        nested = d[row.pop('OS')]
        for k, v in row.items():
            nested[k] += int(v)

print(*d.items(), sep='\n')

Output:

('Ubuntu', Counter({'D': 6, 'C': 5, 'B': 4, 'E': 3, 'A': 3}))
('Windows', Counter({'C': 6, 'D': 6, 'E': 3, 'A': 3, 'B': 2}))
('Mac', Counter({'C': 6, 'D': 5, 'A': 4, 'E': 3, 'B': 1}))

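This doesn't strictly answer your question, since csv can certainly do the job, but it's worth mentioning that pandas is well suited to this kind of task: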

In [1]: import pandas as pd

In [2]: df = pd.read_csv('dummy.csv')

In [3]: df.groupby('OS').sum()
Out[3]:
         A  B  C  D  E
OS
Mac      4  1  6  5  3
Ubuntu   3  4  5  6  3
Windows  3  2  6  6  3

Something like this? You can write the DataFrame to a CSV file to get the desired format.

import pandas as pd
# Read the input file; pd.read_clipboard(sep=',') also works if the data was copied to the clipboard
df0 = pd.read_csv('dummy.csv')
df = df0.copy()
df = df.groupby(by='OS').sum()
print(df)

Output:

         A  B  C  D  E
OS                    
Mac      4  1  6  5  3
Ubuntu   3  4  5  6  3
Windows  3  2  6  6  3

df.to_csv('file01')

file01

OS,A,B,C,D,E
Mac,4,1,6,5,3
Ubuntu,3,4,5,6,3
Windows,3,2,6,6,3

You get that exception because the first time through, row['OS'] does not exist in d, so 'A' does not exist in d[row['OS']]. Try the following to fix it:

import csv
from collections import defaultdict

d = defaultdict(dict)

with open('dummy.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        d[row['OS']]['A'] = d[row['OS']]['A'] + int(row['A']) if (row['OS'] in d and 'A' in d[row['OS']]) else int(row['A'])
        d[row['OS']]['B'] = d[row['OS']]['B'] + int(row['B']) if (row['OS'] in d and 'B' in d[row['OS']]) else int(row['B'])
        d[row['OS']]['C'] = d[row['OS']]['C'] + int(row['C']) if (row['OS'] in d and 'C' in d[row['OS']]) else int(row['C'])
        d[row['OS']]['D'] = d[row['OS']]['D'] + int(row['D']) if (row['OS'] in d and 'D' in d[row['OS']]) else int(row['D'])
        d[row['OS']]['E'] = d[row['OS']]['E'] + int(row['E']) if (row['OS'] in d and 'E' in d[row['OS']]) else int(row['E'])

Output:

>>> import pprint
>>>
>>> pprint.pprint(dict(d))
{'Mac': {'A': 4, 'B': 1, 'C': 6, 'D': 5, 'E': 3},
 'Ubuntu': {'A': 3, 'B': 4, 'C': 5, 'D': 6, 'E': 3},
 'Windows': {'A': 3, 'B': 2, 'C': 6, 'D': 6, 'E': 3}}
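The five near-identical lines can also be collapsed into a loop. As a sketch of the same idea (not this answer's own code), dict.get(key, 0) supplies the starting value when a column has not been seen yet for an OS. The inline sample and the names `sample`, `columns`, and `totals` are assumptions made for illustration.

```python
import csv
import io

# A few rows from the question's data, inlined so the sketch is self-contained.
sample = """OS,A,B,C,D,E
Ubuntu,0,1,0,1,1
Windows,0,0,1,1,1
Ubuntu,1,1,1,1,0
"""

totals = {}
reader = csv.DictReader(io.StringIO(sample))
columns = [name for name in reader.fieldnames if name != 'OS']

for row in reader:
    # setdefault() creates the inner dict the first time an OS is seen.
    inner = totals.setdefault(row['OS'], {})
    for col in columns:
        # get() returns 0 the first time a column is seen for this OS.
        inner[col] = inner.get(col, 0) + int(row[col])

print(totals)
```

Because the column names are taken from the header, the loop does not need to change if columns are added to the file.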

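d is a defaultdict(dict), so d[row['OS']] is a valid expression: the first time an OS is seen it yields a new, empty dict. d[row['OS']]['A'] += ... then has to read 'A' from that inner dict before adding to it, and because no default value was provided for the inner keys, the lookup raises KeyError.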
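One way to give the inner mapping a default, sketched here as an alternative and not as this answer's method, is to nest a defaultdict(int) inside the outer defaultdict. The inline sample and the names `sample`, `totals`, and `plain` are illustrative.

```python
import csv
import io
from collections import defaultdict

# A tiny stand-in for dummy.csv (illustrative data, two value columns only).
sample = """OS,A,B
Mac,1,0
Ubuntu,0,1
Mac,1,0
"""

# Outer keys create an inner defaultdict; inner keys start at int() == 0.
totals = defaultdict(lambda: defaultdict(int))

for row in csv.DictReader(io.StringIO(sample)):
    os_name = row.pop('OS')
    for column, value in row.items():
        totals[os_name][column] += int(value)

# Convert to plain dicts for display.
plain = {name: dict(inner) for name, inner in totals.items()}
print(plain)
```

With this structure the `+=` from the question works unchanged, apart from the int() conversion.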

This extends niemmi's solution to format the output the same way as the OP's example:

import csv
from collections import Counter, defaultdict

d = defaultdict(Counter)
with open('dummy.csv') as csv_file:
    reader = csv.DictReader(csv_file)
    field_names = reader.fieldnames
    for row in reader:
        counter = d[row.pop('OS')]
        for key, value in row.items():
            counter[key] += int(value)

# Python 3 syntax, matching the python3 interpreter used in the question
print(','.join(field_names))
for os_name, counter in sorted(d.items()):
    print("%s,%s" % (os_name, ','.join(str(v) for k, v in sorted(counter.items()))))

Output:

OS,A,B,C,D,E
Mac,4,1,6,5,3
Ubuntu,3,4,5,6,3
Windows,3,2,6,6,3

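Update: fixed the output.

I'm assuming your input file is named input_file.csv.

You can also use groupby from the itertools module and two dicts to process the data and get the desired output, as in the following example: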

from itertools import groupby

data = list(k.strip("\n").split(",") for k in open("input_file.csv", 'r'))

a, b = {}, {}
for k, v in groupby(data[1:], lambda x : x[0]):
    try:
        a[k] += [i[1:] for i in list(v)]
    except KeyError:
        a[k] = [i[1:] for i in list(v)]

for key in a.keys():
    for j in range(5):
        c = 0
        for i in a[key]:
            c += int(i[j])
        try:
            b[key] += ',' + str(c) 
        except KeyError:
            b[key] = str(c)

Output:

print(','.join(data[0]))
for k in b.keys():
    print("{0},{1}".format(k, b[k]))

>>> OS,A,B,C,D,E
>>> Ubuntu,3,4,5,6,3
>>> Windows,3,2,6,6,3
>>> Mac,4,1,6,5,3
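itertools.groupby only groups consecutive items that share a key, which is why the code above has to merge repeated groups with try/except. A sketch of an alternative, assuming it is acceptable to sort the rows by OS first, is shown below; the inline sample and the names `sample`, `rows`, and `result` are illustrative and not from the answer.

```python
import io
from itertools import groupby
from operator import itemgetter

# A few rows from the question's data, inlined so the sketch is self-contained.
sample = """OS,A,B,C,D,E
Ubuntu,0,1,0,1,1
Mac,1,0,1,0,0
Ubuntu,1,1,1,1,0
Mac,1,0,1,1,1
"""

lines = [line.strip().split(',') for line in io.StringIO(sample)]
header, rows = lines[0], lines[1:]

# Sorting by the OS column makes equal keys adjacent, so each OS forms one group.
rows.sort(key=itemgetter(0))

result = [','.join(header)]
for os_name, group in groupby(rows, key=itemgetter(0)):
    # zip(*...) transposes the group's rows into columns, which are then summed.
    columns = zip(*(map(int, r[1:]) for r in group))
    sums = [sum(col) for col in columns]
    result.append(','.join([os_name] + [str(s) for s in sums]))

print('\n'.join(result))
```

Sorting also makes the output order deterministic (alphabetical by OS), and the number of value columns no longer has to be hard-coded.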

