Group and Sum Multiple Columns without Pandas
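I have a list with multiple columns, and I need to group and sum the rows based on two of the columns. Can I do this without using a Pandas dataframe?
I have a dataset in a list like this: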
User Days Project
Dave 3 Red
Dave 4 Red
Dave 2 Blue
Sue 4 Red
Sue 1 Red
Sue 3 Yellow
Specifically: [['Dave', 3, 'Red'], ['Dave', 4, 'Red'], ['Dave', 2, 'Blue'], ['Sue', 4, 'Red'], ['Sue', 1, 'Red'], ['Sue', 3, 'Yellow']]
What I'd like to do is output some totals on the same row, like this:
User Days Project UserDays ProjectDaysPerUser
Dave 3 Red 9 7
Dave 4 Red 9 7
Dave 2 Blue 9 2
Sue 4 Red 8 5
Sue 1 Red 8 5
Sue 3 Yellow 8 3
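So I'm trying to group twice to get "ProjectDaysPerUser": first by user, then by project. It's this double grouping that is tripping me up.
Is there a simple way to do this without creating a Pandas dataframe?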
The script below uses groupby and appends the resulting sums to each row of the list.
from itertools import groupby

data = [['Dave', 3, 'Red'], ['Dave', 4, 'Red'], ['Dave', 2, 'Blue'], ['Sue', 4, 'Red'], ['Sue', 1, 'Red'], ['Sue', 3, 'Yellow']]
new_data, final = [], []

# groupby only merges adjacent rows with equal keys, so this relies on rows for
# the same user (and the same user/project pair) already sitting next to each other.
userDays = [[k, sum(v[1] for v in g)] for k, g in groupby(data, key=lambda x: x[0])]
projuserDays = [[k, sum(v[1] for v in g)] for k, g in groupby(data, key=lambda x: (x[0], x[2]))]

# add userDays and projectuserdays
for d in data:
    for u in userDays:
        if d[0] == u[0]:
            d.append(u[1])
            new_data.append(d)
    for p in projuserDays:
        if d[0] == p[0][0] and d[2] == p[0][1]:
            d.append(p[1])
            final.append(d)

print(final)
Result:
[['Dave', 3, 'Red', 9, 7],
['Dave', 4, 'Red', 9, 7],
['Dave', 2, 'Blue', 9, 2],
['Sue', 4, 'Red', 8, 5],
['Sue', 1, 'Red', 8, 5],
['Sue', 3, 'Yellow', 8, 3]]
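One caveat with the groupby approach: itertools.groupby only merges consecutive rows that share a key, so it depends on the list already being ordered. The following is a minimal sketch of what happens when it is not, using a shortened, reordered sample (the names shuffled and user_key are made up for illustration):

```python
from itertools import groupby

# Dave's rows are no longer adjacent in this sample.
shuffled = [['Dave', 3, 'Red'], ['Sue', 4, 'Red'], ['Dave', 4, 'Red']]

def user_key(row):
    return row[0]

# Without sorting, 'Dave' shows up as two separate groups.
unsorted_totals = [(k, sum(r[1] for r in g))
                   for k, g in groupby(shuffled, key=user_key)]

# Sorting by the same key first yields one group per user.
sorted_totals = [(k, sum(r[1] for r in g))
                 for k, g in groupby(sorted(shuffled, key=user_key), key=user_key)]

print(unsorted_totals)  # [('Dave', 3), ('Sue', 4), ('Dave', 4)]
print(sorted_totals)    # [('Dave', 7), ('Sue', 4)]
```

If the input order cannot be guaranteed, sorting first or accumulating into a dictionary avoids the split groups.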
Using a dictionary for better performance (a single pass to accumulate, with no dependence on row order):
data = [['Dave', 3, 'Red'], ['Dave', 2, 'Blue'], ['Sue', 4, 'Red'], ['Dave', 4, 'Red'], ['Sue', 1, 'Red'], ['Sue', 3, 'Yellow']]
sum_dict = {}

# First pass: accumulate totals keyed by user and by (user, project).
for d in data:
    sum_dict[d[0]] = sum_dict.get(d[0], 0) + d[1]
    sum_dict[(d[0], d[2])] = sum_dict.get((d[0], d[2]), 0) + d[1]

# Second pass: append both totals to each row and print it.
for d in data:
    d.append(sum_dict[d[0]])
    d.append(sum_dict[(d[0], d[2])])
    print(d)
Since you're computing sums, this can also be solved nicely with collections.Counter:
from collections import Counter
import pprint

data = [['Dave', 3, 'Red'], ['Dave', 4, 'Red'], ['Dave', 2, 'Blue'], ['Sue', 4, 'Red'], ['Sue', 1, 'Red'], ['Sue', 3, 'Yellow']]
user_days = Counter()
project_user_days = Counter()
for (name, num_days, project) in data:
    user_days[name] += num_days
    project_user_days[(name, project)] += num_days

derived_data = [
    [name, num_days, project, user_days[name], project_user_days[(name, project)]]
    for (name, num_days, project) in data
]

pprint.pprint(derived_data)
# [['Dave', 3, 'Red', 9, 7],
# ['Dave', 4, 'Red', 9, 7],
# ['Dave', 2, 'Blue', 9, 2],
# ['Sue', 4, 'Red', 8, 5],
# ['Sue', 1, 'Red', 8, 5],
# ['Sue', 3, 'Yellow', 8, 3]]
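To get from a list of rows like this to the column layout shown in the question, something along the following lines could work. It is only a sketch: the header names are copied from the question, while derived, table, widths and lines are illustrative names, and the two-space column gap is an arbitrary choice.

```python
# Rows as produced by any of the approaches in this thread.
derived = [
    ['Dave', 3, 'Red', 9, 7],
    ['Dave', 4, 'Red', 9, 7],
    ['Dave', 2, 'Blue', 9, 2],
    ['Sue', 4, 'Red', 8, 5],
    ['Sue', 1, 'Red', 8, 5],
    ['Sue', 3, 'Yellow', 8, 3],
]
header = ['User', 'Days', 'Project', 'UserDays', 'ProjectDaysPerUser']

# Convert every cell to text so widths can be measured.
table = [header] + [[str(cell) for cell in row] for row in derived]

# Each column is as wide as its longest cell (header included).
widths = [max(len(row[i]) for row in table) for i in range(len(header))]

# Left-justify each cell and join the columns with two spaces.
lines = ['  '.join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
         for row in table]
print('\n'.join(lines))
```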
Efficient code (sorts before grouping, so groupby also works on unordered input):
import itertools

def group_data(input1: list) -> list:
    # groupby needs sorted input, so sort by the same key that is used for grouping.
    name_dict = {k: sum(v[1] for v in g) for k, g in itertools.groupby(sorted(input1, key=lambda x: x[0]), key=lambda x: x[0])}
    name_colour_dict = {k: sum(v[1] for v in g) for k, g in itertools.groupby(sorted(input1, key=lambda x: (x[0], x[2])), key=lambda x: (x[0], x[2]))}
    # Append both totals to each row (this modifies input1 in place).
    for row in input1:
        name = row[0]
        name_colour = (row[0], row[2])
        row.append(name_dict[name])
        row.append(name_colour_dict[name_colour])
    print(input1)
    return input1  # match the declared "-> list" return type

group_data([['Dave', 3, 'Red'], ['Dave', 4, 'Red'], ['Dave', 2, 'Blue'], ['Sue', 4, 'Red'], ['Sue', 1, 'Red'], ['Sue', 3, 'Yellow']])