Aggregate by value and count, distinct array
Suppose I have this list of tuples:
[
('r', 'p', ['A', 'B']),
('r', 'f', ['A']),
('r', 'e', ['A']),
('r', 'p', ['A']),
('r', 'f', ['B']),
('r', 'p', ['B']),
('r', 'e', ['B']),
('r', 'c', ['A'])
]
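I need to return a list of tuples grouped by the second value of each tuple, together with a count for each group. The third value is an array, which needs to be de-duplicated and merged across the group.
So for the example above, the result would be: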
[
('r', 'p', ['A', 'B'], 4),
('r', 'f', ['A', 'B'], 2),
('r', 'e', ['A', 'B'], 2),
('r', 'c', ['A'], 1)
]
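In the result, the first value is a constant, the second is unique (it is the grouping key), the third is the array of distinct values in the group, and the fourth is the number of array values in the group, counted before de-duplication.
You can do this in pandas: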
import pandas as pd
df = pd.DataFrame([
('r', 'p', ['A', 'B']),
('r', 'f', ['A']),
('r', 'e', ['A']),
('r', 'p', ['A']),
('r', 'f', ['B']),
('r', 'p', ['B']),
('r', 'e', ['B']),
('r', 'c', ['A'])
], columns=['first','second','arr'])
pd.merge(
    df.explode('arr').groupby(['first','second']).agg(set).reset_index(),
    df[['first','second']].value_counts().reset_index(),
    on=['first','second'],
).values.tolist()
Output
[
['r', 'c', {'A'}, 1],
['r', 'e', {'B', 'A'}, 2],
['r', 'f', {'B', 'A'}, 2],
['r', 'p', {'B', 'A'}, 3]
]
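To address the edit to your question, you can do this instead:
(
    df.explode('arr')
    .value_counts()
    # Name the count column explicitly; its default name differs across pandas versions.
    .reset_index(name='count')
    .groupby(['first','second'])
    .agg({'arr': set, 'count': 'sum'})
    .reset_index()
    .values
    .tolist()
)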
Output
[
['r', 'c', {'A'}, 1],
['r', 'e', {'B', 'A'}, 2],
['r', 'f', {'B', 'A'}, 2],
['r', 'p', {'B', 'A'}, 4]
]
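The two outputs above differ only in the count for 'p': the first counts rows per group (3), while the second counts array elements per group (4). A minimal plain-Python sketch of the two counts on the same data follows; the names row_counts and element_counts are illustrative only and are not part of the pandas answer.

```python
from collections import Counter

data = [
    ('r', 'p', ['A', 'B']),
    ('r', 'f', ['A']),
    ('r', 'e', ['A']),
    ('r', 'p', ['A']),
    ('r', 'f', ['B']),
    ('r', 'p', ['B']),
    ('r', 'e', ['B']),
    ('r', 'c', ['A']),
]

# Rows per (first, second) group: what counting the unexploded rows gives.
row_counts = Counter((a, b) for a, b, _ in data)

# Array elements per group: what counting after exploding the arrays gives.
element_counts = Counter()
for a, b, arr in data:
    element_counts[(a, b)] += len(arr)

print(row_counts[('r', 'p')])      # 3
print(element_counts[('r', 'p')])  # 4
```

For every group whose arrays all hold a single element, the two counts agree, which is why only 'p' changes.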
Here is my attempt using itertools.
from itertools import groupby
data = [
('r', 'p', ['A', 'B']),
('r', 'f', ['A']),
('r', 'e', ['A']),
('r', 'p', ['A']),
('r', 'f', ['B']),
('r', 'p', ['B']),
('r', 'e', ['B']),
('r', 'c', ['A'])
]
# groupby needs sorted data
data.sort(key=lambda x: (x[0], x[1]))
result = []
for key, group in groupby(data, key=lambda x: (x[0], x[1])):
    # Make the AB list. Ex: s = ['A', 'B', 'A', 'B']
    s = [item for x in group for item in x[2]]
    # Put it all together. Ex: ('r', 'p', ['A', 'B'], 4)
    result.append(tuple(list(key) + [list(set(s))] + [len(s)]))
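Because set ordering is not guaranteed, the third field can come out as ['B', 'A'] rather than ['A', 'B']. As a sketch only, the same groupby approach can be wrapped in a function that sorts the distinct values; aggregate_sorted is a hypothetical name, and sorting assumes the array values are mutually comparable.

```python
from itertools import groupby
from operator import itemgetter


def aggregate_sorted(rows):
    """Group by the first two fields; return (first, second, distinct values, element count)."""
    key = itemgetter(0, 1)
    result = []
    # groupby only merges adjacent items, so sort by the same key first.
    for (first, second), group in groupby(sorted(rows, key=key), key=key):
        values = [item for row in group for item in row[2]]
        result.append((first, second, sorted(set(values)), len(values)))
    return result


data = [
    ('r', 'p', ['A', 'B']),
    ('r', 'f', ['A']),
    ('r', 'e', ['A']),
    ('r', 'p', ['A']),
    ('r', 'f', ['B']),
    ('r', 'p', ['B']),
    ('r', 'e', ['B']),
    ('r', 'c', ['A']),
]

print(aggregate_sorted(data))
# [('r', 'c', ['A'], 1), ('r', 'e', ['A', 'B'], 2), ('r', 'f', ['A', 'B'], 2), ('r', 'p', ['A', 'B'], 4)]
```

Note that the groups come back in sorted key order, not in the order they first appear in the input, and that the input list is left unmodified.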
I hope I've understood your question correctly:
data = [
("r", "p", ["A", "B"]),
("r", "f", ["A"]),
("r", "e", ["A"]),
("r", "p", ["A"]),
("r", "f", ["B"]),
("r", "p", ["B"]),
("r", "e", ["B"]),
("r", "c", ["A"]),
]
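# Collect the arrays of each (first, second) group.
out = {}
for a, b, c in data:
    out.setdefault((a, b), []).append(c)
# Flatten each group's arrays into distinct values and sum their lengths.
out = [
    (a, b, list(set(v for l in c for v in l)), sum(map(len, c)))
    for (a, b), c in out.items()
]
print(out)
Prints: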
[
("r", "p", ["B", "A"], 4),
("r", "f", ["B", "A"], 2),
("r", "e", ["B", "A"], 2),
("r", "c", ["A"], 1),
]
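If the distinct values should follow their order of first appearance rather than set order, one possible variant keeps an insertion-ordered dict per group. This is a sketch under the assumption of Python 3.7+, where dicts preserve insertion order; the names groups, seen and ordered are illustrative.

```python
data = [
    ("r", "p", ["A", "B"]),
    ("r", "f", ["A"]),
    ("r", "e", ["A"]),
    ("r", "p", ["A"]),
    ("r", "f", ["B"]),
    ("r", "p", ["B"]),
    ("r", "e", ["B"]),
    ("r", "c", ["A"]),
]

groups = {}
for first, second, arr in data:
    # Each group holds an insertion-ordered dict of distinct values and a running count.
    seen, count = groups.get((first, second), ({}, 0))
    seen.update(dict.fromkeys(arr))
    groups[(first, second)] = (seen, count + len(arr))

ordered = [(a, b, list(seen), count) for (a, b), (seen, count) in groups.items()]
print(ordered)
# [('r', 'p', ['A', 'B'], 4), ('r', 'f', ['A', 'B'], 2), ('r', 'e', ['A', 'B'], 2), ('r', 'c', ['A'], 1)]
```

This reproduces the expected result from the question exactly, including the order of the groups and of the values inside each array.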
convtools supports custom aggregations (I must admit, I'm the author), so here is the code:
from convtools import conversion as c
data = [
("r", "p", ["A", "B"]),
("r", "f", ["A"]),
("r", "e", ["A"]),
("r", "p", ["A"]),
("r", "f", ["B"]),
("r", "p", ["B"]),
("r", "e", ["B"]),
("r", "c", ["A"]),
]
converter = (
    c.group_by(c.item(1))
    .aggregate(
        (
            c.ReduceFuncs.First(c.item(0)),
            c.item(1),
            c.reduce(
                lambda x, y: x.union(y),
                c.item(2).as_type(set),
                initial=set,
                default=set,
            ).as_type(list),
            c.ReduceFuncs.Sum(c.item(2).len()),
        )
    )
    .gen_converter()  # generates ad-hoc python function; reuse if needed
)
The output is:
In [47]: converter(data)
Out[47]:
[('r', 'p', ['B', 'A'], 4),
('r', 'f', ['B', 'A'], 2),
('r', 'e', ['B', 'A'], 2),
('r', 'c', ['A'], 1)]