
Aggregate by value and count, distinct array

Suppose I have this list of tuples:

[
('r', 'p', ['A', 'B']),
('r', 'f', ['A']),
('r', 'e', ['A']),
('r', 'p', ['A']),
('r', 'f', ['B']),
('r', 'p', ['B']),
('r', 'e', ['B']),
('r', 'c', ['A'])
]

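I need to return a list of tuples grouped by the second value of each tuple, along with a count for each group. The third value is an array, which needs to be de-duplicated and merged within each group.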

So for the example above, the result would be:

[
('r', 'p', ['A', 'B'], 4),
('r', 'f', ['A', 'B'], 2),
('r', 'e', ['A', 'B'], 2),
('r', 'c', ['A'], 1)
]

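In the result, the first value is a constant, the second is unique (it is the group key), the third is the distinct array merged across the group, and the fourth is the total number of array values in the group.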

You can do this in pandas:

import pandas as pd

df = pd.DataFrame([
('r', 'p', ['A', 'B']),
('r', 'f', ['A']),
('r', 'e', ['A']),
('r', 'p', ['A']),
('r', 'f', ['B']),
('r', 'p', ['B']),
('r', 'e', ['B']),
('r', 'c', ['A'])
], columns=['first','second','arr'])

pd.merge(df.explode('arr').groupby(['first','second']).agg(set).reset_index(),
         df[['first','second']].value_counts().reset_index(),
         on=['first','second']).values.tolist()

Output

[
    ['r', 'c', {'A'}, 1],
    ['r', 'e', {'B', 'A'}, 2],
    ['r', 'f', {'B', 'A'}, 2],
    ['r', 'p', {'B', 'A'}, 3]
]
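
Note that the 3 reported for 'p' here is the number of rows in that group, whereas the question asks for the number of array values (4). A minimal plain-Python sketch of the difference, using the same data; the names `rows_per_key` and `elements_per_key` are only illustrative and appear nowhere in the original post:

```python
from collections import Counter

data = [
    ('r', 'p', ['A', 'B']),
    ('r', 'f', ['A']),
    ('r', 'e', ['A']),
    ('r', 'p', ['A']),
    ('r', 'f', ['B']),
    ('r', 'p', ['B']),
    ('r', 'e', ['B']),
    ('r', 'c', ['A']),
]

rows_per_key = Counter()      # how many tuples share each second value
elements_per_key = Counter()  # how many array values those tuples hold in total

for _, key, arr in data:
    rows_per_key[key] += 1
    elements_per_key[key] += len(arr)

print(rows_per_key['p'], elements_per_key['p'])  # 3 4
```

The two counts only differ for groups where some row carries more than one array value, which is why 'f', 'e' and 'c' look the same either way.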

To address your edit, you can do this instead:

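(
  df.explode('arr')
    .value_counts()
    # name the count column explicitly: pandas 1.x calls it 0, pandas 2.x calls it 'count'
    .reset_index(name='n')
    .groupby(['first','second'])
    .agg({'arr': set, 'n': 'sum'})
    .reset_index()
    .values
    .tolist()
)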

Output

[
   ['r', 'c', {'A'}, 1],
   ['r', 'e', {'B', 'A'}, 2],
   ['r', 'f', {'B', 'A'}, 2],
   ['r', 'p', {'B', 'A'}, 4]
]

Here is my attempt using itertools.

from itertools import groupby

data = [
('r', 'p', ['A', 'B']),
('r', 'f', ['A']),
('r', 'e', ['A']),
('r', 'p', ['A']),
('r', 'f', ['B']),
('r', 'p', ['B']),
('r', 'e', ['B']),
('r', 'c', ['A'])
]

# groupby needs sorted data
data.sort(key=lambda x: (x[0], x[1]))
result = []
for key,group in groupby(data, key=lambda x: (x[0], x[1])):
    # Make the AB list. Ex: s = ['A', 'B', 'A', 'B']
    s = [item for x in group for item in x[2]]
    # Put it all together. Ex: ('r', 'p', ['A', 'B'], 4)
    result.append(tuple(list(key) + [list(set(s))] + [len(s)]))
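
The loop above fills `result` but never displays it, and `list(set(s))` can come back in any order. As a hedged sketch of the same idea, the variant below wraps the logic in a function (the name `aggregate` is my own, not from the answer), sorts a copy instead of mutating the input, and sorts the distinct values so the output is reproducible:

```python
from itertools import groupby
from operator import itemgetter


def aggregate(rows):
    """Group rows by (first, second); return distinct array values and their total count."""
    key = itemgetter(0, 1)
    result = []
    # groupby only merges adjacent items, so sort by the same key first
    for k, group in groupby(sorted(rows, key=key), key=key):
        # flatten every array in the group, e.g. ['A', 'B', 'A', 'B']
        items = [item for row in group for item in row[2]]
        result.append((*k, sorted(set(items)), len(items)))
    return result


data = [
    ('r', 'p', ['A', 'B']),
    ('r', 'f', ['A']),
    ('r', 'e', ['A']),
    ('r', 'p', ['A']),
    ('r', 'f', ['B']),
    ('r', 'p', ['B']),
    ('r', 'e', ['B']),
    ('r', 'c', ['A']),
]

print(aggregate(data))
# [('r', 'c', ['A'], 1), ('r', 'e', ['A', 'B'], 2), ('r', 'f', ['A', 'B'], 2), ('r', 'p', ['A', 'B'], 4)]
```

Because of the sort, groups come out in key order ('c', 'e', 'f', 'p') rather than in order of first appearance.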

I hope I've understood your question correctly:

data = [
    ("r", "p", ["A", "B"]),
    ("r", "f", ["A"]),
    ("r", "e", ["A"]),
    ("r", "p", ["A"]),
    ("r", "f", ["B"]),
    ("r", "p", ["B"]),
    ("r", "e", ["B"]),
    ("r", "c", ["A"]),
]

out = {}
for a, b, c in data:
    out.setdefault((a, b), []).append(c)

out = [
    (a, b, list(set(v for l in c for v in l)), sum(map(len, c)))
    for (a, b), c in out.items()
]

print(out)

Prints:

[
    ("r", "p", ["B", "A"], 4),
    ("r", "f", ["B", "A"], 2),
    ("r", "e", ["B", "A"], 2),
    ("r", "c", ["A"], 1),
]
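
The distinct values print as ["B", "A"] here only because set ordering is arbitrary; it may differ between runs. If the first-seen order from the question (['A', 'B']) matters, one possible variation is to de-duplicate with `dict.fromkeys`, which keeps insertion order on Python 3.7+. This is a sketch of that swap, not part of the answer above, and `groups` is just an illustrative name:

```python
data = [
    ("r", "p", ["A", "B"]),
    ("r", "f", ["A"]),
    ("r", "e", ["A"]),
    ("r", "p", ["A"]),
    ("r", "f", ["B"]),
    ("r", "p", ["B"]),
    ("r", "e", ["B"]),
    ("r", "c", ["A"]),
]

# collect every array value per (first, second) key, flattened
groups = {}
for a, b, arr in data:
    groups.setdefault((a, b), []).extend(arr)

out = [
    # dict.fromkeys drops duplicates while keeping first-seen order
    (a, b, list(dict.fromkeys(items)), len(items))
    for (a, b), items in groups.items()
]

print(out)
# [('r', 'p', ['A', 'B'], 4), ('r', 'f', ['A', 'B'], 2), ('r', 'e', ['A', 'B'], 2), ('r', 'c', ['A'], 1)]
```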

convtools supports custom aggregations (I must admit, I'm the author), so here is the code:

from convtools import conversion as c

data = [
    ("r", "p", ["A", "B"]),
    ("r", "f", ["A"]),
    ("r", "e", ["A"]),
    ("r", "p", ["A"]),
    ("r", "f", ["B"]),
    ("r", "p", ["B"]),
    ("r", "e", ["B"]),
    ("r", "c", ["A"]),
]

converter = (
    c.group_by(c.item(1))
    .aggregate(
        (
            c.ReduceFuncs.First(c.item(0)),
            c.item(1),
            c.reduce(
                lambda x, y: x.union(y),
                c.item(2).as_type(set),
                initial=set,
                default=set,
            ).as_type(list),
            c.ReduceFuncs.Sum(c.item(2).len()),
        )
    )
    .gen_converter()  # generates ad-hoc python function; reuse if needed
)

The output is:

In [47]: converter(data)
Out[47]:
[('r', 'p', ['B', 'A'], 4),
 ('r', 'f', ['B', 'A'], 2),
 ('r', 'e', ['B', 'A'], 2),
 ('r', 'c', ['A'], 1)]

