
Aggregate by value and count, distinct array

Let's say I have this list of tuples:

[
('r', 'p', ['A', 'B']),
('r', 'f', ['A']),
('r', 'e', ['A']),
('r', 'p', ['A']),
('r', 'f', ['B']),
('r', 'p', ['B']),
('r', 'e', ['B']),
('r', 'c', ['A'])
]

I need to return a list of tuples grouped by the second value of each tuple, together with a count for each group. The third value, which is a list, needs to be merged across the group and de-duplicated.

So for the example above, the result will be:

[
('r', 'p', ['A', 'B'], 4),
('r', 'f', ['A', 'B'], 2),
('r', 'e', ['A', 'B'], 2),
('r', 'c', ['A'], 1)
]

In the result, the first value is a constant, the second is unique (it is the group key), the third is the distinct union of the grouped lists, and the fourth is the total number of list values in the group, counted before de-duplication.

You could do this in pandas:

import pandas as pd

df = pd.DataFrame([
('r', 'p', ['A', 'B']),
('r', 'f', ['A']),
('r', 'e', ['A']),
('r', 'p', ['A']),
('r', 'f', ['B']),
('r', 'p', ['B']),
('r', 'e', ['B']),
('r', 'c', ['A'])
], columns=['first','second','arr'])

pd.merge(df.explode('arr').groupby(['first','second']).agg(set).reset_index(),
         df[['first','second']].value_counts().reset_index(),
         on=['first','second']).values.tolist()

Output:

[
    ['r', 'c', {'A'}, 1],
    ['r', 'e', {'B', 'A'}, 2],
    ['r', 'f', {'B', 'A'}, 2],
    ['r', 'p', {'B', 'A'}, 3]
]

To address your edit, you could do this:

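(
  df.explode('arr')
    .value_counts()
    # name the count column explicitly: it is 0 in pandas 1.x but 'count' in pandas 2.x
    .reset_index(name='count')
    .groupby(['first','second'])
    .agg({'arr': set, 'count': 'sum'})
    .reset_index()
    .values
    .tolist()
)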

Output:

[
   ['r', 'c', {'A'}, 1],
   ['r', 'e', {'B', 'A'}, 2],
   ['r', 'f', {'B', 'A'}, 2],
   ['r', 'p', {'B', 'A'}, 4]
]
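
The two outputs above differ only for 'p' (3 versus 4). As far as I can tell, the first version counts rows per group while the second counts list elements per group. Here is a minimal plain-Python sketch of that difference; the names rows_per_group and elements_per_group are mine, not part of either answer:

```python
from collections import Counter

data = [
    ("r", "p", ["A", "B"]),
    ("r", "f", ["A"]),
    ("r", "e", ["A"]),
    ("r", "p", ["A"]),
    ("r", "f", ["B"]),
    ("r", "p", ["B"]),
    ("r", "e", ["B"]),
    ("r", "c", ["A"]),
]

# Rows per group: one count per tuple, regardless of list length.
rows_per_group = Counter((first, second) for first, second, _ in data)

# List elements per group: each tuple contributes len(arr).
elements_per_group = Counter()
for first, second, arr in data:
    elements_per_group[(first, second)] += len(arr)

print(rows_per_group[("r", "p")])      # 3
print(elements_per_group[("r", "p")])  # 4
```

The ('r', 'p', ['A', 'B']) row is a single row but holds two elements, which accounts for the gap.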

Here's my attempt using itertools:

from itertools import groupby

data = [
('r', 'p', ['A', 'B']),
('r', 'f', ['A']),
('r', 'e', ['A']),
('r', 'p', ['A']),
('r', 'f', ['B']),
('r', 'p', ['B']),
('r', 'e', ['B']),
('r', 'c', ['A'])
]

# groupby needs sorted data
data.sort(key=lambda x: (x[0], x[1]))
result = []
for key,group in groupby(data, key=lambda x: (x[0], x[1])):
    # Make the AB list. Ex: s = ['A', 'B', 'A', 'B']
    s = [item for x in group for item in x[2]]
    # Put it all together. Ex: ('r', 'p', ['A', 'B'], 4)
    result.append(tuple(list(key) + [list(set(s))] + [len(s)]))
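
If it helps, here is a sketch of the same groupby idea wrapped in a function. It is a variation, not the code above: it sorts a copy instead of sorting data in place, and it sorts the distinct values so the output is deterministic. The name aggregate_sorted is just something I made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def aggregate_sorted(rows):
    """Group rows by (first, second); return sorted distinct values and element count."""
    key = itemgetter(0, 1)
    result = []
    # sorted() returns a new list, so the caller's data is left untouched.
    for (first, second), group in groupby(sorted(rows, key=key), key=key):
        # Flatten every list in the group, e.g. ['A', 'B', 'A', 'B'] for 'p'.
        values = [item for row in group for item in row[2]]
        result.append((first, second, sorted(set(values)), len(values)))
    return result


data = [
    ("r", "p", ["A", "B"]),
    ("r", "f", ["A"]),
    ("r", "e", ["A"]),
    ("r", "p", ["A"]),
    ("r", "f", ["B"]),
    ("r", "p", ["B"]),
    ("r", "e", ["B"]),
    ("r", "c", ["A"]),
]

result = aggregate_sorted(data)
print(result)
# [('r', 'c', ['A'], 1), ('r', 'e', ['A', 'B'], 2), ('r', 'f', ['A', 'B'], 2), ('r', 'p', ['A', 'B'], 4)]
```

Note that the groups come back in sorted key order (c, e, f, p), not in the order they first appear in the input.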

I hope I've understood your question well:

data = [
    ("r", "p", ["A", "B"]),
    ("r", "f", ["A"]),
    ("r", "e", ["A"]),
    ("r", "p", ["A"]),
    ("r", "f", ["B"]),
    ("r", "p", ["B"]),
    ("r", "e", ["B"]),
    ("r", "c", ["A"]),
]

out = {}
for a, b, c in data:
    out.setdefault((a, b), []).append(c)

out = [
    (a, b, list(set(v for l in c for v in l)), sum(map(len, c)))
    for (a, b), c in out.items()
]

print(out)

Prints:

[
    ("r", "p", ["B", "A"], 4),
    ("r", "f", ["B", "A"], 2),
    ("r", "e", ["B", "A"], 2),
    ("r", "c", ["A"], 1),
]
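
The distinct values print as ["B", "A"] here because set ordering is arbitrary. If the order matters, one option is to use a dict as an ordered set, since dict keys keep insertion order in Python 3.7+. This is only a sketch under that assumption, and aggregate_in_order is a hypothetical name:

```python
def aggregate_in_order(rows):
    """Group by (first, second), keeping first-seen order of groups and values."""
    distinct = {}  # (first, second) -> dict used as an insertion-ordered set
    counts = {}    # (first, second) -> total number of list elements
    for first, second, values in rows:
        key = (first, second)
        distinct.setdefault(key, {}).update(dict.fromkeys(values))
        counts[key] = counts.get(key, 0) + len(values)
    return [(a, b, list(distinct[(a, b)]), counts[(a, b)]) for a, b in distinct]


data = [
    ("r", "p", ["A", "B"]),
    ("r", "f", ["A"]),
    ("r", "e", ["A"]),
    ("r", "p", ["A"]),
    ("r", "f", ["B"]),
    ("r", "p", ["B"]),
    ("r", "e", ["B"]),
    ("r", "c", ["A"]),
]

out_ordered = aggregate_in_order(data)
print(out_ordered)
# [('r', 'p', ['A', 'B'], 4), ('r', 'f', ['A', 'B'], 2), ('r', 'e', ['A', 'B'], 2), ('r', 'c', ['A'], 1)]
```

With the sample data this matches the expected result in the question, including the order of the groups.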

convtools supports custom aggregations (I must confess, I'm the author), so here's the code:

from convtools import conversion as c

data = [
    ("r", "p", ["A", "B"]),
    ("r", "f", ["A"]),
    ("r", "e", ["A"]),
    ("r", "p", ["A"]),
    ("r", "f", ["B"]),
    ("r", "p", ["B"]),
    ("r", "e", ["B"]),
    ("r", "c", ["A"]),
]

converter = (
    c.group_by(c.item(1))
    .aggregate(
        (
            c.ReduceFuncs.First(c.item(0)),
            c.item(1),
            c.reduce(
                lambda x, y: x.union(y),
                c.item(2).as_type(set),
                initial=set,
                default=set,
            ).as_type(list),
            c.ReduceFuncs.Sum(c.item(2).len()),
        )
    )
    .gen_converter()  # generates ad-hoc python function; reuse if needed
)

The output is:

In [47]: converter(data)
Out[47]:
[('r', 'p', ['B', 'A'], 4),
 ('r', 'f', ['B', 'A'], 2),
 ('r', 'e', ['B', 'A'], 2),
 ('r', 'c', ['A'], 1)]
