Sorting a list of objects by multiple attributes using generators

Question

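I have a list of objects, somewhere between a few thousand and a few tens of thousands of them. These objects can be thought of as people that I want to rank according to their score.

First, they are split into groups by age, gender and so on. At each point, a ranking is produced for that age/gender category. The fields on the object are age_group and gender. So you would first collect everyone in the 30-39 age group, then all the men (M) and all the women (W) within that age group.

Creating a new list at each point is very memory intensive, so I am trying to use generators and itertools to do the grouping against the original list. So I have a function to do this: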

import itertools
import operator


def group_standings(_standings, field):
    """ sort list of standings by a given field """
    getter = operator.attrgetter(field)
    for k, g in itertools.groupby(_standings, getter):
        yield list(g)


def calculate_positions(standings):
    """
    sort standings by age_group then gender & set position based on point value 
    """
    for age_group in group_standings(standings, 'age_group'):

        for gender_group in group_standings(age_group, 'gender'):

            set_positions(
                standings=gender_group,
                point_field='points',
                position_field='position',
            )

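For set_positions to work correctly, it needs the whole group, so that it can sort by the point_field value and then set the position_field value.

When I debug the generator, groupby is not collecting all the objects that match a key as I expected. The output looks like this: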

DEBUG generating k 30-39
DEBUG generating g [<Standing object at 0x7fc86fedbe10>, <Standing object at 0x7fc86fedbe50>, <Standing object at 0x7fc86fedbe90>]

DEBUG generating k 20-29
DEBUG generating g [<Standing object at 0x7fc86fedbed0>]

DEBUG generating k 30-39
DEBUG generating g [<Standing object at 0x7fc86fedbf10>]

DEBUG generating k 20-29
DEBUG generating g [<Standing object at 0x7fc86fedbf50>, <Standing object at 0x7fc86fedbf90>, <Standing object at 0x7fc86fedbfd0>, <Standing object at 0x7fc856ecc050>, <Standing object at 0x7fc856ecc090>, <Standing object at 0x7fc856ecc0d0>, <Standing object at 0x7fc856ecc110>, <Standing object at 0x7fc856ecc150>, <Standing object at 0x7fc856ecc190>, <Standing object at 0x7fc856ecc1d0>]

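To be clear: for set_positions to work, the list the generator provides needs to contain all objects in the 20-29 age group, but as shown above, objects in that group turn up in several separate iterations.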

Answer 1

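This happens because the groupby function assumes that the input iterable is already sorted by the key (see the documentation). It is designed that way for performance, but it can be confusing. Also, I would not convert g to a list inside group_standings; I would do that only when passing gender_group to set_positions.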
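As a rough sketch of both suggestions, the version below sorts by the grouping key before calling groupby and converts a group to a list only at the point where set_positions needs it. The Standing dataclass and the body of set_positions are hypothetical stand-ins, since the question does not show them.

```python
import itertools
import operator
from dataclasses import dataclass


@dataclass
class Standing:
    # Hypothetical stand-in for the asker's Standing class.
    age_group: str
    gender: str
    points: int
    position: int = 0


def group_standings(standings, field):
    """Yield one lazy group per distinct value of `field`."""
    getter = operator.attrgetter(field)
    # groupby only merges adjacent items, so sort by the same key first.
    for _, group in itertools.groupby(sorted(standings, key=getter), getter):
        yield group


def set_positions(standings, point_field, position_field):
    # Hypothetical implementation: rank by points, highest first.
    ranked = sorted(standings, key=operator.attrgetter(point_field), reverse=True)
    for position, standing in enumerate(ranked, start=1):
        setattr(standing, position_field, position)


def calculate_positions(standings):
    for age_group in group_standings(standings, "age_group"):
        for gender_group in group_standings(age_group, "gender"):
            # Materialise the group only here, where the whole group is needed.
            set_positions(list(gender_group), "points", "position")
```

Note that sorted builds a new list of references to the same objects; the objects themselves are not copied.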

Answer 2

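`groupby` works on adjacent elements

As @MikhailBerlinkov's answer notes, groupby only aggregates consecutive identical items, optionally using the key argument for the comparison.

An example may help: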

from itertools import groupby

L = [1, 1, 1, 2, 2, 2, 1, 1]

res = [list(j) for _, j in groupby(L)]

[[1, 1, 1], [2, 2, 2], [1, 1]]

As you can see, the group of 1 values is split into two separate lists.
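The same adjacency rule applies when a key argument is supplied. A small sketch, using made-up (age_group, gender) tuples rather than the asker's objects:

```python
from itertools import groupby

# Hypothetical records: (age_group, gender), deliberately unsorted.
records = [("30-39", "M"), ("30-39", "W"), ("20-29", "M"), ("30-39", "M")]

# Group on the first field only; runs are still formed from adjacent items.
runs = [
    (age_group, [gender for _, gender in group])
    for age_group, group in groupby(records, key=lambda r: r[0])
]

print(runs)
# [('30-39', ['M', 'W']), ('20-29', ['M']), ('30-39', ['M'])]
```

The key "30-39" appears twice in the result, which mirrors the repeated keys in the debug output from the question.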

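Sort before grouping

Instead, you can sort the list of objects before grouping. For a large list of objects, say of length n, this takes O(n log n) time. Here is an example (using the same L as before):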

L_sorted = sorted(L)

res = [list(j) for i, j in groupby(L_sorted)]

[[1, 1, 1, 1, 1], [2, 2, 2]]
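Applied to objects with several attributes, one option is to sort once on a composite key and group on that same key. The sketch below assumes objects with age_group, gender and points attributes; SimpleNamespace and the sample values are only illustrative.

```python
from itertools import groupby
from operator import attrgetter
from types import SimpleNamespace

# Hypothetical sample data standing in for the Standing objects.
rows = [("30-39", "M", 10), ("20-29", "W", 5), ("30-39", "M", 20),
        ("20-29", "M", 7), ("30-39", "W", 1)]
people = [SimpleNamespace(age_group=a, gender=g, points=p) for a, g, p in rows]

# attrgetter with two names returns an (age_group, gender) tuple.
key = attrgetter("age_group", "gender")

# list.sort works in place, so no second list of references is created.
people.sort(key=key)

grouped = {k: [p.points for p in group] for k, group in groupby(people, key=key)}

print(grouped)
# {('20-29', 'M'): [7], ('20-29', 'W'): [5], ('30-39', 'M'): [10, 20], ('30-39', 'W'): [1]}
```

Sorting in place changes the order of the original list, so this only fits if that order does not need to be preserved.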


Answer 1: 4, accepted, 2018-11-17 15:13:35

Answer 2: 1, 2018-11-17 17:19:09
