Split a list of dictionaries into separate lists, primarily by list size but secondarily by a condition

Question

I currently have a list of dictionaries that looks like this:

total_list = [
    {'email': 'usera@email.com', 'id': 1, 'country': 'UK'},
    {'email': 'usera@email.com', 'id': 1, 'country': 'Germany'}, 
    {'email': 'userb@email.com', 'id': 2, 'country': 'UK'},
    {'email': 'userc@email.com', 'id': 3, 'country': 'Italy'},
    {'email': 'userc@email.com', 'id': 3, 'country': 'Netherland'},
    {'email': 'userd@email.com', 'id': 4, 'country': 'France'},
    ...
]

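I want to split it primarily by size, say 3 items per new list, but I also want to make sure that all entries for the same user end up in the same new sublist.

So the result I am trying to create is: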

list_a = [
    {'email': 'usera@email.com', 'id': 1, 'country': 'UK'},
    {'email': 'userb@email.com', 'id': 2, 'country': 'UK'},
    {'email': 'usera@email.com', 'id': 1, 'country': 'Germany'}
]
  
list_b = [
    {'email': 'userc@email.com', 'id': 3, 'country': 'Italy'},
    {'email': 'userd@email.com', 'id': 4, 'country': 'France'},
    {'email': 'userc@email.com', 'id': 3, 'country': 'Netherland'},
    ...
]

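Obviously, in the example I provided the entries for each user sit very close together in the list, but in reality they can be spread out much more. I was considering sorting the list by email and then splitting it, but I am not sure what happens if items that are supposed to stay together fall exactly where the main list gets split.

What I have tried so far is: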

from math import ceil


def list_splitter(main_list, size):
    # yield consecutive slices of at most `size` items
    for i in range(0, len(main_list), size):
        yield main_list[i:i + size]

# calculating the needed number of sublists
max_per_batch = 3
number_of_sublists = ceil(len(total_list) / max_per_batch)

# sort the data by email
total_list.sort(key=lambda x: x['email'])

sublists = list(list_splitter(main_list=total_list, size=max_per_batch))

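The problem is that with this logic I cannot be 100% sure that items with the same email value end up in the same sublist. Because of the sorting it may work out, but it is not guaranteed.

Basically, I need a way to make sure that items with the same email always land in the same sublist, while the primary criterion for splitting remains the sublist size.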
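To make the concern concrete, here is a small sketch showing that sorting by email and then slicing into fixed-size chunks can still split one user across two sublists. The `chunk` helper mirrors `list_splitter` from the question; the `sample` data is made up for illustration and is not the data above.

```python
def chunk(items, size):
    """Yield consecutive slices of at most `size` items (same idea as list_splitter)."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


# Hypothetical sample data: both users have two entries each.
sample = [
    {'email': 'a@email.com', 'country': 'UK'},
    {'email': 'b@email.com', 'country': 'UK'},
    {'email': 'a@email.com', 'country': 'Germany'},
    {'email': 'b@email.com', 'country': 'Italy'},
]

sample_sorted = sorted(sample, key=lambda x: x['email'])
sublists = list(chunk(sample_sorted, 3))

# Record in which sublists each email shows up
seen_in = {}
for index, sublist in enumerate(sublists):
    for entry in sublist:
        seen_in.setdefault(entry['email'], set()).add(index)

split_emails = sorted(email for email, idxs in seen_in.items() if len(idxs) > 1)
print(split_emails)
# -> ['b@email.com']
```

After sorting, the order is a, a, b, b, so the cut after the third item lands in the middle of user b's entries.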

Answer 1

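This solution first works only on the list of all emails. It then groups the emails based on their frequency and the group size limit. Afterwards the remaining data, i.e. id and country, is joined back onto the email groups.

The first function, create_groups, works on the list of emails. It counts the occurrences of each email and groups them. Each new group starts with the most frequent email that has not been grouped yet. If there is room left in the group, it looks for the most frequent email that still fits. If such an item exists, it is added to the group.

This repeats until the group is full or nothing else fits; then a new group is started.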

from operator import itemgetter
from itertools import groupby, chain
from collections import Counter


def create_groups(items, group_size_limit):
    # Count the frequency of all items and create a list of items 
    # sorted by descending frequency
    items_not_grouped = Counter(items).most_common()
    groups = []

    while items_not_grouped:
        # Start a new group with the most frequent ungrouped item
        item, count = items_not_grouped.pop(0)
        group, group_size = [item], count
        while group_size < group_size_limit:
            # If there is room left in the group, look for a new group member
            for index, (candidate, candidate_count) \
                    in enumerate(items_not_grouped):
                if candidate_count <= group_size_limit - group_size:
                    # If the candidate fits, add it to the group
                    group.append(candidate)
                    group_size += candidate_count
                    # ... and remove it from the items not grouped
                    items_not_grouped.pop(index)
                    break
            else:
                # If the for loop did not break, no items fit in the group
                break

        groups.append(group)

    return groups

Here is the result of using the function on your example:

users = [
    {'email': 'usera@email.com', 'id': 1, 'country': 'UK',},
    {'email': 'userb@email.com', 'id': 2, 'country': 'UK'},
    {'email': 'usera@email.com', 'id': 1, 'country': 'Germany'},
    {'email': 'userc@email.com', 'id': 3, 'country': 'Italy'},
    {'email': 'userd@email.com', 'id': 4, 'country': 'France'},
    {'email': 'userc@email.com', 'id': 3, 'country': 'Netherland'}
]

emails = [user["email"] for user in users]
email_groups = create_groups(emails, 3)
# -> [
#   ['usera@email.com', 'userb@email.com'], 
#   ['userc@email.com', 'userd@email.com']
# ]

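Finally, once the groups are created, the function join_data_on_groups groups the original user dictionaries. It takes the email groups from before and the user dictionaries grouped by email (an iterable of (email, entries) pairs) as arguments: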

def join_data_on_groups(groups, item_to_data):
    item_to_data = {item: list(data) for item, data in item_to_data}

    groups = [(item_to_data[item] for item in group) for group in groups]
    groups = [list(chain(*group)) for group in groups]

    return groups


email_getter = itemgetter("email")
users_grouped_by_email = groupby(sorted(users, key=email_getter), email_getter)

user_groups = join_data_on_groups(email_groups, users_grouped_by_email)

print(user_groups)

Result:

[
  [
    {'email': 'usera@email.com', 'id': 1, 'country': 'UK'},
    {'email': 'usera@email.com', 'id': 1, 'country': 'Germany'}, 
    {'email': 'userb@email.com', 'id': 2, 'country': 'UK'}
  ],
  [
    {'email': 'userc@email.com', 'id': 3, 'country': 'Italy'},
    {'email': 'userc@email.com', 'id': 3, 'country': 'Netherland'},
    {'email': 'userd@email.com', 'id': 4, 'country': 'France'}
  ]
]

Answer 2

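Rather than keeping the dictionaries in a list, I would consider using a queue or FIFO type and popping elements as they are consumed. But working with what you have, you can either create a new sorted list first and then do (roughly) what you are already doing, or use the approach below. There are many ways to organize the data, and your constraints are somewhat open (do you want a variable name for each output object? I will ignore that part):

1. Create a dictionary D of type str:list, where the keys are user emails and each value is the list of all dictionary entries from total_list for that email, initially empty ([]). With a lot of data a queue or generator would be better, but the point is to filter and format your input.
2. Parse total_list into D: each time you hit the same user email, append that dictionary to the list stored under that key. total_list can then be deleted.
3. Now parse D and build the output lists (or a generator) of dictionaries, each limited to 3 dictionaries. This can be a generator similar to the one you have now.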
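The steps above could be sketched roughly as follows. This is only one possible reading of the answer: the function name `split_keeping_users_together`, the first-fit packing order, and the choice to give an oversized user a sublist of their own are all assumptions, not something the answer specifies.

```python
from collections import defaultdict


def split_keeping_users_together(entries, max_size, key='email'):
    """Bucket entries by `key`, then pack whole buckets into sublists.

    Hypothetical helper. A bucket larger than `max_size` gets a sublist
    of its own, which then exceeds the limit.
    """
    # Steps 1 and 2: D maps each email to the list of its entries
    buckets = defaultdict(list)
    for entry in entries:
        buckets[entry[key]].append(entry)

    # Step 3: first-fit packing, largest buckets first
    sublists = []
    for bucket in sorted(buckets.values(), key=len, reverse=True):
        for sublist in sublists:
            if len(sublist) + len(bucket) <= max_size:
                sublist.extend(bucket)
                break
        else:
            # No existing sublist has room, so start a new one
            sublists.append(list(bucket))
    return sublists


users = [
    {'email': 'usera@email.com', 'id': 1, 'country': 'UK'},
    {'email': 'userb@email.com', 'id': 2, 'country': 'UK'},
    {'email': 'usera@email.com', 'id': 1, 'country': 'Germany'},
    {'email': 'userc@email.com', 'id': 3, 'country': 'Italy'},
    {'email': 'userd@email.com', 'id': 4, 'country': 'France'},
    {'email': 'userc@email.com', 'id': 3, 'country': 'Netherland'},
]

result = split_keeping_users_together(users, 3)
print([[entry['email'] for entry in sublist] for sublist in result])
```

Packing whole buckets rather than single entries is what keeps a user's entries together; the size limit is then only a constraint on which buckets may share a sublist.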

Answer 3

General solution (explanation below):

import pandas as pd
import numpy as np
from numberpartitioning import karmarkar_karp

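def solution(data, groupby: str, partition_size: int):
    df = pd.DataFrame(data)
    # Number of rows per distinct value of the groupby column
    groups = df.groupby([groupby]).count()
    groupby_counts = groups.iloc[:, 0].values
    # Round up so that a remainder gets its own partition (at least one partition)
    num_parts = max(1, -(-len(df) // partition_size))
    result = karmarkar_karp(groupby_counts, num_parts=num_parts, return_indices=True)
    # Index per partition so that partitions of unequal length also work
    part_keys = [groups.index.values[indices] for indices in result.partition]
    partitions = [df.loc[df[groupby].isin(key)].to_dict('records') for key in part_keys]
    return partitions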


solution(total_list, groupby="email", partition_size=3)

This gives a valid solution (although slightly different from your example solution):

[[{'country': 'UK', 'email': 'userb@email.com', 'id': 2},
  {'country': 'Italy', 'email': 'userc@email.com', 'id': 3},
  {'country': 'Netherland', 'email': 'userc@email.com', 'id': 3}],
 [{'country': 'UK', 'email': 'usera@email.com', 'id': 1},
  {'country': 'Germany', 'email': 'usera@email.com', 'id': 1},
  {'country': 'France', 'email': 'userd@email.com', 'id': 4}]]

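Explanation

We can use a partitioning algorithm such as the Karmarkar-Karp algorithm. It splits a set of numbers into k partitions so that the partition sums are as close to each other as possible. Note that it balances the sums; it does not enforce a hard upper limit on the size of a partition. A pure-Python implementation already exists: numberpartitioning. Install it with python3 -m pip install numberpartitioning.

The algorithm only works on numbers, but we can encode each email group simply by the number of entries it has. Let's use a DataFrame to hold your data: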

>>> df = pd.DataFrame(total_list)

Then find the counts grouped by email:

>>> email_counts = df.groupby(["email"])["id"].count().rename("count")

For example, the group counts for total_list:

>>> email_counts
email
usera@email.com    2
userb@email.com    1
userc@email.com    2
userd@email.com    1
Name: count, dtype: int64

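In your example we want 3 entries per partition (so partition_size=3), which means the number of partitions is num_parts = len(total_list) / partition_size = 2.

If we then run karmarkar_karp([2, 1, 2, 1], num_parts=2), we get the partition [[2, 1], [2, 1]] with partition sizes [3, 3].

But we do not care about the counts themselves; we care about which email is associated with each count. So we simply return the indices instead: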

>>> result = karmarkar_karp(email_counts.values, num_parts=2, return_indices=True)
>>> result
PartitioningResult(partition=[[2, 1], [0, 3]], sizes=[3, 3])

Based on the indices, the grouping is:

partition 1: indices [2, 1] -> [userc, userb]
partition 2: indices [0, 3] -> [usera, userd]

This is slightly different from what you wrote, but it is still a valid solution.
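For intuition about partitioning counts while tracking indices, here is a dependency-free sketch. It is explicitly not Karmarkar-Karp and not the numberpartitioning API: it uses a simpler greedy heuristic (largest count first, always into the currently lightest part), and the name `greedy_partition` is made up for illustration.

```python
def greedy_partition(counts, num_parts):
    """Assign each index to the currently lightest part, largest counts first.

    Not Karmarkar-Karp: a simpler greedy heuristic used only for illustration.
    """
    parts = [[] for _ in range(num_parts)]
    sums = [0] * num_parts
    # Indices ordered by descending count (ties keep their original order)
    order = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
    for i in order:
        lightest = sums.index(min(sums))
        parts[lightest].append(i)
        sums[lightest] += counts[i]
    return parts, sums


emails = ['usera', 'userb', 'userc', 'userd']
counts = [2, 1, 2, 1]

parts, sums = greedy_partition(counts, 2)
email_parts = [[emails[i] for i in part] for part in parts]
print(parts, sums)
# -> [[0, 1], [2, 3]] [3, 3]
print(email_parts)
# -> [['usera', 'userb'], ['userc', 'userd']]
```

On this input the greedy heuristic also reaches sums of [3, 3], just with a different assignment of emails to partitions. On harder inputs it will generally balance less well than Karmarkar-Karp.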

We find the email partitions by running:

>>> email_partitions = email_counts.index.values[np.array(result.partition)]

Given the email partitions, we now only need to split the entries of total_list according to the partition each one belongs to.

>>> partitions = [df.loc[df["email"].isin(emails)].to_dict('records') for emails in email_partitions]

Printing partitions then gives:

>>> partitions
[[{'email': 'userb@email.com', 'id': 2, 'country': 'UK'},
  {'email': 'userc@email.com', 'id': 3, 'country': 'Italy'},
  {'email': 'userc@email.com', 'id': 3, 'country': 'Netherland'}],
 [{'email': 'usera@email.com', 'id': 1, 'country': 'UK'},
  {'email': 'usera@email.com', 'id': 1, 'country': 'Germany'},
  {'email': 'userd@email.com', 'id': 4, 'country': 'France'}]]
