
Split list of dictionaries in separate lists based primarily on list size but secondarily based on condition

I currently have a list of dictionaries that looks like this:

total_list = [
    {'email': 'usera@email.com', 'id': 1, 'country': 'UK'},
    {'email': 'usera@email.com', 'id': 1, 'country': 'Germany'}, 
    {'email': 'userb@email.com', 'id': 2, 'country': 'UK'},
    {'email': 'userc@email.com', 'id': 3, 'country': 'Italy'},
    {'email': 'userc@email.com', 'id': 3, 'country': 'Netherland'},
    {'email': 'userd@email.com', 'id': 4, 'country': 'France'},
    ...
]

I want to split it primarily based on size, so let's say the new sublist size is 3 items per list, but I also want to make sure that all entries for the same user end up in the same new sublist.

So the result I am trying to create is:

list_a = [
    {'email': 'usera@email.com', 'id': 1, 'country': 'UK'},
    {'email': 'userb@email.com', 'id': 2, 'country': 'UK'},
    {'email': 'usera@email.com', 'id': 1, 'country': 'Germany'}
]
  
list_b = [
    {'email': 'userc@email.com', 'id': 3, 'country': 'Italy'},
    {'email': 'userd@email.com', 'id': 4, 'country': 'France'},
    {'email': 'userc@email.com', 'id': 3, 'country': 'Netherland'},
    ...
]

Obviously, in the example I provided the entries for each user sit close to each other in the list, but in reality they could be spread much further apart. I was considering sorting the list by email and then splitting it, but I am not sure what happens if items that are supposed to be grouped together happen to sit exactly where the main list gets divided.

What I have tried so far is:

def list_splitter(main_list, size):
    for i in range(0, len(main_list), size):
        yield main_list[i:i + size]

from math import ceil

# calculating the needed number of sublists
max_per_batch = 3
number_of_sublists = ceil(len(total_list) / max_per_batch)

# sort the data by email
total_list.sort(key=lambda x: x['email'])

sublists = list(list_splitter(main_list=total_list, size=max_per_batch))

The issue is that with this logic I cannot 100% ensure that items with the same email value end up in the same sublist. Because of the sorting this will usually happen, but it is not guaranteed.
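For instance, a quick check with a small hypothetical list (three entries sharing userb's email) shows the sorted-then-chunked approach splitting one user across two sublists:

demo = sorted(
    [{'email': f'user{c}@email.com'} for c in 'aabbbcd'],
    key=lambda x: x['email'],
)
chunks = list(list_splitter(main_list=demo, size=3))
# chunks[0] holds usera, usera and the first userb entry, but
# userb's other two entries land in chunks[1]: the group is cut
# exactly at the chunk boundary.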

Basically, I need a method to make sure that items with the same email will always be in the same sublist, but the main condition of the split is the sublist size.

This solution starts off by working only with the list of all emails. The emails are grouped based on their frequency and the limit on group size. Later the remaining data, i.e. id and country, are joined back onto the email groups.

The first function, create_groups, works on the list of emails. It counts the number of occurrences of each email and groups them. Each new group starts with the most frequent email. If there is room left in the group, it looks for the most frequent email that still fits. If such an item exists, it is added to the group.

This is repeated until the group is full; then a new group is started.

from operator import itemgetter
from itertools import groupby, chain
from collections import Counter


def create_groups(items, group_size_limit):
    # Count the frequency of all items and create a list of items 
    # sorted by descending frequency
    items_not_grouped = Counter(items).most_common()
    groups = []

    while items_not_grouped:
        # Start a new group with the most frequent ungrouped item
        item, count = items_not_grouped.pop(0)
        group, group_size = [item], count
        while group_size < group_size_limit:
            # If there is room left in the group, look for a new group member
            for index, (candidate, candidate_count) \
                    in enumerate(items_not_grouped):
                if candidate_count <= group_size_limit - group_size:
                    # If the candidate fits, add it to the group
                    group.append(candidate)
                    group_size += candidate_count
                    # ... and remove it from the items not grouped
                    items_not_grouped.pop(index)
                    break
            else:
                # If the for loop did not break, no items fit in the group
                break

        groups.append(group)

    return groups

This is the result of using that function on your example:

users = [
    {'email': 'usera@email.com', 'id': 1, 'country': 'UK'},
    {'email': 'userb@email.com', 'id': 2, 'country': 'UK'},
    {'email': 'usera@email.com', 'id': 1, 'country': 'Germany'},
    {'email': 'userc@email.com', 'id': 3, 'country': 'Italy'},
    {'email': 'userd@email.com', 'id': 4, 'country': 'France'},
    {'email': 'userc@email.com', 'id': 3, 'country': 'Netherland'}
]

emails = [user["email"] for user in users]
email_groups = create_groups(emails, 3)
# -> [
#   ['usera@email.com', 'userb@email.com'], 
#   ['userc@email.com', 'userd@email.com']
# ]

Finally, when the groups have been created, the function join_data_on_groups joins the original user dictionaries back onto the groups. It takes the email groups from before and the users grouped by email as arguments:

def join_data_on_groups(groups, item_to_data):
    item_to_data = {item: list(data) for item, data in item_to_data}

    groups = [(item_to_data[item] for item in group) for group in groups]
    groups = [list(chain(*group)) for group in groups]

    return groups


email_getter = itemgetter("email")
users_grouped_by_email = groupby(sorted(users, key=email_getter), email_getter)

user_groups = join_data_on_groups(email_groups, users_grouped_by_email)

print(user_groups)

Result:

[
  [
    {'email': 'usera@email.com', 'id': 1, 'country': 'UK'},
    {'email': 'usera@email.com', 'id': 1, 'country': 'Germany'}, 
    {'email': 'userb@email.com', 'id': 2, 'country': 'UK'}
  ],
  [
    {'email': 'userc@email.com', 'id': 3, 'country': 'Italy'},
    {'email': 'userc@email.com', 'id': 3, 'country': 'Netherland'},
    {'email': 'userd@email.com', 'id': 4, 'country': 'France'}
  ]
]

I would consider using a queue or FIFO type and popping elements off for use, instead of saving dicts in a list. But working with what you have, you could either create a new sorted list first and do roughly what you were doing, or use another approach, since there are many ways to organize data (in fact, your constraint is different in that you want to assign each output object to a variable name? I'll ignore that part):

  1. Create a dictionary D of type str:list, where the key is the user email and the value is a list of all dict entries from total_list, initially empty ([]). If you have a lot of data, queueing/generators would be better, but the point is that you're filtering/formatting your input.
  2. Parse your total_list into D, so on every hit of an identical user email you append that dict to that key's value list. total_list could be deleted afterwards.
  3. Now parse D, forming your output list (or generator) of lists of dictionaries, with a limit of 3 dicts per list. This could be a generator similar to the one you have now; a sketch of this approach follows below.
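A minimal sketch of these three steps (the helper names group_by_email and batches_from_buckets, and the greedy packing of whole email groups into size-limited batches, are my own illustration, not part of the original answer):

from collections import defaultdict

def group_by_email(records):
    # Steps 1 and 2: bucket every record under its email key
    buckets = defaultdict(list)
    for record in records:
        buckets[record['email']].append(record)
    return buckets

def batches_from_buckets(buckets, limit=3):
    # Step 3: yield lists of dicts, keeping each email's records together
    batch = []
    for records in buckets.values():
        # start a new batch if this email's records would overflow the limit
        if batch and len(batch) + len(records) > limit:
            yield batch
            batch = []
        # note: a single email with more than `limit` records still stays
        # together, so such a batch can exceed the limit
        batch.extend(records)
    if batch:
        yield batch

sublists = list(batches_from_buckets(group_by_email(total_list), limit=3))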

General solution (explanation below):

import pandas as pd
import numpy as np
from numberpartitioning import karmarkar_karp

def solution(data, groupby: str, partition_size: int):
    df = pd.DataFrame(data)
    groups = df.groupby([groupby]).count()
    groupby_counts = groups.iloc[:, 0].values
    num_parts = len(df) // partition_size
    result = karmarkar_karp(groupby_counts, num_parts=num_parts, return_indices=True)
    part_keys = groups.index.values[np.array(result.partition)]
    partitions = [df.loc[df[groupby].isin(key)].to_dict('records') for key in part_keys]
    return partitions


solution(total_list, groupby="email", partition_size=3)

This gives a valid solution (although grouped slightly differently from your example solution):

[[{'country': 'UK', 'email': 'userb@email.com', 'id': 2},
  {'country': 'Italy', 'email': 'userc@email.com', 'id': 3},
  {'country': 'Netherland', 'email': 'userc@email.com', 'id': 3}],
 [{'country': 'UK', 'email': 'usera@email.com', 'id': 1},
  {'country': 'Germany', 'email': 'usera@email.com', 'id': 1},
  {'country': 'France', 'email': 'userd@email.com', 'id': 4}]]

Explanation

We can use a partitioning algorithm, like the Karmarkar-Karp algorithm. It partitions a set of numbers into k partitions such that the sums of the partitions are as close as possible. There already exists a pure Python implementation, numberpartitioning. Just python3 -m pip install numberpartitioning.

The algorithm only works with numbers, but we can encode the email groups using just the count of emails per group. Let's use a dataframe to hold your data:

>>> df = pd.DataFrame(total_list)

Then find the counts, grouped by email:

>>> email_counts = df.groupby(["email"])["id"].count().rename("count")

For example, the group counts for total_list:

>>> email_counts
email
usera@email.com    2
userb@email.com    1
userc@email.com    2
userd@email.com    1
Name: count, dtype: int64

In your example we want 3 entries per partition (so partition_size=3), which means the number of partitions is num_parts = len(total_list) / partition_size = 2.
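In REPL form (assuming total_list holds just the six example entries shown above):

>>> partition_size = 3
>>> num_parts = len(total_list) // partition_size  # 6 // 3 == 2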

So then if we do karmarkar_karp([2, 1, 2, 1], num_parts=2), we get the partition [[2, 1], [2, 1]] with partition sizes [3, 3].

But we don't care about the counts themselves; we care about which email is associated with each count. So we simply return the indices:

>>> result = karmarkar_karp(email_counts.values, num_parts=2, return_indices=True)
>>> result
PartitioningResult(partition=[[2, 1], [0, 3]], sizes=[3, 3])

Based on the indices, the groupings are:

partition 1: indices [2, 1] -> [userc, userb]
partition 2: indices [0, 3] -> [usera, userd]

which is a little different from what you wrote, but nevertheless a valid solution.

We find the email partitions by running:

>>> email_partitions = email_counts.index.values[np.array(result.partition)]

Given the email partitions, we now just have to split every entry in total_list based on which partition it belongs to.

>>> partitions = [df.loc[df["email"].isin(emails)].to_dict('records') for emails in email_partitions]

And then printing partitions, we have:

>>> partitions
[[{'email': 'userb@email.com', 'id': 2, 'country': 'UK'},
  {'email': 'userc@email.com', 'id': 3, 'country': 'Italy'},
  {'email': 'userc@email.com', 'id': 3, 'country': 'Netherland'}],
 [{'email': 'usera@email.com', 'id': 1, 'country': 'UK'},
  {'email': 'usera@email.com', 'id': 1, 'country': 'Germany'},
  {'email': 'userd@email.com', 'id': 4, 'country': 'France'}]]
