Split list of dictionaries into separate lists based primarily on list size but secondarily based on condition

Question

I currently have a list of dictionaries that looks like this:

total_list = [
    {'email': 'usera@email.com', 'id': 1, 'country': 'UK'},
    {'email': 'usera@email.com', 'id': 1, 'country': 'Germany'},
    {'email': 'userb@email.com', 'id': 2, 'country': 'UK'},
    {'email': 'userc@email.com', 'id': 3, 'country': 'Italy'},
    {'email': 'userc@email.com', 'id': 3, 'country': 'Netherland'},
    {'email': 'userd@email.com', 'id': 4, 'country': 'France'},
    ...
]

I want to split it primarily based on size; let's say the new sublists should hold 3 items each. But I also want to make sure that all items belonging to the same user end up in the same sublist.

So the result I am trying to create is:

list_a = [
    {'email': 'usera@email.com', 'id': 1, 'country': 'UK'},
    {'email': 'userb@email.com', 'id': 2, 'country': 'UK'},
    {'email': 'usera@email.com', 'id': 1, 'country': 'Germany'}
]
  
list_b = [
    {'email': 'userc@email.com', 'id': 3, 'country': 'Italy'},
    {'email': 'userd@email.com', 'id': 4, 'country': 'France'},
    {'email': 'userc@email.com', 'id': 3, 'country': 'Netherland'},
    ...
]

Obviously, in the example I provided the users are located close to each other in the list, but in reality they could be spread much further apart. I was considering sorting the list by email and then splitting it, but I am not sure what happens if the items that are supposed to stay together fall exactly where the main list is divided.

What I have tried so far is:

from math import ceil


def list_splitter(main_list, size):
    for i in range(0, len(main_list), size):
        yield main_list[i:i + size]

# calculating the needed number of sublists
max_per_batch = 3
number_of_sublists = ceil(len(total_list) / max_per_batch)

# sort the data by email
total_list.sort(key=lambda x: x['email'])

sublists = list(list_splitter(main_list=total_list, size=max_per_batch))

The issue is that with this logic I cannot 100% ensure that if there are any items with the same email value they will end up in the same sublist. Because of the sorting, chances are that this will happen, but it is not certain.

Basically, I need a method to make sure that items with the same email will always be in the same sublist, but the main condition of the split is the sublist size.
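
To make the failure mode concrete, here is a minimal sketch using made-up data (the addresses are hypothetical and are not taken from the list above). With three users of two rows each and a batch size of 3, the sorted list is cut in the middle of the second user's rows:

```python
def list_splitter(main_list, size):
    for i in range(0, len(main_list), size):
        yield main_list[i:i + size]


# Hypothetical data: three users with two rows each
rows = [{'email': e} for e in
        ['a@x.com', 'a@x.com', 'b@x.com', 'b@x.com', 'c@x.com', 'c@x.com']]
rows.sort(key=lambda x: x['email'])

batches = list(list_splitter(rows, 3))
emails_per_batch = [[row['email'] for row in batch] for batch in batches]
print(emails_per_batch)
# [['a@x.com', 'a@x.com', 'b@x.com'], ['b@x.com', 'c@x.com', 'c@x.com']]
```

The rows for b@x.com land in two different sublists, which is exactly what I need to avoid.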

Answer 1

This solution starts off by working only with the list of all emails. The emails are grouped based on their frequency and the limit on group size. Later the remaining data, i.e. id and country, are joined back onto the email groups.

The first function, create_groups, works on the list of emails. It counts the number of occurrences of each email and groups them. Each new group starts with the most frequent ungrouped email. If there is room left in the group, it looks for the most frequent remaining email that still fits. If such an email exists, it is added to the group.

This is repeated until the group is full; then, a new group is started.

from operator import itemgetter
from itertools import groupby, chain
from collections import Counter


def create_groups(items, group_size_limit):
    # Count the frequency of all items and create a list of items 
    # sorted by descending frequency
    items_not_grouped = Counter(items).most_common()
    groups = []

    while items_not_grouped:
        # Start a new group with the most frequent ungrouped item
        item, count = items_not_grouped.pop(0)
        group, group_size = [item], count
        while group_size < group_size_limit:
            # If there is room left in the group, look for a new group member
            for index, (candidate, candidate_count) \
                    in enumerate(items_not_grouped):
                if candidate_count <= group_size_limit - group_size:
                    # If the candidate fits, add it to the group
                    group.append(candidate)
                    group_size += candidate_count
                    # ... and remove it from the items not grouped
                    items_not_grouped.pop(index)
                    break
            else:
                # If the for loop did not break, no items fit in the group
                break

        groups.append(group)

    return groups

This is the result of using that function on your example:

users = [
    {'email': 'usera@email.com', 'id': 1, 'country': 'UK'},
    {'email': 'userb@email.com', 'id': 2, 'country': 'UK'},
    {'email': 'usera@email.com', 'id': 1, 'country': 'Germany'},
    {'email': 'userc@email.com', 'id': 3, 'country': 'Italy'},
    {'email': 'userd@email.com', 'id': 4, 'country': 'France'},
    {'email': 'userc@email.com', 'id': 3, 'country': 'Netherland'}
]

emails = [user["email"] for user in users]
email_groups = create_groups(emails, 3)
# -> [
#   ['usera@email.com', 'userb@email.com'], 
#   ['userc@email.com', 'userd@email.com']
# ]

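Finally, once the groups have been created, the function join_data_on_groups maps them back onto the original list of user dictionaries. It takes the email groups from before and the users grouped by email (an iterable of (email, rows) pairs, as produced by itertools.groupby) as arguments:

def join_data_on_groups(groups, item_to_data):
    # Materialize the (item, rows) pairs into a lookup table
    item_to_data = {item: list(data) for item, data in item_to_data}

    # Replace every item in a group with its rows ...
    groups = [(item_to_data[item] for item in group) for group in groups]
    # ... and flatten each group into a single list of rows
    groups = [list(chain(*group)) for group in groups]

    return groups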


email_getter = itemgetter("email")
users_grouped_by_email = groupby(sorted(users, key=email_getter), email_getter)

user_groups = join_data_on_groups(email_groups, users_grouped_by_email)

print(user_groups)

Result:

[
  [
    {'email': 'usera@email.com', 'id': 1, 'country': 'UK'},
    {'email': 'usera@email.com', 'id': 1, 'country': 'Germany'}, 
    {'email': 'userb@email.com', 'id': 2, 'country': 'UK'}
  ],
  [
    {'email': 'userc@email.com', 'id': 3, 'country': 'Italy'},
    {'email': 'userc@email.com', 'id': 3, 'country': 'Netherland'},
    {'email': 'userd@email.com', 'id': 4, 'country': 'France'}
  ]
]

Answer 2

I would consider using a queue or FIFO type and popping elements off for use, instead of saving dicts in a list. But working with what you have, you could either create a new sorted list first and do roughly what you were doing, or use the approach below; there are many ways to organize data like this. (Your constraint seems to differ in that you want to assign each output list to its own variable name. I'll ignore that part.)

Create a dictionary D mapping str to list, where each key is a user email and each value is a list (initially empty) of all the dict entries from total_list for that user. If you have a lot of data, queuing/generators would be better, but the point is that you're filtering/formatting your input.
Parse your total_list into D: on every hit of an identical user email, append that dict to that key's value list. total_list can then be deleted.
Parse D, forming your output list (or generator) of lists of dictionaries, with a limit of 3 dicts per list. This could be a generator similar to what you have now.
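
The three steps above could be sketched roughly as follows. This is only one possible reading of them: the function name split_keeping_users_together is made up for illustration, and the first-fit packing in the last step is an assumption, since the steps do not say how the per-user lists should be combined. A user with more rows than the limit gets a sublist of their own that exceeds the limit.

```python
from collections import defaultdict


def split_keeping_users_together(rows, key, limit):
    # Steps 1 and 2: bucket every row under its key (here: the email)
    by_key = defaultdict(list)
    for row in rows:
        by_key[row[key]].append(row)

    # Step 3 (assumed strategy): put each bucket into the first
    # sublist that still has room, else start a new sublist
    batches = []
    for bucket in by_key.values():
        for batch in batches:
            if len(batch) + len(bucket) <= limit:
                batch.extend(bucket)
                break
        else:
            # No existing sublist had room for this bucket
            batches.append(list(bucket))
    return batches


total_list = [
    {'email': 'usera@email.com', 'id': 1, 'country': 'UK'},
    {'email': 'usera@email.com', 'id': 1, 'country': 'Germany'},
    {'email': 'userb@email.com', 'id': 2, 'country': 'UK'},
    {'email': 'userc@email.com', 'id': 3, 'country': 'Italy'},
    {'email': 'userc@email.com', 'id': 3, 'country': 'Netherland'},
    {'email': 'userd@email.com', 'id': 4, 'country': 'France'},
]

batches = split_keeping_users_together(total_list, 'email', 3)
print([[row['email'] for row in batch] for batch in batches])
# [['usera@email.com', 'usera@email.com', 'userb@email.com'],
#  ['userc@email.com', 'userc@email.com', 'userd@email.com']]
```

First-fit is simple but does not always fill the sublists as tightly as possible.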

Answer 3

General solution (explanation below):

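import math

import numpy as np
import pandas as pd
from numberpartitioning import karmarkar_karp


def solution(data, groupby: str, partition_size: int):
    df = pd.DataFrame(data)
    # Number of rows per group (e.g. per email)
    groups = df.groupby([groupby]).count()
    groupby_counts = groups.iloc[:, 0].values
    # Round up so that leftover rows get a partition of their own
    num_parts = math.ceil(len(df) / partition_size)
    # return_indices=True yields positions into groupby_counts instead of the counts
    result = karmarkar_karp(groupby_counts, num_parts=num_parts, return_indices=True)
    # Index partition by partition, so partitions holding different numbers of groups also work
    part_keys = [groups.index.values[np.array(indices, dtype=int)] for indices in result.partition]
    partitions = [df.loc[df[groupby].isin(key)].to_dict('records') for key in part_keys]
    return partitions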


solution(total_list, groupby="email", partition_size=3)

This gives a valid solution (although grouped slightly differently from your example solution):

[[{'country': 'UK', 'email': 'userb@email.com', 'id': 2},
  {'country': 'Italy', 'email': 'userc@email.com', 'id': 3},
  {'country': 'Netherland', 'email': 'userc@email.com', 'id': 3}],
 [{'country': 'UK', 'email': 'usera@email.com', 'id': 1},
  {'country': 'Germany', 'email': 'usera@email.com', 'id': 1},
  {'country': 'France', 'email': 'userd@email.com', 'id': 4}]]

Explanation

We can use a partitioning algorithm, like the Karmarkar-Karp algorithm. It partitions a set of numbers into k partitions such that the sums of the partitions are as close to each other as possible. There already exists a pure Python implementation, numberpartitioning. Just run python3 -m pip install numberpartitioning.

The algorithm only works with numbers, but we can encode groups of emails using just the count of emails per group. Let's use a dataframe to hold your data:

>>> df = pd.DataFrame(total_list)

Then find the counts, grouped by email:

>>> email_counts = df.groupby(["email"])["id"].count().rename("count")

For example, the group counts for total_list:

>>> email_counts
email
usera@email.com    2
userb@email.com    1
userc@email.com    2
userd@email.com    1
Name: count, dtype: int64

In your example we want 3 entries per partition (so partition_size=3), which means the number of partitions is num_parts = len(total_list) / partition_size = 6 / 3 = 2.

So then if we do karmarkar_karp([2, 1, 2, 1], num_parts=2), we get the partition [[2, 1], [2, 1]] and the partition sizes [3, 3].

But we don't care about the counts, we care about which email is associated with each count. So, we simply return the indices:

>>> result = karmarkar_karp(email_counts.values, num_parts=2, return_indices=True)
>>> result
PartitioningResult(partition=[[2, 1], [0, 3]], sizes=[3, 3])

Based on the indices, the groupings are:

partition 1: indices [2, 1] -> [userc, userb]
partition 2: indices [0, 3] -> [usera, userd]

which is a little different than what you wrote, but nevertheless a valid solution.

We find the email partitions by running:

>>> email_partitions = email_counts.index.values[np.array(result.partition)]

Given the email partitions, we now just have to split every entry in total_list based on which partition it belongs to.

>>> partitions = [df.loc[df["email"].isin(emails)].to_dict('records') for emails in email_partitions]

And then printing partitions, we have:

>>> partitions
[[{'email': 'userb@email.com', 'id': 2, 'country': 'UK'},
  {'email': 'userc@email.com', 'id': 3, 'country': 'Italy'},
  {'email': 'userc@email.com', 'id': 3, 'country': 'Netherland'}],
 [{'email': 'usera@email.com', 'id': 1, 'country': 'UK'},
  {'email': 'usera@email.com', 'id': 1, 'country': 'Germany'},
  {'email': 'userd@email.com', 'id': 4, 'country': 'France'}]]
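
Because the algorithm balances the partition sums rather than enforcing a hard cap, it may be worth checking the result before relying on it. Below is a minimal sketch of such a check; the helper name keeps_users_together is made up for illustration and is not part of numberpartitioning or pandas.

```python
def keeps_users_together(partitions, key, limit):
    """Return True if no key value spans two partitions
    and no partition holds more than `limit` rows."""
    owner = {}  # key value -> index of the partition it first appeared in
    for index, part in enumerate(partitions):
        if len(part) > limit:
            return False
        for row in part:
            if owner.setdefault(row[key], index) != index:
                return False
    return True


good = [
    [{'email': 'userb@email.com'}, {'email': 'userc@email.com'}, {'email': 'userc@email.com'}],
    [{'email': 'usera@email.com'}, {'email': 'usera@email.com'}, {'email': 'userd@email.com'}],
]
bad = [
    [{'email': 'usera@email.com'}, {'email': 'userb@email.com'}],
    [{'email': 'usera@email.com'}],
]

print(keeps_users_together(good, 'email', 3))  # True
print(keeps_users_together(bad, 'email', 3))   # False
```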
