I currently have a list of dictionaries that looks like this:
total_list = [
{'email': 'usera@email.com', 'id': 1, 'country': 'UK'},
{'email': 'usera@email.com', 'id': 1, 'country': 'Germany'},
{'email': 'userb@email.com', 'id': 2, 'country': 'UK'},
{'email': 'userc@email.com', 'id': 3, 'country': 'Italy'},
{'email': 'userc@email.com', 'id': 3, 'country': 'Netherland'},
{'email': 'userd@email.com', 'id': 4, 'country': 'France'},
...
]
I want to split it primarily based on size; let's say the new sublist size is 3 items per list. But I also want to make sure that all entries for the same user will be in the same new sublist.
So the result I am trying to create is:
list_a = [
{'email': 'usera@email.com', 'id': 1, 'country': 'UK'},
{'email': 'userb@email.com', 'id': 2, 'country': 'UK'},
{'email': 'usera@email.com', 'id': 1, 'country': 'Germany'}
]
list_b = [
{'email': 'userc@email.com', 'id': 3, 'country': 'Italy'},
{'email': 'userd@email.com', 'id': 4, 'country': 'France'},
{'email': 'userc@email.com', 'id': 3, 'country': 'Netherland'},
...
]
Obviously in the example that I provided the users were located really close to each other in the list, but in reality they could be spread much further apart. I was considering sorting the list by email and then splitting it, but I am not sure what happens if the items that are supposed to be grouped together happen to sit exactly where the main list gets divided.
What I have tried so far is:
from math import ceil

def list_splitter(main_list, size):
    for i in range(0, len(main_list), size):
        yield main_list[i:i + size]

# calculating the needed number of sublists
max_per_batch = 3
number_of_sublists = ceil(len(total_list) / max_per_batch)

# sort the data by email
total_list.sort(key=lambda x: x['email'])

sublists = list(list_splitter(main_list=total_list, size=max_per_batch))
The issue is that with this logic I cannot guarantee that items with the same email value will end up in the same sublist. Because of the sorting, chances are that this will happen, but it is not certain.
Basically, I need a method to make sure that items with the same email will always be in the same sublist, while the main condition of the split is the sublist size.
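To make the failure mode concrete, here is a small hypothetical example of the sort-then-split approach: with three rows and a sublist size of 2, the two rows for userb straddle the chunk boundary and land in different sublists.

```python
def list_splitter(main_list, size):
    # naive chunking: slice the list into consecutive pieces of `size`
    for i in range(0, len(main_list), size):
        yield main_list[i:i + size]

data = [
    {'email': 'userb@email.com', 'id': 2, 'country': 'UK'},
    {'email': 'usera@email.com', 'id': 1, 'country': 'UK'},
    {'email': 'userb@email.com', 'id': 2, 'country': 'Germany'},
]

# sort by email, then chunk into sublists of 2
data.sort(key=lambda x: x['email'])
chunks = list(list_splitter(data, 2))

# chunks[0] holds usera and one userb row;
# chunks[1] holds the other userb row -> userb is split across sublists
```

So sorting makes duplicates adjacent, but a chunk boundary can still fall inside a run of identical emails.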
This solution starts off by working only with the list of emails. The emails are grouped based on their frequency and the limit on group size. Later the remaining data, i.e. id and country, are joined back onto the email groups.

The first function, create_groups, works on the list of emails. It counts the number of occurrences of each email and groups them. Each new group starts with the most frequent email. If there is room left in the group, it looks for the most frequent email that also fits in the group. If such an item exists, it is added to the group. This is repeated until the group is full; then a new group is started.
from operator import itemgetter
from itertools import groupby, chain
from collections import Counter

def create_groups(items, group_size_limit):
    # Count the frequency of all items and create a list of items
    # sorted by descending frequency
    items_not_grouped = Counter(items).most_common()
    groups = []
    while items_not_grouped:
        # Start a new group with the most frequent ungrouped item
        item, count = items_not_grouped.pop(0)
        group, group_size = [item], count

        while group_size < group_size_limit:
            # If there is room left in the group, look for a new group member
            for index, (candidate, candidate_count) \
                    in enumerate(items_not_grouped):
                if candidate_count <= group_size_limit - group_size:
                    # If the candidate fits, add it to the group ...
                    group.append(candidate)
                    group_size += candidate_count
                    # ... and remove it from the items not grouped
                    items_not_grouped.pop(index)
                    break
            else:
                # If the for loop did not break, no items fit in the group
                break

        groups.append(group)
    return groups
This is the result of using that function on your example:
users = [
{'email': 'usera@email.com', 'id': 1, 'country': 'UK'},
{'email': 'userb@email.com', 'id': 2, 'country': 'UK'},
{'email': 'usera@email.com', 'id': 1, 'country': 'Germany'},
{'email': 'userc@email.com', 'id': 3, 'country': 'Italy'},
{'email': 'userd@email.com', 'id': 4, 'country': 'France'},
{'email': 'userc@email.com', 'id': 3, 'country': 'Netherland'}
]
emails = [user["email"] for user in users]
email_groups = create_groups(emails, 3)
# -> [
# ['usera@email.com', 'userb@email.com'],
# ['userc@email.com', 'userd@email.com']
# ]
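One edge case worth noting (my observation, not stated in the answer above): if a single email occurs more often than group_size_limit, the inner while condition is false from the start, so that email forms a group on its own whose total row count exceeds the limit. A self-contained check, restating create_groups so the snippet runs on its own:

```python
from collections import Counter

def create_groups(items, group_size_limit):
    # same greedy grouping as above: the most frequent item starts each
    # group, then the largest item that still fits is added until full
    items_not_grouped = Counter(items).most_common()
    groups = []
    while items_not_grouped:
        item, count = items_not_grouped.pop(0)
        group, group_size = [item], count
        while group_size < group_size_limit:
            for index, (candidate, candidate_count) in enumerate(items_not_grouped):
                if candidate_count <= group_size_limit - group_size:
                    group.append(candidate)
                    group_size += candidate_count
                    items_not_grouped.pop(index)
                    break
            else:
                break
        groups.append(group)
    return groups

# 'a' occurs 4 times but the limit is 3: 'a' still gets a group of its own,
# which will expand to 4 rows after the join
print(create_groups(['a', 'a', 'a', 'a', 'b'], 3))  # -> [['a'], ['b']]
```

Whether that behavior is acceptable depends on whether the size limit is a hard constraint in your use case.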
Finally, when the groups have been created, the function join_data_on_groups joins the original user dictionaries back onto the email groups. It takes the email groups from before and the users grouped by email as arguments:
def join_data_on_groups(groups, item_to_data):
    item_to_data = {item: list(data) for item, data in item_to_data}
    groups = [(item_to_data[item] for item in group) for group in groups]
    groups = [list(chain(*group)) for group in groups]
    return groups
email_getter = itemgetter("email")
users_grouped_by_email = groupby(sorted(users, key=email_getter), email_getter)
user_groups = join_data_on_groups(email_groups, users_grouped_by_email)
print(user_groups)
Result:
[
[
{'email': 'usera@email.com', 'id': 1, 'country': 'UK'},
{'email': 'usera@email.com', 'id': 1, 'country': 'Germany'},
{'email': 'userb@email.com', 'id': 2, 'country': 'UK'}
],
[
{'email': 'userc@email.com', 'id': 3, 'country': 'Italy'},
{'email': 'userc@email.com', 'id': 3, 'country': 'Netherland'},
{'email': 'userd@email.com', 'id': 4, 'country': 'France'}
]
]
I would consider using a queue or FIFO type and popping elements off for use, instead of keeping the dicts in a list. But working with what you have, you could either create a new sorted list first and do what you were doing (kinda), or take one of the many other possible approaches to organizing the data (in fact, your constraint is different in that you want to assign each output object to a variable name? I'll ignore that part): create a dictionary D whose values start as empty lists []. Filter/format your input by iterating over total_list and, on every hit of an identical user email, appending that dict to that key's value list. If you have a lot of data, queueing/generators would be better. Afterwards, total_list could be deleted.

import pandas as pd
import numpy as np
from numberpartitioning import karmarkar_karp
def solution(data, groupby: str, partition_size: int):
    df = pd.DataFrame(data)
    groups = df.groupby([groupby]).count()
    groupby_counts = groups.iloc[:, 0].values
    num_parts = len(df) // partition_size
    result = karmarkar_karp(groupby_counts, num_parts=num_parts, return_indices=True)
    part_keys = groups.index.values[np.array(result.partition)]
    partitions = [df.loc[df[groupby].isin(key)].to_dict('records') for key in part_keys]
    return partitions
solution(total_list, groupby="email", partition_size=3)
This gives a valid solution (although grouped slightly differently from your example solution):
[[{'country': 'UK', 'email': 'userb@email.com', 'id': 2},
{'country': 'Italy', 'email': 'userc@email.com', 'id': 3},
{'country': 'Netherland', 'email': 'userc@email.com', 'id': 3}],
[{'country': 'UK', 'email': 'usera@email.com', 'id': 1},
{'country': 'Germany', 'email': 'usera@email.com', 'id': 1},
{'country': 'France', 'email': 'userd@email.com', 'id': 4}]]
We can use a partitioning algorithm, like the Karmarkar-Karp algorithm. It partitions a set of numbers into k partitions such that the sums of the partitions are as close as possible. There already exists a pure-Python implementation, numberpartitioning. Just run python3 -m pip install numberpartitioning.
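For intuition about what the package computes, here is a hypothetical sketch of the two-way version of Karmarkar-Karp's largest-differencing heuristic. It only returns the achievable difference between the two partition sums, not the partitions themselves; the function name is mine, and the numberpartitioning package does the full bookkeeping, including k-way splits.

```python
import heapq

def kk_difference_two_way(numbers):
    # Largest-differencing heuristic for 2 partitions: repeatedly replace
    # the two largest numbers with their difference (committing them to
    # opposite sides). The last remaining value is the difference between
    # the two partition sums. Note: a heuristic, not guaranteed optimal.
    heap = [-n for n in numbers]  # max-heap via negation
    heapq.heapify(heap)
    while len(heap) > 1:
        a = -heapq.heappop(heap)
        b = -heapq.heappop(heap)
        heapq.heappush(heap, -(a - b))
    return -heap[0] if heap else 0

print(kk_difference_two_way([2, 1, 2, 1]))  # -> 0, i.e. sums [3, 3]
```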
The algorithm only works with numbers, but we can encode groups of emails using just the count of emails per group. Let's use a dataframe to hold your data:
>>> df = pd.DataFrame(total_list)
Then find the counts, grouped by email:
>>> email_counts = df.groupby(["email"])["id"].count().rename("count")
For example, the group counts for total_list:
>>> email_counts
email
usera@email.com 2
userb@email.com 1
userc@email.com 2
userd@email.com 1
Name: count, dtype: int64
In your example we want 3 entries per partition (so partition_size=3), which means the number of partitions is num_parts = len(total_list) // partition_size = 2.
So if we run karmarkar_karp([2, 1, 2, 1], num_parts=2), we get the partition [[2, 1], [2, 1]] with partition sizes [3, 3].
But we don't care about the counts themselves; we care about which email is associated with each count. So we simply return the indices:
>>> result = karmarkar_karp(email_counts.values, num_parts=2, return_indices=True)
>>> result
PartitioningResult(partition=[[2, 1], [0, 3]], sizes=[3, 3])
Based on the indices, the groupings are:
partition 1: indices [2, 1] -> [userc, userb]
partition 2: indices [0, 3] -> [usera, userd]
which is a little different from what you wrote, but nevertheless a valid solution.
We find the email partitions by running:
>>> email_partitions = email_counts.index.values[np.array(result.partition)]
Given the email partitions, we now just have to split every entry in total_list based on which partition it belongs to.
>>> partitions = [df.loc[df["email"].isin(emails)].to_dict('records') for emails in email_partitions]
And then printing partitions, we have:
>>> partitions
[[{'email': 'userb@email.com', 'id': 2, 'country': 'UK'},
{'email': 'userc@email.com', 'id': 3, 'country': 'Italy'},
{'email': 'userc@email.com', 'id': 3, 'country': 'Netherland'}],
[{'email': 'usera@email.com', 'id': 1, 'country': 'UK'},
{'email': 'usera@email.com', 'id': 1, 'country': 'Germany'},
{'email': 'userd@email.com', 'id': 4, 'country': 'France'}]]