I will try to explain my problem in the clearest way possible. Assume we have the df
dataframe:
import pandas as pd
users = ['a','b','c','d','e','f','g','h', 'a','b','c','g','h', 'b','c','d','e']
groups = ['g1']*8 + ['g2']*5 + ['g3']*4
scores = [0.54, 0.02, 0.78, 0.9 , 0.98, 0.27, 0.25, 0.98, 0.47, 0.02, 0.8, 0.51, 0.28, 0.53, 0.01, 0.51, 0.6 ]
df = pd.DataFrame({'user': users,
                   'group': groups,
                   'score': scores}).sort_values('score', ascending=False)
This will return something like this:
user group score
7 h g1 0.98
4 e g1 0.98
3 d g1 0.90
10 c g2 0.80
2 c g1 0.78
16 e g3 0.60
0 a g1 0.54
13 b g3 0.53
11 g g2 0.51
15 d g3 0.51
8 a g2 0.47
12 h g2 0.28
5 f g1 0.27
6 g g1 0.25
1 b g1 0.02
9 b g2 0.02
14 c g3 0.01
Each user has a certain score when belonging to each group. The thing is that each group can have a limited number of members. These numbers are stored in a dictionary:
members = {'g1': 3,
           'g2': 2,
           'g3': 1}
And here is the problem: I have to choose the best way to distribute the users in groups, taking into account their scores and the number of users each group can host.
If we take a look at the dataframe above, the best way to assign the users to the groups would be the following:

- h, e and d belong to g1. Given that g1 can take up to 3 members, these three users are assigned to it. Now g1 can't take any more members.
- c belongs to g2, so g2 now has one slot left.
- The next row assigns c to g1, but this user was already assigned, so it can't be assigned twice and must be ignored. The same happens with the following row, which relates e (a user already assigned to g1) to g3.
- The next row assigns a to g1, but this group is full, so it has to be ignored as well.

The solution I found is this one:
final = pd.DataFrame([])
# As long as there are non-assigned users and groups with free slots...
while len(df):
    # Take the first row (i.e. the best score of the rows left)
    i = df.first_valid_index()
    # If this row's group has free slots...
    if members[df.loc[i, 'group']] > 0:
        # Subtract 1 from the slots left of this group
        members[df.loc[i, 'group']] -= 1
        # Append this row to the 'final' DataFrame
        # (DataFrame.append was removed in pandas 2.0; pd.concat is equivalent)
        final = pd.concat([final, df.loc[[i]]])
        # Delete all rows belonging to this user, as it was already assigned
        df = df.loc[df.user != df.loc[i, 'user']]
    # If the group has no free slots left...
    else:
        # Delete all rows belonging to this group, as it is already full
        df = df.loc[df.group != df.loc[i, 'group']]
final = final.groupby('group').agg({'user': ['unique', 'count']})
This returns the following DataFrame:
user
unique count
group
g1 [h, e, d] 3
g2 [c, g] 2
g3 [b] 1
Here is the problem: this code takes forever to run in real life. I have more than 20 million different users, and there are approximately 10 different groups to fill. So this approach is really non-viable.
Is there a more efficient way to do this? I'm willing to take a sub-optimal solution if necessary. Namely, assigning the almost-best users to every group... If that makes sense.
Not exactly an answer, but it got too long for a comment.
Sorting a 20-million-row dataset shouldn't take that long, and everything after it should run in linear time. I have a hunch the deletions are what gets expensive, specifically the lines df = df.loc[...]. Imagine you have 20M users, each occurring twice, so 40M rows. Each user will be deleted once. If each user deletion scans the entire DataFrame, that's 20M deletions over an average of 20M remaining rows, so on the order of 400*10^12 operations.
You can implement the same algorithm without any deletions, in O(1) time per row scanned. Simply keep an "assigned" bit for each user (in lower-level languages you'd have a boolean array). When you assign a user, set its bit to 1. For each row, check that the group has remaining spots and the user is unassigned. Now no deletions are necessary; rows with assigned users will be skipped naturally.
Sorry I'm not fluent enough in Python to provide code.
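A minimal Python sketch of the assigned-bit idea described above (my own illustration, not the answerer's code), assuming the rows arrive pre-sorted by descending score as (user, group, score) tuples:

```python
# Greedy assignment without deletions: one "assigned" flag per user.
# Rows are scanned once in descending-score order; rows whose user is
# already placed, or whose group is full, are simply skipped.
def assign(rows, slots):
    assigned = set()   # the per-user "assigned bit"
    result = []
    for user, group, score in rows:
        if user in assigned:
            continue
        if slots.get(group, 0) > 0:
            slots[group] -= 1
            assigned.add(user)
            result.append((user, group, score))
    return result

rows = [('h', 'g1', 0.98), ('e', 'g1', 0.98), ('d', 'g1', 0.90),
        ('c', 'g2', 0.80), ('c', 'g1', 0.78), ('e', 'g3', 0.60),
        ('a', 'g1', 0.54), ('b', 'g3', 0.53), ('g', 'g2', 0.51)]
print(assign(rows, {'g1': 3, 'g2': 2, 'g3': 1}))
# h, e and d fill g1; c takes a g2 slot; b fills g3; g takes the last g2 slot
```

This visits every row at most once and does O(1) work per row, so the whole pass is linear in the number of rows after sorting.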
This is what I would do:
df = df.pivot_table(index='user', columns='group', values='score').reset_index().fillna(0)
final = {}
df['sum'] = df.loc[:, 'g1':].sum(axis=1)
for group in members.keys():
    df[group] = df[group] / df['sum']
for group in members.keys():
    df = df.sort_values(group, ascending=False)
    final[group] = list(df.head(members[group])['user'])
    df = df.iloc[members[group]:, :]
final
Output:
{'g1': ['f', 'h', 'd'], 'g2': ['g', 'c'], 'g3': ['b']}
Explanation: For each user I calculate how relevant they are to each group in comparison to all the groups. Then every group gets the users who are the most oriented toward that group; I remove those users and do the same with the remaining groups.
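As a worked example of that relevance calculation, take user c, whose scores in the question's df are 0.78 (g1), 0.80 (g2) and 0.01 (g3):

```python
# Worked example of the relevance heuristic for user 'c'
# (scores taken from the question's df).
scores_c = {'g1': 0.78, 'g2': 0.80, 'g3': 0.01}
total = sum(scores_c.values())                     # 1.59
relevance = {g: s / total for g, s in scores_c.items()}
print(relevance)   # g2 edges out g1, so 'c' leans toward g2
```

The normalized values always sum to 1 per user, so a user with one dominant score is strongly "pulled" toward that group even if another user has a higher raw score there.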
Below is an attempt following Cătălin Frâncu's suggestion (using numpy instead of pandas). It is a simplified version that shows the dispatching according to your requirement:

- The ref array is indexed directly by user number (instead of going through a user_id mapping as in OP()).
- I have not sorted by score, which is of little interest here.
- The dispatching seems to slow down around 9M rows, most likely because all the users have been dispatched by then.
- With n_users = 1e5 it takes around 3 s; with 1e7 I don't know, I quit before it finished.
def OP():
    groups = [0, 3, 2, 1]  # slots left per group (index 0 unused; groups are numbered 1-3)
    users = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
    user_id = {}
    ref = []
    for i in range(len(users)):
        user_id[users[i]] = i
        ref.append(False)
    entries = [('h', 1, '0.98'), ('e', 1, '0.98'), ('d', 1, '0.90'),
               ('c', 2, '0.80'), ('c', 1, '0.78'), ('e', 3, '0.60'),
               ('a', 1, '0.54'), ('b', 3, '0.53'), ('g', 2, '0.51'),
               ('d', 3, '0.51'), ('a', 2, '0.47'), ('h', 2, '0.28'),
               ('f', 1, '0.27'), ('g', 1, '0.25'), ('b', 1, '0.02'),
               ('b', 2, '0.02'), ('c', 3, '0.01')]
    out = []
    for u, g, s in entries:
        if ref[user_id[u]]:
            continue
        if groups[g] > 0:
            groups[g] -= 1
            out.append((u, g, s))
            ref[user_id[u]] = True
    print(out)
    # [('h', 1, '0.98'), ('e', 1, '0.98'), ('d', 1, '0.90'), ('c', 2, '0.80'), ('b', 3, '0.53'), ('g', 2, '0.51')]
def test():
    import numpy as np
    n_users = int(1e7)
    n_groups = 10
    groups = [3, 1e6, 1e7, 1e6, 1e6, 1e6, 1e6, 1e6, 1e6, 1e6]
    print('allocating array')
    N = n_users * n_groups
    dscores = np.random.random((N, 1))
    dusers = np.random.randint(0, n_users, (N, 1))
    dgroups = np.random.randint(0, n_groups, (N, 1))
    print('building ref')
    ref = np.zeros(n_users, dtype=int)
    print('hstack')
    entries = np.hstack((dusers, dgroups, dscores))
    print('dispatching')
    out = np.zeros((n_users, 3))
    z = 0
    counter = 0
    for e in entries:
        counter += 1
        if counter % 1e6 == 0:
            print('ccc', counter)
        u, g, s = e
        u = int(u)
        g = int(g)
        if ref[u] == 1:
            continue
        if groups[g] > 0:
            groups[g] -= 1
            out[z][0] = u
            out[z][1] = g
            out[z][2] = s
            ref[u] = 1
            z += 1
            if z % 1e5 == 0:
                print('z : ', z)
    print('done')

OP()
test()
I am by no means an expert at Numba, and this might be slower. But I have, in the past, had success writing complex algorithms with Numba and loops. If you have a lot of data you might need to change int8 to a bigger datatype.
import pandas as pd
import numpy as np
import numba
# Basic setup:
users = ['a','b','c','d','e','f','g','h', 'a','b','c','g','h', 'b','c','d','e']
groups = ['g1']*8 + ['g2']*5 + ['g3']*4
scores = [0.54, 0.02, 0.78, 0.9 , 0.98, 0.27, 0.25, 0.98, 0.47, 0.02, 0.8, 0.51, 0.28, 0.53, 0.01, 0.51, 0.6 ]
df = pd.DataFrame({'user': users,
                   'group': groups,
                   'score': scores}).sort_values('score', ascending=False)
# Convert user, groups and limits to numbers:
df['user'] = df.user.astype('category')
df['group'] = df.group.astype('category')
df['usercat'] = df.user.cat.codes
df['groupcat'] = df.group.cat.codes
member_mapping_temp = dict( enumerate(df['group'].cat.categories ) )
members = {'g1': 3,
           'g2': 2,
           'g3': 1}
member_map = np.array([(x,members.get(y)) for x,y in member_mapping_temp.items()])
# Define numba njit function to solve problem:
from numba import types
from numba.typed import Dict, List

int_array = types.int8[:]

@numba.njit()
def calc_scores(numpy_array, member_map):
    # Per-group slot limits and running counts, as Numba typed dicts
    member_map_limits = Dict.empty(
        key_type=types.int8,
        value_type=types.int8,
    )
    member_count = Dict.empty(
        key_type=types.int8,
        value_type=types.int8,
    )
    member_list = []
    for ix in range(len(member_map)):
        group = member_map[ix, 0]
        limit = member_map[ix, 1]
        member_map_limits[group] = limit
        member_count[group] = 0
    seen_users = set()
    # Rows are already sorted by score, so a single greedy pass suffices
    for ix in range(len(numpy_array)):
        user = numpy_array[ix, 0]
        group = numpy_array[ix, 1]
        if user in seen_users:
            continue
        if member_map_limits[group] == member_count[group]:
            continue
        member_count[group] = member_count[group] + 1
        member_list.append((group, user))
        seen_users.add(user)
    return member_list
# Call function:
res = calc_scores(df[['usercat','groupcat']].to_numpy(), member_map)
# Add result to DF
res = pd.DataFrame(res, columns=['group','member'])
# Map back to values
res['group'] = pd.Categorical.from_codes(codes=res['group'], dtype=df['group'].dtype)
res['member'] = pd.Categorical.from_codes(codes=res['member'], dtype=df['user'].dtype)
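To get the same kind of summary as the question's final DataFrame, the result can be grouped afterwards. A small self-contained sketch (using a hypothetical res in the same one-row-per-(group, member) shape as above):

```python
import pandas as pd

# Hypothetical `res` in the shape produced above: one row per (group, member)
res = pd.DataFrame({'group': ['g1', 'g1', 'g1', 'g2', 'g2', 'g3'],
                    'member': ['h', 'e', 'd', 'c', 'g', 'b']})

# Same summary as the question's `final` DataFrame: unique members and counts per group
summary = res.groupby('group')['member'].agg(['unique', 'count'])
print(summary)
```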
Please let me know if this is any faster on the real dataset.