
How to find the best way to distribute N observations into M groups?

I will try to explain my problem in the clearest way possible. Assume we have the df dataframe:

import pandas as pd

users = ['a','b','c','d','e','f','g','h', 'a','b','c','g','h', 'b','c','d','e']
groups = ['g1']*8 + ['g2']*5 + ['g3']*4
scores = [0.54, 0.02, 0.78, 0.9 , 0.98, 0.27, 0.25, 0.98, 0.47, 0.02, 0.8, 0.51, 0.28, 0.53, 0.01, 0.51, 0.6 ]
df = pd.DataFrame({'user': users,
                   'group': groups,
                   'score': scores}).sort_values('score', ascending=False)

This will return something like this:

   user group  score
7     h    g1   0.98
4     e    g1   0.98
3     d    g1   0.90
10    c    g2   0.80
2     c    g1   0.78
16    e    g3   0.60
0     a    g1   0.54
13    b    g3   0.53
11    g    g2   0.51
15    d    g3   0.51
8     a    g2   0.47
12    h    g2   0.28
5     f    g1   0.27
6     g    g1   0.25
1     b    g1   0.02
9     b    g2   0.02
14    c    g3   0.01

Each user has a certain score when belonging to each group. The thing is that each group can have a limited number of members. These numbers are stored in a dictionary:

members = {'g1': 3,
           'g2': 2,
           'g3': 1}

And here is the problem: I have to choose the best way to distribute the users in groups, taking into account their scores and the number of users each group can host.

If we take a look at the dataframe above, the best way to assign the users to the groups would be the following one:

  1. The highest scores are the ones assigned to h, e and d belonging to g1. Given that g1 can take up to 3 members, these three users are assigned to it. Now g1 can't take any more members.
  2. The next best score is the one assigned to c belonging to g2. Therefore g2 now has one slot left.
  3. Observe that the next score also refers to c, but this user was already assigned, so it can't be assigned twice and must be ignored. The same happens with the one after it, which relates e (a user already assigned to g1) to g3.
  4. The next one relates a to g1, but this group is full, so it has to be ignored as well.
  5. The process goes on until all groups are full, or until there are no rows left to fill the groups (in which case some groups will have free slots left).

The solution I found is this one:

final = pd.DataFrame()
# As long as there are non-assigned users and groups with free slots...
while len(df):
    # Take the first row (i.e. the best score of the rows left)
    i = df.first_valid_index()
    # If there are free slots...
    if members[df.loc[i, 'group']] > 0:
        # Subtract 1 from the slots left of this group
        members[df.loc[i, 'group']] -= 1
        # Append this row to the 'final' DataFrame
        # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
        final = pd.concat([final, df.loc[[i]]])
        # Delete all rows belonging to this user, as it was already assigned
        df = df.loc[df.user != df.loc[i, 'user']]
    # If the group has no free slots left...
    else:
        # Delete all rows belonging to this group, as it is already full
        df = df.loc[df.group != df.loc[i, 'group']]
final = final.groupby('group').agg({'user': ['unique', 'count']})

This returns the following DataFrame:

            user      
          unique count
group                 
g1     [h, d, f]     3
g2        [c, g]     2
g3           [b]     1

Here is the problem: this code takes forever to run in real life. I have more than 20 million different users, and there are approximately 10 different groups to fill. So this approach is really non-viable.

Is there a more efficient way to do this? I'm willing to take a sub-optimal solution if necessary, i.e. one that assigns almost-best users to every group, if that makes sense.

Not exactly an answer, but it got too long for a comment.

Sorting a 20 million row dataset shouldn't take that long, and everything after it should run in linear time. I have a hunch the deletions are what gets really expensive, specifically the lines df = df.loc[...]. Imagine you have 20M users, each occurring twice, so 40M rows. Each user will be deleted once. If each user deletion scans the entire DataFrame, that's 20M deletions over an average of 20M remaining rows, i.e. about 4 * 10^14 operations.

You can implement the same algorithm without any deletions, in O(1) time per row scanned. Simply keep an "assigned" bit for each user (in lower-level languages you'd have a boolean array). When you assign a user, set its bit to 1. For each row, check that the group has remaining spots and the user is unassigned. Now no deletions are necessary; rows with assigned users will be skipped naturally.

Sorry I'm not fluent enough in Python to provide code.
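
A minimal sketch of this idea in plain Python could look like the following (illustrative only: the assign_greedy name and the integer encoding of users and groups are assumptions made here, and the rows are assumed to be pre-sorted by score, descending):

def assign_greedy(rows, slots, n_users):
    # rows: iterable of (user, group, score) tuples, sorted by score descending
    # slots: dict or list mapping group -> remaining capacity
    # n_users: number of distinct users
    assigned = [False] * n_users      # one "assigned" bit per user
    out = []
    for user, group, score in rows:
        if assigned[user]:            # user already placed: skip this row
            continue
        if slots[group] > 0:          # group still has a free spot
            slots[group] -= 1
            assigned[user] = True
            out.append((user, group, score))
    return out

Each row is inspected once in constant time, so the whole pass is linear in the number of rows.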

This is what I would do:

# One row per user, one column per group, 0 where a user has no score
df = df.pivot_table(index='user', columns='group', values='score').reset_index().fillna(0)
final = {}
# Total score of each user across all groups
df['sum'] = df.loc[:, 'g1':].sum(axis=1)
# Normalize: how relevant each user is to a group, relative to all groups
for group in members.keys():
    df[group] = df[group] / df['sum']
# Fill the groups one by one with the most "oriented" users left
for group in members.keys():
    df = df.sort_values(group, ascending=False)
    final[group] = list(df.head(members[group])['user'])
    df = df.iloc[members[group]:, :]
final

Output:

{'g1': ['f', 'h', 'd'], 'g2': ['g', 'c'], 'g3': ['b']}

Explanation: For each user, I'm calculating how relevant they are to each group in comparison to all the groups. Then every group gets the users who are the most oriented towards it; I remove those users and do the same with the remaining groups. For example, f only has a score in g1, so its normalized g1 score is 1.0, which is why f tops g1's list.

Below is an attempt following Cătălin Frâncu's suggestion (using numpy instead of pandas).

  • in OP

This is a simplified version that shows the dispatching according to your requirement.

  • in test

The ref array is accessed directly (instead of going through a mapping like user_id in OP).

I have not sorted by score (which is of little interest here).

The dispatching seems to slow down around 9M, most likely because all the users have been dispatched by then.

n_users = 1e5 takes around 3 s; for 1e7 I don't know, I quit before it finished.


def OP():
    # groups[g] = remaining slots for group g (index 0 unused)
    groups = [0, 3, 2, 1]
    users = ['a','b','c','d','e','f','g','h']
    # map each user to an index into the ref array
    user_id = {u: i for i, u in enumerate(users)}
    # ref[i] = True once user i has been assigned
    ref = [False] * len(users)
    # (user, group, score) triples, already sorted by score descending
    entries = [
        ('h',1,'0.98'), ('e',1,'0.98'), ('d',1,'0.90'), ('c',2,'0.80'),
        ('c',1,'0.78'), ('e',3,'0.60'), ('a',1,'0.54'), ('b',3,'0.53'),
        ('g',2,'0.51'), ('d',3,'0.51'), ('a',2,'0.47'), ('h',2,'0.28'),
        ('f',1,'0.27'), ('g',1,'0.25'), ('b',1,'0.02'), ('b',2,'0.02'),
        ('c',3,'0.01'),
    ]

    out = []
    for u, g, s in entries:
        if ref[user_id[u]]:      # user already assigned
            continue
        if groups[g] > 0:        # group still has free slots
            groups[g] -= 1
            out.append((u, g, s))
            ref[user_id[u]] = True

    print(out)
    #[('h', 1, '0.98'), ('e', 1, '0.98'), ('d', 1, '0.90'), ('c', 2, '0.80'), ('b', 3, '0.53'), ('g', 2, '0.51')]

def test():
    import numpy as np
    n_users = int(1e7)
    n_groups = 10
    # remaining capacity per group
    groups = [3,1e6,1e7,1e6,1e6,1e6,1e6,1e6,1e6,1e6]

    print('allocating array')
    N = n_users * n_groups
    # N random (score, user, group) rows
    dscores = np.random.random((N,1))
    dusers = np.random.randint(0, n_users, (N,1))
    dgroups = np.random.randint(0, n_groups, (N,1))

    print('building ref')
    # ref[u] = 1 once user u has been dispatched
    ref = np.zeros(n_users, dtype=int)

    print('hstack')
    entries = np.hstack((dusers, dgroups, dscores))

    print('dispatching')
    out = np.zeros((n_users, 3))
    z = 0
    counter = 0
    for e in entries:
        counter += 1
        if counter % 1e6 == 0:
            print('ccc', counter)
        u,g,s = e
        u = int(u)
        g = int(g)
        if ref[u] == 1:
            continue
        if groups[g] > 0:
            groups[g]-=1
            out[z][0] = u
            out[z][1] = g
            out[z][2] = s
            ref[u] = 1
            z += 1
            if z % 1e5==0:
                print('z : ', z)
    print('done')

OP()
test()


I am by no means an expert at Numba, and this might be slower. But I have, in the past, had success writing complex algorithms using Numba and loops. If you have a lot of data you might need to change int8 to a bigger datatype.

import pandas as pd
import numpy as np
import numba

# Basic setup:
users = ['a','b','c','d','e','f','g','h', 'a','b','c','g','h', 'b','c','d','e']
groups = ['g1']*8 + ['g2']*5 + ['g3']*4
scores = [0.54, 0.02, 0.78, 0.9 , 0.98, 0.27, 0.25, 0.98, 0.47, 0.02, 0.8, 0.51, 0.28, 0.53, 0.01, 0.51, 0.6 ]
df = pd.DataFrame({'user': users,
                   'group': groups,
                   'score': scores}).sort_values('score', ascending=False)

# Convert user, groups and limits to numbers:
df['user'] = df.user.astype('category')
df['group'] = df.group.astype('category')
df['usercat'] = df.user.cat.codes
df['groupcat'] = df.group.cat.codes

member_mapping_temp = dict(enumerate(df['group'].cat.categories))

members = {'g1': 3,
           'g2': 2,
           'g3': 1}

member_map = np.array([(x,members.get(y)) for x,y in member_mapping_temp.items()])

# Define numba njit function to solve problem:
from numba import types
from numba.typed import Dict

@numba.njit()
def calc_scores(numpy_array, member_map):
    # capacity limit and current fill level per group code
    member_map_limits = Dict.empty(
        key_type=types.int8,
        value_type=types.int8,
    )
    member_count = Dict.empty(
        key_type=types.int8,
        value_type=types.int8,
    )
    member_list = []
    for ix in range(len(member_map)):
        group = member_map[ix, 0]
        limit = member_map[ix, 1]
        member_map_limits[group] = limit
        member_count[group] = 0

    seen_users = set()

    for ix in range(len(numpy_array)):
        user = numpy_array[ix, 0]
        group = numpy_array[ix, 1]
        if user in seen_users:        # user already assigned
            continue
        if member_map_limits[group] == member_count[group]:  # group full
            continue
        member_count[group] = member_count[group] + 1
        member_list.append((group, user))
        seen_users.add(user)

    return member_list

# Call function:
res = calc_scores(df[['usercat','groupcat']].to_numpy(), member_map)

# Add result to DF
res = pd.DataFrame(res, columns=['group','member'])

# Map back to values
res['group'] = pd.Categorical.from_codes(codes=res['group'], dtype=df['group'].dtype)
res['member'] = pd.Categorical.from_codes(codes=res['member'], dtype=df['user'].dtype)
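
To sanity-check the result against the expected output in the question, res can be aggregated the same way (a small usage sketch; observed=True merely drops unused categories from the output):

# Same aggregation as in the question, for comparison
summary = res.groupby('group', observed=True).agg({'member': ['unique', 'count']})
print(summary)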

Please let me know if this is any faster on the real dataset.
