How to count combinations within a certain group?

I have data of people logging time to certain projects on certain dates. So my table will look something like this:

ProjectID Date   memberID hours
project1  01.05  a        2
project1  01.05  b        5
project2  05.05  a        1
project2  05.05  b        2
project2  05.05  c        3
project3  07.06  a        4
project3  07.06  b        1
project3  07.06  c        2

etc.

What I now want to do is to count, for each project and for each pair of members on that project, how much time the two have worked together on projects in the past. If both worked on the same project, only the minimum of their hours should count. E.g. if member 1 worked 1 hour on the project and member 2 worked 2 hours, it should count only 1 hour, because they can't have worked together during the second hour.

For example:

ProjectID Date   memberID1 memberID2 hoursworkedtogether
project1  01.05  a         b         0
project2  05.05  a         b         2
project2  05.05  a         c         0
project2  05.05  b         c         0
project3  07.06  a         b         3
project3  07.06  b         c         2
project3  07.06  a         c         1

I've tried aggregating using pivot tables but that did not work as two project members will always be in different rows in the raw data and the pivot won't count combinations of values within the same row it seems.

One approach would be to write a simple loop and loop over all projects but I feel like there should be a more efficient option, as the table is quite large.

I am not sure if this is the fastest solution, but groupby().apply() with a list comprehension should be reasonably fast... ;-)

Group your data by ProjectID and Date and use itertools.combinations() to create all combinations of users per project. For each pair, the code below takes the minimum of the two members' hours on that project.

import pandas as pd
df = pd.DataFrame([['project1', '01.05', 'a', 2],
        ['project1', '01.05', 'b', 5],
        ['project2', '05.05', 'a', 1],
        ['project2', '05.05', 'b', 2],
        ['project2', '05.05', 'c', 3],
        ['project3', '07.06', 'a', 4],
        ['project3', '07.06', 'b', 1],
        ['project3', '07.06', 'c', 2]],
        columns=['ProjectID', 'Date', 'memberID', 'hours'])
from itertools import combinations
def calc_member_hours(project):
    # Hours per member on this project (assumes one row per member per project).
    hours = project.set_index('memberID')['hours']
    # For each pair of members, the time worked together is the smaller entry.
    data = [(m1, m2, min(hours[m1], hours[m2]))
            for m1, m2 in combinations(hours.index, 2)]
    return pd.DataFrame(data, columns=['memberID1', 'memberID2', 'hoursworkedtogether'])

result_df = df.groupby(['ProjectID', 'Date']).apply(calc_member_hours)
result_df
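Note that the grouped result above gives the overlap within each project, whereas the expected table in the question counts the hours accumulated on earlier projects. A minimal sketch of that extra step follows. It assumes the rows are already in chronological order and that each member appears once per project. The names `pairs` and `overlap` are illustrative and not part of the original answer.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame(
    [['project1', '01.05', 'a', 2],
     ['project1', '01.05', 'b', 5],
     ['project2', '05.05', 'a', 1],
     ['project2', '05.05', 'b', 2],
     ['project2', '05.05', 'c', 3],
     ['project3', '07.06', 'a', 4],
     ['project3', '07.06', 'b', 1],
     ['project3', '07.06', 'c', 2]],
    columns=['ProjectID', 'Date', 'memberID', 'hours'])

# Step 1: one row per project and member pair, holding the overlap
# (minimum of the two members' hours) within that project.
# sort=False keeps the projects in their order of appearance
# (assumption: the input is already sorted chronologically).
rows = []
for (project_id, date), project in df.groupby(['ProjectID', 'Date'], sort=False):
    hours = project.set_index('memberID')['hours']
    # Sorting the members gives every pair a consistent (m1, m2) orientation.
    for m1, m2 in combinations(sorted(hours.index), 2):
        rows.append((project_id, date, m1, m2, min(hours[m1], hours[m2])))

pairs = pd.DataFrame(
    rows, columns=['ProjectID', 'Date', 'memberID1', 'memberID2', 'overlap'])

# Step 2: running total per pair, minus the current project's own overlap,
# leaves only the hours worked together on earlier projects.
running_total = pairs.groupby(['memberID1', 'memberID2'])['overlap'].cumsum()
pairs['hoursworkedtogether'] = running_total - pairs['overlap']

print(pairs.drop(columns='overlap'))
```

On the sample data this reproduces the values in the question's expected table, e.g. 3 hours for a and b by the time of project3 (2 from project1 plus 1 from project2). If the Date strings do not already sort chronologically, they would need to be parsed and sorted first.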
