Pandas - how to add multiple conditional columns to dataframe?

Question

So what I'm trying to do is create several calculations based on conditions and add them to a dataframe. Here is some code as an example:

data = pd.DataFrame({'Project':  ['Project 1', 'Project 2', ' Project 3', 'Project 4', 'Project 5', 'Project 6', 'Project 7', 'Project 8', 'Project 9', 'Project 10'],
        'Date': ['10/1/2020', '10/1/2020', '11/1/2020', '12/1/2020', '12/1/2020', '12/1/2020', '2/1/2021', '2/1/2021', '3/1/2021', '4/1/2021'],
       'Team': ['Team 1', 'Team 2', 'Team 3', 'Team 2', 'Team 2', 'Team 2', 'Team 1', 'Team 1', 'Team 1', 'Team 1'],
       'Timely': ['Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes']})
                                       
df1=data.groupby('Team')['Timely'].apply(lambda x: (x=='Yes').sum()).reset_index(name='Yes Count')       

df1

Running this code will give me a table that shows the count of projects with timely = Yes for each team. Say I wanted to add a second column to the table to show the total count where timely = no and a third and fourth column to show the percent yes and percent no for each team. How would I do that?

Answer 1

Seems to me like you just want a crosstab. It's more straightforward to calculate using pd.crosstab than it is using a groupby/apply approach. Though pd.crosstab isn't the most efficient method of doing this, it works just fine.

This first snippet gets us the counts of "yes" and "no" for each team. We also get a "Total" column that is the sum of the "yes" + "no":

xtab = pd.crosstab(data["Team"], data["Timely"], margins=True, margins_name="Total")

print(xtab)
Timely  No  Yes  Total
Team                  
Team 1   1    4      5
Team 2   2    2      4
Team 3   1    0      1
Total    4    6     10

Now let's calculate the percentage breakdown of "No" and "Yes" by dividing those both by our "Total" column:

norm_xtab = xtab[["No", "Yes"]].div(xtab["Total"], axis=0)

print(norm_xtab)
Timely   No  Yes
Team            
Team 1  0.2  0.8
Team 2  0.5  0.5
Team 3  1.0  0.0
Total   0.4  0.6

Then we can combine those two dataframes horizontally with the pd.concat function. We can also give names to each "subdataframe" eg one chunk of the dataframe will only be for count data, the other chunk will be for the normalized/percentage based data:

out = pd.concat([xtab, norm_xtab], keys=["count", "percent"], axis=1)
out = out.rename(columns={"No": "untimely", "Yes": "timely"}, level=1)


print(out)
          count               percent       
Timely untimely timely Total untimely timely
Team                                        
Team 1        1      4     5      0.2    0.8
Team 2        2      2     4      0.5    0.5
Team 3        1      0     1      1.0    0.0
Total         4      6    10      0.4    0.6

The result of the concatenation is a DataFrame with a multiindex as the columns. If you are unfamiliar with pd.MultiIndex, you can use this snippet to flatten your result into something you may be more familiar with:

out.columns = ["{}_{}".format(*colname) for colname in out.columns]

print(out)

        count_untimely  count_timely  count_Total  percent_untimely  percent_timely
Team                                                                               
Team 1               1             4            5               0.2             0.8
Team 2               2             2            4               0.5             0.5
Team 3               1             0            1               1.0             0.0
Total                4             6           10               0.4             0.6

To change the column names as in your comment below, you can simply run rename one more time:

out = out.rename(columns=lambda colname: colname.replace("count_", ""))

print(out)
        untimely  timely  Total  percent_untimely  percent_timely
Team                                                             
Team 1         1       4      5               0.2             0.8
Team 2         2       2      4               0.5             0.5
Team 3         1       0      1               1.0             0.0
Total          4       6     10               0.4             0.6

To convert the decimals to whole percentages, you'll need to multiply by 100, then either round to 0 decimal places OR use string formatting to trim the trailing decimals (I'll show you how to do both), and add another string formatting to get the "%" to appear.

percentages = (out
               .filter(like="percent")      # select columns that contain the string "percent"
               .mul(100)                    # Multiply everything by 100
               .round(0)                    # If you don't want to perform any rounding, get rid of this line
               .applymap("{:.0f}%".format)) # Apply string formatting to trim the decimal.

print(percentages)
       percent_untimely percent_timely
Team                                  
Team 1              20%            80%
Team 2              50%            50%
Team 3             100%             0%
Total               40%            60%

If you want these values reflect back in our out dataframe:

out.loc[:, percentages.columns] = percentages

print(out)
        untimely  timely  Total percent_untimely percent_timely
Team                                                           
Team 1         1       4      5              20%            80%
Team 2         2       2      4              50%            50%
Team 3         1       0      1             100%             0%
Total          4       6     10              40%            60%

Pandas - how to add multiple conditional columns to dataframe?

Question

1 answers

solution1
1 ACCPTED 2020-09-28 22:28:30

Pandas - how to add multiple conditional columns to dataframe?

Question

1 answers

solution1 1 ACCPTED 2020-09-28 22:28:30

solution1
1 ACCPTED 2020-09-28 22:28:30