Stacked Bar Plot By Group Count On Pandas Python

Question

My csv data looks something like the one provided below. I wanted to create a stack bar plot with pandas/python where each bar represent male and female portions with two colors and on the top of the bar it shows the total count of both male and female taking the drug(in my case). For instance, for the Age of 20 fall total of 7 people and 6 of them are male and 1 is female so on the bar plot there should be 7 on the top of the bar and this 6:1 portion is shown in the bar with two colors. I managed to group the people according to their age count and plot it but I wanted to show the bar with two genders on different colors as well. Any help will be appreciated . Thank you.

Drug_ID,Age,Gender
FF90E5F780133BBCAEE9BA0E4654C5CA,15,F
2AB2529352E6A1D0BA91E167C5191231,17,M
6E4B2C80EA83C5B51675BC077A182E06,19,M
8AEB19A68DE4C301154973E0C87C774D,19,M
FA533E7DD1FCAB83822D4A2FC514AEE7,20,F
208F39485A8FBA7214CA004B53CC29B7,20,M
3172B95E8A5732D2EDB089A354787612,20,M
38FBAE52AAB04E56AB3A35C4AFBF3813,20,M
62D8A33130574C984EAF3A2E80C243A6,20,M
646AF35E192885FE1336649BA3735FC4,20,M
778D2B1354F30ED3D3BDD7B9437CF670,20,M
891312933FE5FB25703B3E958EF943E3,21,F
5CC63DFC5EF399F65CB9BC583A770DFB,21,M
97C9ED5750EC6FF618A059085F0BF4A6,21,M
C6F03C5E3CC37249E0308A09292F5A58,21,M
0713C427BDC3A90F291AF49599987F38,23,F
518EBD35FCDAB7C744334F993D8CEB28,23,F
A3FC185C4531AFF1E8B7B9F5985028CB,23,F
1837406921314CB182FB0C7BC5565204,23,M
CA0E6274BD39E8DE5C823F4E6F234252,23,M
E2F64030BB011C11541EDC192BAED09A,23,M
EF8F3D028C5759858AB7574864833015,23,M
5DCD427F26E05CC1C3F565BB05EAE10B,24,F
8646ED503722C3C6C6B44208EF1A5202,24,F
F9F45112C472282778E1F437F54B0B70,24,F
0CF4F20B89303CB4C03CF3BD3B27CAF0,24,M
63727039DFF77A46995DA8EDBC4E3497,24,M

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

data = pd.read_csv('/home/MedResearch/Desktop/drug_ids_age_gender.csv')
data.values

df = pd.DataFrame(data)
df2 = pd.merge(df1,df,  left_index = True, right_index = True)
temp1 = df2.groupby('Age').Age.count()

df3 = pd.merge(df1,df,  left_index = True, right_index = True)
temp2 = df3.groupby('Gender').Age.count()

ax = temp1.plot(kind='bar',stacked = False, colormap = 'Paired')
for p in ax.patches:
    ax.annotate(np.round(p.get_height(),decimals=0), (p.get_x()+p.get_width()/2.,   p.get_height()), ha='center', va='center', xytext=(2, 10), textcoords='offset points')
plt.ylabel('Numbers Of Patients Taking the drug')
plt.title('Age Distribution Of Patients Based Upon drug Intake')
plt.show()

Got something like this as a result:

Answer 1

This questions come back often, so I decided to write a step by step explanation. Note that I'm not a pandas guru, so there are things that could probably be optimized.

I started by generating getting a list of ages that I will use for my x-axis:

cvsdata = '''Drug_ID,Age,Gender
FF90E5F780133BBCAEE9BA0E4654C5CA,15,F
2AB2529352E6A1D0BA91E167C5191231,17,M
6E4B2C80EA83C5B51675BC077A182E06,19,M
8AEB19A68DE4C301154973E0C87C774D,19,M
FA533E7DD1FCAB83822D4A2FC514AEE7,20,F
208F39485A8FBA7214CA004B53CC29B7,20,M
3172B95E8A5732D2EDB089A354787612,20,M
38FBAE52AAB04E56AB3A35C4AFBF3813,20,M
62D8A33130574C984EAF3A2E80C243A6,20,M
646AF35E192885FE1336649BA3735FC4,20,M
778D2B1354F30ED3D3BDD7B9437CF670,20,M
891312933FE5FB25703B3E958EF943E3,21,F
5CC63DFC5EF399F65CB9BC583A770DFB,21,M
97C9ED5750EC6FF618A059085F0BF4A6,21,M
C6F03C5E3CC37249E0308A09292F5A58,21,M
0713C427BDC3A90F291AF49599987F38,23,F
518EBD35FCDAB7C744334F993D8CEB28,23,F
A3FC185C4531AFF1E8B7B9F5985028CB,23,F
1837406921314CB182FB0C7BC5565204,23,M
CA0E6274BD39E8DE5C823F4E6F234252,23,M
E2F64030BB011C11541EDC192BAED09A,23,M
EF8F3D028C5759858AB7574864833015,23,M
5DCD427F26E05CC1C3F565BB05EAE10B,24,F
8646ED503722C3C6C6B44208EF1A5202,24,F
F9F45112C472282778E1F437F54B0B70,24,F
0CF4F20B89303CB4C03CF3BD3B27CAF0,24,M
63727039DFF77A46995DA8EDBC4E3497,24,M'''

df = pd.read_csv(StringIO.StringIO(cvsdata))
ages = df.Age.unique()

array([15, 17, 19, 20, 21, 23, 24])

Then I generated a grouped dataframe with the counts of each M and F per age:

counts = df.groupby(['Age','Gender']).count()
print counts

            Drug_ID
Age Gender         
15  F             1
17  M             1
19  M             2
20  F             1
    M             6
21  F             1
    M             3
23  F             3
    M             4
24  F             3
    M             2

Using that, I can easily calculate the total number of individual per age group:

totals = counts.sum(level=0)
print totals

     Drug_ID
Age         
15         1
17         1
19         2
20         7
21         4
23         7
24         5

To prepare for plotting, I'll transform my counts dataframe to separate each sex by columns, instead of by index. Here I also drop that 'Drug_ID' column name because the unstack() operation creates a MultiIndex and it's much easier to manipulate the dataframe without that MultiIndex.

counts = counts.unstack(level=1)
counts.columns = counts.columns.droplevel(level=0)
print counts

Gender    F    M
Age             
15      1.0  NaN
17      NaN  1.0
19      NaN  2.0
20      1.0  6.0
21      1.0  3.0
23      3.0  4.0
24      3.0  2.0

Looks pretty good. I'll just do a final refinement and replace the NaN by 0.

counts = counts.fillna(0)
print counts

Gender    F    M
Age             
15      1.0  0.0
17      0.0  1.0
19      0.0  2.0
20      1.0  6.0
21      1.0  3.0
23      3.0  4.0
24      3.0  2.0

With this dataframe, it is trivial to plot the stacked bars:

plt.bar(ages, counts['M'], bottom=None, color='blue', label='M')
plt.bar(ages, counts['F'], bottom=counts['M'], color='pink', label='F')
plt.legend()
plt.xlabel('Ages')
plt.ylabel('Count')

To plot the total counts on top of the bars, we'll use the annotate() function. We cannot do it in one single pass, instead we'll loop through the ages and the totals (for simplicity sake, I take the values and flatten() them because they're not quite in the right format, not exactly sure why here)

for age,tot in zip(ages,totals.values.flatten()):
    plt.annotate('N={:d}'.format(tot), xy=(age+0.4, tot), xytext=(0,5), textcoords='offset points', ha='center', va='bottom')

the coordinates for the annotations are (age+0.4, tot) because the bars go from x to x+width with width=0.8 by default, and therefore x+0.4 is the center of the bar, while tot is of course the full height of the bar. To offset the text a bit, I offset the text by a few (5) points in the y direction. Adjust according to your liking.

Check out the documentation for bar() to adjust the parameters of the bar plots. Check out the documentation for annotate() to customize your annotations

Stacked Bar Plot By Group Count On Pandas Python

Question

1 answers

solution1
5 ACCPTED 2016-11-24 18:20:20

Stacked Bar Plot By Group Count On Pandas Python

Question

1 answers

solution1 5 ACCPTED 2016-11-24 18:20:20

solution1
5 ACCPTED 2016-11-24 18:20:20