
How to aggregate group metrics and plot data with pandas

I want a pie chart that compares the age groups of the people who survived. The problem is that I don't know how to count people of the same age. As you can see at the bottom of the screenshot, the result has 142 rows, but there are 891 people in the dataset.

import pandas as pd
import seaborn as sns  # for test data only

# load test data from seaborn
df_t = sns.load_dataset('titanic')

# capitalize the column headers to match code used below
df_t.columns = df_t.columns.str.title()

dft = df_t.groupby(['Age', 'Survived']).size().reset_index(name='count')

def get_num_people_by_age_category(dft):
    dft["age_group"] = pd.cut(x=dft['Age'], bins=[0,18,60,100], labels=["young","middle_aged","old"])
    return dft

# Call function
dft = get_num_people_by_age_category(dft)
print(dft)

output

[Image: enter image description here]

Calling df_t.groupby(['Age', 'Survived']).size().reset_index(name='count') creates a dataframe with one row per combination of age and survival status, which is why the result has far fewer rows than there are passengers.

To get the counts per age group, add an "age group" column to the original dataframe first. In the next step, groupby can use that "age group" column instead of the raw age.
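The difference between grouping on the raw age and grouping on an age bin can be seen on a few rows. The following is a minimal sketch on hypothetical toy data (the `toy` frame and its values are made up for illustration and are not the Titanic dataset); `observed=True` is passed only to keep unobserved category combinations out of the result.

```python
import pandas as pd

# Hypothetical toy data standing in for the Age and Survived columns
toy = pd.DataFrame({
    "Age": [4, 4, 17, 30, 30, 45, 70],
    "Survived": [1, 0, 1, 0, 0, 1, 0],
})

# Grouping on the raw age: one row per distinct (Age, Survived) pair
per_age = toy.groupby(["Age", "Survived"]).size().reset_index(name="count")

# Binning each passenger first, then grouping on the bin
toy["age_group"] = pd.cut(toy["Age"], bins=[0, 18, 60, 100],
                          labels=["young", "middle_aged", "old"])
per_group = (toy.groupby(["age_group", "Survived"], observed=True)
                .size()
                .reset_index(name="count"))

print(per_age)
print(per_group)
```

In both results the `count` column sums to the number of passengers; only the number of rows differs.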

from matplotlib import pyplot as plt
import seaborn as sns  # to load the titanic dataset
import pandas as pd

df_t = sns.load_dataset('titanic')
df_t["age_group"] = pd.cut(x=df_t['age'], bins=[0, 18, 60, 100], labels=["young", "middle aged", "old"])

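# observed=False is spelled out to keep every age group / survived combination
# and to avoid the FutureWarning that newer pandas versions emit for categorical groupers
df_per_age = df_t.groupby(['age_group', 'survived'], observed=False).size().reset_index(name='count')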
labels = [f'{age_group},\n {"survived" if survived == 1 else "not survived"}'
          for age_group, survived in df_per_age[['age_group', 'survived']].values]
labels[-1] = labels[-1].replace('\n', ' ') # remove newline for the last items as the wedges are too thin
labels[-2] = labels[-2].replace('\n', ' ')
plt.pie(df_per_age['count'], labels=labels)
plt.tight_layout()
plt.show()

[Image: pie chart of the counts per age group]
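If only the survivors should appear in the chart, as the question asks, one option is to filter before counting. This is a sketch under that assumption, again on hypothetical toy data rather than the real dataset; the `toy` frame and the name `survivor_counts` are illustrative.

```python
import pandas as pd

# Hypothetical toy data using the lowercase column names of the seaborn dataset
toy = pd.DataFrame({
    "age": [4, 17, 30, 45, 70, 8],
    "survived": [1, 1, 0, 1, 0, 1],
})

toy["age_group"] = pd.cut(x=toy["age"], bins=[0, 18, 60, 100],
                          labels=["young", "middle aged", "old"])

# Keep survivors only, then count the passengers in each age group
survivor_counts = toy.loc[toy["survived"] == 1, "age_group"].value_counts(sort=False)

print(survivor_counts)
```

The resulting Series could then be passed to plt.pie(survivor_counts, labels=survivor_counts.index) in the same way as the counts above. Age groups with no survivors show up with a count of 0 and may be worth dropping before plotting.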

  • The answer from @JohanC is great for a pie chart.
  • I think the data is better presented as a bar plot, so this is an alternative, which can be done with pandas.DataFrame.plot and kind='bar'.
  • Reshape the data with pandas.crosstab, which creates a frequency cross-tabulation table between the two factors.
  • Optionally include bar annotations using matplotlib.pyplot.bar_label.
    • See this answer for additional details about this method.
import pandas as pd
import seaborn as sns

# load data
df = sns.load_dataset('titanic')
df.columns = df.columns.str.title()

# map 0 and 1 of Survived to a string
df.Survived = df.Survived.map({0: 'Died', 1: 'Survived'})

# bin the age
df['Age Group'] = pd.cut(x=df['Age'], bins=[0, 18, 60, 100], labels=['Young', 'Middle Aged', 'Senior'])

# Calculate the counts
ct = pd.crosstab(df['Survived'], df['Age Group'])

# display(ct)
Age Group  Young  Middle Aged  Senior
Survived                             
Died          69          338      17
Survived      70          215       5

# plot
ax = ct.plot(kind='bar', rot=0, xlabel='')

# optionally add annotations
for c in ax.containers:
    ax.bar_label(c, label_type='edge')
    
# pad the spacing between the number and the edge of the figure
ax.margins(y=0.1)

[Image: enter image description here]
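For readers more used to groupby, the crosstab step above can also be written as groupby, size and unstack. The sketch below compares the two on hypothetical toy data (the `toy` frame is made up and is not the Titanic dataset); it is meant to show what the frequency table contains, not to replace the code above.

```python
import pandas as pd

# Hypothetical toy data with the same two factors as the answer above
toy = pd.DataFrame({
    "Survived": ["Died", "Survived", "Survived", "Died", "Died", "Survived", "Died"],
    "Age Group": ["Young", "Young", "Young", "Middle Aged", "Middle Aged",
                  "Middle Aged", "Senior"],
})

# Frequency table: one row per survival status, one column per age group
ct = pd.crosstab(toy["Survived"], toy["Age Group"])

# The same counts via groupby; unstack moves the age groups into columns
by_groupby = toy.groupby(["Survived", "Age Group"]).size().unstack(fill_value=0)

print(ct)
print(by_groupby)
```

Because the toy columns are plain strings, both tables order the age groups alphabetically; with the categorical column produced by pd.cut, the category order is used instead.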

