How to plot correlation between two columns

The task is the following:

Is there a correlation between the age of an athlete and his result at the Olympics in the entire dataset?

Each athlete has a name, age, and medal (gold, silver, bronze or NA).

In my opinion, it is necessary to count all athletes of the same age and calculate the percentage of them who have any kind of medal ( data.Medal.notnull() ). The graph should show all ages on the x-axis and the percentage of athletes with any medal on the y-axis. How can I get this data and create the graph with pandas and matplotlib?

For instance, given data like the following table:

Name Age Medal
Name1 20 Silver 
Name2 21 NA
Name3 20 NA
Name4 22 Bronze
Name5 22 NA
Name6 21 NA
Name7 20 Gold
Name8 19 Silver
Name9 20 Gold
Name10 20 NA
Name11 21 Silver

The result should be (in the graph):

19 - 100%
20 - 60%
21 - 33%
22 - 50%

First, turn df.Medal into 1s for a medal and 0s for NaN values using np.where.

import pandas as pd
import numpy as np

data = {'Name': {0: 'Name1', 1: 'Name2', 2: 'Name3', 3: 'Name4', 4: 'Name5', 
                 5: 'Name6', 6: 'Name7', 7: 'Name8', 8: 'Name9', 9: 'Name10', 
                 10: 'Name11'}, 
        'Age': {0: 20, 1: 21, 2: 20, 3: 22, 4: 22, 5: 21, 6: 20, 7: 19, 8: 20, 
                9: 20, 10: 21}, 
        'Medal': {0: 'Silver', 1: np.nan, 2: np.nan, 3: 'Bronze', 4: np.nan, 
                  5: np.nan, 6: 'Gold', 7: 'Silver', 8: 'Gold', 9: np.nan, 
                  10: 'Silver'}}

df = pd.DataFrame(data)

df.Medal = np.where(df.Medal.notna(),1,0)

print(df)

      Name  Age  Medal
0    Name1   20      1
1    Name2   21      0
2    Name3   20      0
3    Name4   22      1
4    Name5   22      0
5    Name6   21      0
6    Name7   20      1
7    Name8   19      1
8    Name9   20      1
9   Name10   20      0
10  Name11   21      1
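
As a side note, once the medal column is reduced to "has a medal / has no medal", the percentage per age is simply a group mean. The following is a minimal sketch, assuming the same sample data and column names as above; the name medal_share is illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Same sample data as above, rebuilt compactly so this snippet runs on its own.
df = pd.DataFrame({
    "Age": [20, 21, 20, 22, 22, 21, 20, 19, 20, 20, 21],
    "Medal": ["Silver", np.nan, np.nan, "Bronze", np.nan, np.nan,
              "Gold", "Silver", "Gold", np.nan, "Silver"],
})

# notna() yields True/False; the mean of booleans is the fraction of True values,
# so grouping by Age gives the share of medalists per age.
medal_share = df["Medal"].notna().groupby(df["Age"]).mean().mul(100).round(1)
print(medal_share)
```

The resulting Series is indexed by age, so it can be passed straight to a bar plot if the seaborn aggregation is not wanted.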

Now you could plot the data, for example, as follows:

import seaborn as sns
import matplotlib.ticker as mtick

sns.set_theme()
ax = sns.barplot(data=df, x=df.Age, y=df.Medal, errorbar=None)

# in versions prior to `seaborn 0.12` use
# `ax = sns.barplot(data=df, x=df.Age, y=df.Medal, ci=None)`

ax.yaxis.set_major_formatter(mtick.PercentFormatter(xmax=1.0))

# adding labels
ax.bar_label(ax.containers[0], 
             labels=[f'{round(v*100,2)}%' for v in ax.containers[0].datavalues])

Result:

[Image: bar chart]

Incidentally, if you want to calculate these percentages yourself, one option is to use pd.crosstab:

percentages = pd.crosstab(df.Age,df.Medal, normalize='index')\
    .rename(columns={1:'percentages'})['percentages']
print(percentages)

Age
19    1.000000
20    0.600000
21    0.333333
22    0.500000
Name: percentages, dtype: float64

So, with matplotlib, you could also do something like:

import matplotlib.pyplot as plt

# share of medalists per age, scaled to 0-100
percentages = pd.crosstab(df.Age,df.Medal, normalize='index')\
    .rename(columns={1:'percentages'})['percentages'].mul(100)

my_cmap = plt.get_cmap("viridis")
rescale = lambda y: (y - np.min(y)) / (np.max(y) - np.min(y))

fig, ax = plt.subplots()

ax.bar(x=percentages.index.astype(str), 
             height=percentages.to_numpy(), 
             color=my_cmap(rescale(percentages.to_numpy())))

ax.yaxis.set_major_formatter(mtick.PercentFormatter())
ax.bar_label(ax.containers[0], fmt='%.1f%%')
    
plt.show()

Result:

[Image: bar chart]
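
The plots above show the medal share per age but do not give a single correlation figure. If one is wanted, a possible approach (not part of the original answer) is the Pearson correlation between Age and a 0/1 medal indicator, which for a binary variable is the same as the point-biserial correlation. This is only a sketch on the 11-row sample; the names has_medal and r are illustrative, and a sample this small says nothing reliable about the full dataset.

```python
import numpy as np
import pandas as pd

# Same sample data as above, rebuilt so this snippet runs on its own.
df = pd.DataFrame({
    "Age": [20, 21, 20, 22, 22, 21, 20, 19, 20, 20, 21],
    "Medal": ["Silver", np.nan, np.nan, "Bronze", np.nan, np.nan,
              "Gold", "Silver", "Gold", np.nan, "Silver"],
})

# 1 if the athlete has any medal, 0 otherwise.
has_medal = df["Medal"].notna().astype(int)

# Series.corr uses the Pearson method by default.
r = df["Age"].corr(has_medal)
print(f"Pearson r between Age and medal indicator: {r:.3f}")
```

A value near 0 would suggest little linear relationship; note that a linear coefficient can miss a non-monotonic pattern (for example, a peak at mid-range ages), which is one reason the per-age bar chart is still worth looking at.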
