I have a dataset containing the names, genders and salaries of 250 employees. I am trying to create a new dataframe to simply 'extract' the salaries for males and females respectively. This dataframe would have 2 columns, one with Male Salaries and another with Female Salaries.
From this dataframe, I would like to create a side by side boxplot with matplotlib to analyse if there is any gender wage gap.
# Import libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.read_csv("TMA_Data.csv")
df.head()
#Filter out female employees
df_female = df[(df.Gender == "F")]
df_female.head()
#Filter out male employees
df_male = df[(df.Gender == "M ")]
df_male.head()
#Create new dataframe with only salaries
df2 = pd.DataFrame(columns = ["Male Salaries", "Female Salaries"])
print(df2)
#Assign Male Salaries column
df2["Male Salaries"] = df_male["Salary"]
df2.head() #This works
Output:
    Male Salaries  Female Salaries
3           93046              NaN
7           66808              NaN
10          46998              NaN
16          74312              NaN
17          50178              NaN
#Assign Female Salaries column (THIS IS WHERE THE PROBLEM LIES)
df2["Female Salaries"] = df_female["Salary"]
df2.head()
Output:
    Male Salaries  Female Salaries
3           93046              NaN
7           66808              NaN
10          46998              NaN
16          74312              NaN
17          50178              NaN
How come I am unable to add the values for female salaries (nothing seems to be added)? Also, given that my eventual goal is to create two side-by-side boxplots, feel free to suggest if I can do this in a completely different way. Thank you very much!
Edit: Dataset preview: [image: Dataset contents]
Use .reset_index:
df2 = pd.DataFrame(columns = ["Male Salaries", "Female Salaries"])
df2["Male Salaries"] = df_male["Salary"].reset_index(drop=True)
df2["Female Salaries"] = df_female["Salary"].reset_index(drop=True)
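One caveat with the snippet above: because df2 starts out empty, it takes its index from the first column assigned, so if there are more women than men the extra female salaries would be dropped without warning. A minimal sketch that avoids this with pd.concat is below. The toy data is made up and only stands in for TMA_Data.csv, and it uses "M" without the trailing space.

```python
import pandas as pd

# Toy data standing in for TMA_Data.csv (values are made up for illustration)
df = pd.DataFrame({
    "Gender": ["F", "M", "F", "F", "M"],
    "Salary": [50000, 60000, 55000, 52000, 58000],
})

# Select each group's salaries and discard the original row labels
male = df.loc[df["Gender"] == "M", "Salary"].reset_index(drop=True)
female = df.loc[df["Gender"] == "F", "Salary"].reset_index(drop=True)

# concat aligns on the fresh 0..n-1 indices and keeps every row,
# padding the shorter column with NaN
df2 = pd.concat({"Male Salaries": male, "Female Salaries": female}, axis=1)
print(df2)
```

The shorter column ends in NaN, which pandas plotting and summary methods skip by default.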
When you assign a Series to a dataframe column, pandas aligns the values by index label, not by position. Your Male and Female indices are different, since they came from different rows of the initial dataframe, so none of the female labels match the rows already in df2 and the column is filled with NaN.
df = pd.DataFrame([[1], [2], [3]])
df
   0
0  1
1  2
2  3
Assigning a plain list works as you would expect:
df[1] = [4, 5, 6]
df
   0  1
0  1  4
1  2  5
2  3  6
Assigning a Series with its own index does not work as you would expect:
df[2] = pd.Series([4, 5, 6], index=[1, 0, 999])
df
   0  1    2
0  1  4  5.0
1  2  5  4.0
2  3  6  NaN
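As for the eventual goal of side-by-side boxplots, the intermediate dataframe is not strictly needed: matplotlib's boxplot accepts a list of arrays, and they may have different lengths. Here is a rough sketch, again with made-up data in place of TMA_Data.csv and assuming the gender codes are exactly "M" and "F":

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data standing in for TMA_Data.csv (values are made up for illustration)
df = pd.DataFrame({
    "Gender": ["F", "M", "F", "F", "M", "M", "F"],
    "Salary": [50000, 60000, 55000, 52000, 58000, 61000, 49000],
})

# Select the salaries for each group; the index does not matter here
male = df.loc[df["Gender"] == "M", "Salary"]
female = df.loc[df["Gender"] == "F", "Salary"]

fig, ax = plt.subplots()
# One box per array in the list, drawn at x positions 1 and 2
result = ax.boxplot([male.to_numpy(), female.to_numpy()])
ax.set_xticks([1, 2])
ax.set_xticklabels(["Male", "Female"])
ax.set_ylabel("Salary")
```

Call plt.show() afterwards to display the figure. If you would rather let pandas do the grouping, df.boxplot(column="Salary", by="Gender") should give a similar plot in one line.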