Why am I unable to add existing data to a new dataframe column?

I have a dataset containing the names, genders and salaries of 250 employees. I am trying to create a new dataframe that simply 'extracts' the salaries for males and females respectively. This dataframe would have 2 columns, one with Male Salaries and another with Female Salaries.

From this dataframe, I would like to create a side-by-side boxplot with matplotlib to analyse whether there is a gender wage gap.

# Import libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.read_csv("TMA_Data.csv")
df.head()

# Select female employees
df_female = df[(df.Gender == "F")]
df_female.head()

# Select male employees
df_male = df[(df.Gender == "M ")]
df_male.head()

#Create new dataframe with only salaries
df2 = pd.DataFrame(columns = ["Male Salaries", "Female Salaries"])
print(df2)

#Assign Male Salaries column
df2["Male Salaries"] = df_male["Salary"]
df2.head() #This works

Output:
    Male Salaries  Female Salaries
3           93046              NaN
7           66808              NaN
10          46998              NaN
16          74312              NaN
17          50178              NaN

#Assign Female Salaries column (THIS IS WHERE THE PROBLEM LIES)
df2["Female Salaries"] = df_female["Salary"]
df2.head()

Output:
    Male Salaries  Female Salaries
3           93046              NaN
7           66808              NaN
10          46998              NaN
16          74312              NaN
17          50178              NaN

Why am I unable to add the values for female salaries (nothing seems to be added)? Also, given that my eventual goal is to create two side-by-side boxplots, feel free to suggest a completely different way of doing this. Thank you very much!

Edit: Dataset preview: (image: Dataset contents)

Solution:

Use .reset_index(drop=True):

df2 = pd.DataFrame(columns = ["Male Salaries", "Female Salaries"])
df2["Male Salaries"] = df_male["Salary"].reset_index(drop=True)
df2["Female Salaries"] = df_female["Salary"].reset_index(drop=True)
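One caveat, in case the two groups differ in size: df2 takes its index from the first column assigned, and the second assignment is still aligned on that index, so any extra rows in the second group would be dropped silently. A minimal sketch of one way around this uses pd.concat. The data below is made up for illustration and is not taken from TMA_Data.csv.

```python
import pandas as pd

# Hypothetical stand-in for TMA_Data.csv: 2 males, 3 females
df = pd.DataFrame({
    "Gender": ["F", "M", "F", "M", "F"],
    "Salary": [50000, 60000, 55000, 65000, 52000],
})

# Pull out each group's salaries and renumber them 0, 1, 2, ...
male = df.loc[df["Gender"] == "M", "Salary"].reset_index(drop=True)
female = df.loc[df["Gender"] == "F", "Salary"].reset_index(drop=True)

# concat with axis=1 aligns on the union of both indices, so the shorter
# column is padded with NaN instead of the longer one being cut off
df2 = pd.concat(
    [male.rename("Male Salaries"), female.rename("Female Salaries")],
    axis=1,
)
print(df2)
```

Here the male column ends with a NaN in its third row, while all three female salaries are kept.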

Explanation:

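When you assign a Series to a dataframe column, pandas aligns the values by index label, not by position. Because df2 starts out empty, the first assignment makes it adopt the index of the male rows (3, 7, 10, ...).

Your male and female indices are different, since they came from different rows of the initial dataframe. None of the female labels exist in df2, so every value in that column ends up as NaN.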
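The behaviour from the question can be reproduced on a tiny dataset. This is only a sketch: the four rows below are hypothetical and stand in for the real CSV.

```python
import pandas as pd

# Hypothetical miniature of the dataset (values made up for illustration)
df = pd.DataFrame({
    "Gender": ["F", "M", "F", "M"],
    "Salary": [50000, 60000, 55000, 65000],
})

df_male = df[df["Gender"] == "M"]      # keeps the original row labels 1 and 3
df_female = df[df["Gender"] == "F"]    # keeps the original row labels 0 and 2

df2 = pd.DataFrame(columns=["Male Salaries", "Female Salaries"])
df2["Male Salaries"] = df_male["Salary"]      # df2 is empty, so it adopts labels 1 and 3
df2["Female Salaries"] = df_female["Salary"]  # labels 0 and 2 are not in df2, so all NaN

print(df2.index.tolist())                   # [1, 3]
print(df2["Female Salaries"].isna().all())  # True
```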

Example:

df = pd.DataFrame([[1], [2], [3]])
df
    0
0   1
1   2
2   3

Works as you would expect (a plain list is assigned by position):

df[1] = [4, 5, 6]
df
    0   1
0   1   4
1   2   5
2   3   6

Does not work as you might expect (a Series is aligned by index label):

df[2] = pd.Series([4, 5, 6], index=[1, 0, 999])
df
    0   1   2
0   1   4   5.0
1   2   5   4.0
2   3   6   NaN
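As for the side-by-side boxplots, the intermediate dataframe may not be needed at all. The following is a rough sketch, assuming the columns are named Gender and Salary as in the question; the data is made up, and the Agg backend line is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for TMA_Data.csv (values made up for illustration)
df = pd.DataFrame({
    "Gender": ["F", "M", "F", "M", "F"],
    "Salary": [50000, 60000, 55000, 65000, 52000],
})

# One array of salaries per gender; the groups may have different lengths
male = df.loc[df["Gender"] == "M", "Salary"].to_numpy()
female = df.loc[df["Gender"] == "F", "Salary"].to_numpy()

fig, ax = plt.subplots()
bp = ax.boxplot([male, female])          # one box per array, at x = 1 and x = 2
ax.set_xticklabels(["Male", "Female"])
ax.set_ylabel("Salary")
```

In an interactive session, finish with plt.show(). pandas can also draw the same kind of plot directly with df.boxplot(column="Salary", by="Gender").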
