
Why am I unable to add existing data to a new dataframe column?

I have a dataset containing the names, genders, and salaries of 250 employees. I am trying to create a new dataframe that simply 'extracts' the salaries for males and females respectively. This dataframe would have 2 columns, one with Male Salaries and another with Female Salaries.

From this dataframe, I would like to create a side-by-side boxplot with matplotlib to analyse whether there is a gender wage gap.

# Import libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.read_csv("TMA_Data.csv")
df.head()

#Filter out female employees
df_female = df[(df.Gender == "F")]
df_female.head()

#Filter out male employees
df_male = df[(df.Gender == "M ")]
df_male.head()

#Create new dataframe with only salaries
df2 = pd.DataFrame(columns = ["Male Salaries", "Female Salaries"])
print(df2)

#Assign Male Salaries column
df2["Male Salaries"] = df_male["Salary"]
df2.head() #This works

Output: 
    Male Salaries   Female Salaries
3   93046   NaN
7   66808   NaN
10  46998   NaN
16  74312   NaN
17  50178   NaN

#Assign Female Salaries column (THIS IS WHERE THE PROBLEM LIES)
df2["Female Salaries"] = df_female["Salary"]
df2.head()

Output:
    Male Salaries   Female Salaries
3   93046   NaN
7   66808   NaN
10  46998   NaN
16  74312   NaN
17  50178   NaN

Why am I unable to add the values for female salaries (nothing seems to be added)? Also, given that my eventual goal is to create two side-by-side boxplots, feel free to suggest a completely different way of doing this. Thank you very much!

Edit: Dataset preview: [image: Dataset contents]

Solution:

Use .reset_index:

df2 = pd.DataFrame(columns = ["Male Salaries", "Female Salaries"])
df2["Male Salaries"] = df_male["Salary"].reset_index(drop=True)
df2["Female Salaries"] = df_female["Salary"].reset_index(drop=True)
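One caveat with the snippet above: once the first column is assigned, df2 takes on that column's index, so if there are more female than male employees the extra female rows are silently dropped. A possible alternative, sketched here with made-up data standing in for TMA_Data.csv (the values and the `salaries` name are assumptions, not from the question), is to build the frame with pd.concat, which keeps every row and pads the shorter column with NaN:

```python
import pandas as pd

# Hypothetical stand-in for TMA_Data.csv: 2 male and 3 female employees.
df = pd.DataFrame({
    "Gender": ["F", "M", "F", "F", "M"],
    "Salary": [50000, 62000, 58000, 71000, 45000],
})

# Select each group's salaries and discard the original row labels.
male = df.loc[df["Gender"] == "M", "Salary"].reset_index(drop=True)
female = df.loc[df["Gender"] == "F", "Salary"].reset_index(drop=True)

# Concatenate side by side; `keys` supplies the column names.
# The result has as many rows as the longer group, with NaN padding the shorter one.
salaries = pd.concat(
    [male, female], axis=1, keys=["Male Salaries", "Female Salaries"]
)
print(salaries)
```

If the two groups happen to have the same size, this gives the same result as the reset_index assignment.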

Explanation:

When you assign a Series to a dataframe column, pandas aligns the values on index labels, not on position. Each value lands in the row whose label matches its own.

Your Male and Female indices are different, since they come from different rows of the initial dataframe. After the first assignment, df2 carries the male row labels, so none of the female labels match and every value in that column ends up as NaN.
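A minimal sketch of the situation in the question, using a small made-up dataframe in place of TMA_Data.csv (the data, and the `shared` and `all_nan` names, are assumptions for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the CSV from the question.
df = pd.DataFrame({
    "Gender": ["F", "M", "F", "F", "M"],
    "Salary": [50000, 62000, 58000, 71000, 45000],
})

df_male = df[df["Gender"] == "M"]      # keeps row labels 1 and 4
df_female = df[df["Gender"] == "F"]    # keeps row labels 0, 2 and 3

# The two filtered frames share no row labels.
shared = df_male.index.intersection(df_female.index)

df2 = pd.DataFrame(columns=["Male Salaries", "Female Salaries"])
df2["Male Salaries"] = df_male["Salary"]      # df2 now has the male row labels
df2["Female Salaries"] = df_female["Salary"]  # aligned on labels: nothing matches

all_nan = df2["Female Salaries"].isna().all()
print(df2)
print("Female column is all NaN:", all_nan)
```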

Example:

df = pd.DataFrame([[1], [2], [3]])
df
    0
0   1
1   2
2   3

Works as you would expect:

df[1] = [4, 5, 6]
df
    0   1
0   1   4
1   2   5
2   3   6

Does NOT work as you would expect:

df[2] = pd.Series([4, 5, 6], index=[1, 0, 999])
df
    0   1   2
0   1   4   5.0
1   2   5   4.0
2   3   6   NaN
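Since the eventual goal is a side-by-side boxplot, it may be simpler to skip the intermediate two-column dataframe altogether. The following is only a sketch under assumptions: the data is made up, and the column names Gender and Salary are taken from the question's code. matplotlib's boxplot accepts a list of arrays of different lengths, so the two groups do not need to be aligned at all:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for TMA_Data.csv.
df = pd.DataFrame({
    "Gender": ["F", "M", "F", "F", "M"],
    "Salary": [50000, 62000, 58000, 71000, 45000],
})

# One array of salaries per gender; the groups may have different sizes.
groups = {
    gender: grp["Salary"].to_numpy() for gender, grp in df.groupby("Gender")
}

fig, ax = plt.subplots()
ax.boxplot(list(groups.values()))            # boxes are drawn at x = 1, 2, ...
ax.set_xticks(range(1, len(groups) + 1))
ax.set_xticklabels(list(groups.keys()))      # label each box with its gender
ax.set_ylabel("Salary")
# Call plt.show() to display the figure outside a notebook.
```

pandas also offers df.boxplot(column="Salary", by="Gender"), which produces a similar plot in one call.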
