Pandas dataframe change values in a column based on conditions
I have a large DataFrame, shown below.
The data used as the example here, 'education_val.csv', can be found at https://github.com/ENLK/Py-Projects-/blob/master/education_val.csv
import pandas as pd
edu = pd.read_csv('education_val.csv')
del edu['Unnamed: 0']
edu.head(10)
ID Year Education
22445 1991 higher education
29925 1991 No qualifications
76165 1991 No qualifications
223725 1991 Other
280165 1991 intermediate qualifications
333205 1991 No qualifications
387605 1991 higher education
541285 1991 No qualifications
541965 1991 No qualifications
599765 1991 No qualifications
The values in the column Education are:
edu.Education.value_counts()
intermediate qualifications 153705
higher education 67020
No qualifications 55842
Other 36915
I want to replace the values in the column Education in the following ways:
If an ID has the value higher education in a year in the column Education, then all future years for that ID will also have higher education in the Education column.
If an ID has the value intermediate qualifications in a year, then all future years for that ID will have intermediate qualifications in the corresponding Education column. However, if the value higher education occurs in any of the subsequent years for this ID, then higher education replaces intermediate qualifications in the subsequent years, regardless of whether Other or No qualifications occur.
For example, in the DataFrame below, ID 22445 has the value higher education in the year 1991, so all subsequent values of Education for 22445 should be replaced with higher education in the later years, up to the year 2017.
edu.loc[edu['ID'] == 22445]
ID Year Education
22445 1991 higher education
22445 1992 higher education
22445 1993 higher education
22445 1994 higher education
22445 1995 higher education
22445 1996 intermediate qualifications
22445 1997 intermediate qualifications
22445 1998 Other
22445 1999 No qualifications
22445 2000 intermediate qualifications
22445 2001 intermediate qualifications
22445 2002 intermediate qualifications
22445 2003 intermediate qualifications
22445 2004 intermediate qualifications
22445 2005 intermediate qualifications
22445 2006 intermediate qualifications
22445 2007 intermediate qualifications
22445 2008 intermediate qualifications
22445 2010 intermediate qualifications
22445 2011 intermediate qualifications
22445 2012 intermediate qualifications
22445 2013 intermediate qualifications
22445 2014 intermediate qualifications
22445 2015 intermediate qualifications
22445 2016 intermediate qualifications
22445 2017 intermediate qualifications
Similarly, ID 1587125 in the DataFrame below has the value intermediate qualifications in the year 1991, and changes to higher education in 1993. All subsequent values in the column Education in the future years (from 1993 onwards) for 1587125 should be higher education.
edu.loc[edu['ID'] == 1587125]
ID Year Education
1587125 1991 intermediate qualifications
1587125 1992 intermediate qualifications
1587125 1993 higher education
1587125 1994 higher education
1587125 1995 higher education
1587125 1996 higher education
1587125 1997 higher education
1587125 1998 higher education
1587125 1999 higher education
1587125 2000 higher education
1587125 2001 higher education
1587125 2002 higher education
1587125 2003 higher education
1587125 2004 Other
1587125 2005 No qualifications
1587125 2006 intermediate qualifications
1587125 2007 intermediate qualifications
1587125 2008 intermediate qualifications
1587125 2010 intermediate qualifications
1587125 2011 higher education
1587125 2012 higher education
1587125 2013 higher education
1587125 2014 higher education
1587125 2015 higher education
1587125 2016 higher education
1587125 2017 higher education
There are 12,057 unique IDs in the data and the column Year spans from 1991 to 2017. How does one change the values of Education for all 12,057 IDs according to the above conditions? I'm not sure how to do this in a uniform way for all unique IDs. The sample data used as the example here is attached in the GitHub link above. Many thanks in advance.
You can do it using categorical data like this:
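import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/ENLK/Py-Projects-/master/education_val.csv')
# Ordered categorical: later entries rank higher
eddtype = pd.CategoricalDtype(['No qualifications',
                               'Other',
                               'intermediate qualifications',
                               'higher education'],
                              ordered=True)
# Strip stray whitespace before converting the labels to the ordered dtype
df['EducationCat'] = df['Education'].str.strip().astype(eddtype)
# Running maximum of the category codes per ID, in year order, mapped back to labels
df['EduMax'] = df.sort_values('Year').groupby('ID')['EducationCat']\
    .transform(lambda x: eddtype.categories[x.cat.codes.cummax()])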
I have broken it up explicitly so you can see the data manipulations I am using.
Output:
df[df['ID'] == 1587125]
ID Year Education EducationCat EduMax
18 1587125 1991 intermediate qualifications intermediate qualifications intermediate qualifications
12075 1587125 1992 intermediate qualifications intermediate qualifications intermediate qualifications
24132 1587125 1993 higher education higher education higher education
36189 1587125 1994 higher education higher education higher education
48246 1587125 1995 higher education higher education higher education
60303 1587125 1996 higher education higher education higher education
72360 1587125 1997 higher education higher education higher education
84417 1587125 1998 higher education higher education higher education
96474 1587125 1999 higher education higher education higher education
108531 1587125 2000 higher education higher education higher education
120588 1587125 2001 higher education higher education higher education
132645 1587125 2002 higher education higher education higher education
144702 1587125 2003 higher education higher education higher education
156759 1587125 2004 Other Other higher education
168816 1587125 2005 No qualifications No qualifications higher education
180873 1587125 2006 intermediate qualifications intermediate qualifications higher education
192930 1587125 2007 intermediate qualifications intermediate qualifications higher education
204987 1587125 2008 intermediate qualifications intermediate qualifications higher education
217044 1587125 2010 intermediate qualifications intermediate qualifications higher education
229101 1587125 2011 higher education higher education higher education
241158 1587125 2012 higher education higher education higher education
253215 1587125 2013 higher education higher education higher education
265272 1587125 2014 higher education higher education higher education
277329 1587125 2015 higher education higher education higher education
289386 1587125 2016 higher education higher education higher education
301443 1587125 2017 higher education higher education higher education
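To make the intermediate steps easier to follow, here is a minimal sketch of the same idea on a small made-up frame. The IDs, years and the helper columns `Code` and `CodeMax` are assumptions for illustration only and are not taken from education_val.csv.

```python
import pandas as pd

# Toy data (hypothetical IDs and years, not taken from education_val.csv)
toy = pd.DataFrame({
    'ID':   [1, 1, 1, 1, 2, 2, 2],
    'Year': [1991, 1992, 1993, 1994, 1991, 1992, 1993],
    'Education': ['Other', 'intermediate qualifications', 'No qualifications',
                  'higher education', 'higher education', 'Other',
                  'intermediate qualifications'],
})

eddtype = pd.CategoricalDtype(['No qualifications', 'Other',
                               'intermediate qualifications', 'higher education'],
                              ordered=True)

cat = toy['Education'].astype(eddtype)
# Integer rank of each label: 0 = 'No qualifications' ... 3 = 'higher education'
toy['Code'] = cat.cat.codes
# Running maximum of the rank within each ID, visited in chronological order
toy['CodeMax'] = toy.sort_values('Year').groupby('ID')['Code'].cummax()
# Translate the ranks back into labels
toy['EduMax'] = pd.Categorical.from_codes(toy['CodeMax'], dtype=eddtype)
print(toy)
```

For ID 1 the codes 1, 2, 0, 3 become 1, 2, 2, 3, so the 'No qualifications' row in 1993 is reported as 'intermediate qualifications'. For ID 2 every year stays at 'higher education'.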
There clearly is an order in the level of education. Your problem can be restated as a "rolling max" problem (more precisely, an expanding or cumulative max): what is the highest level of education a person has as of a certain year?
Try this:
# A dictionary mapping each label to a rank
mappings = {e: i for i, e in enumerate(['No qualifications', 'Other', 'intermediate qualifications', 'higher education'])}
# Convert the label to its rank
edu['Education'] = edu['Education'].map(mappings)
# The gist of the solution: an expanding max level of education per person
tmp = edu.sort_values('Year').groupby('ID')['Education'].expanding().max()
# The first index level in tmp is the ID, the second level is the original index
# We only need the original index, hence the droplevel
# We also convert the rank back to the label (swapping keys and values in the mappings dictionary)
tmp = tmp.droplevel(0).map({v: k for k, v in mappings.items()})
edu['Education'] = tmp
Test:
edu[edu['ID'] == 1587125]
ID Year Education
1587125 1991 intermediate qualifications
1587125 1992 intermediate qualifications
1587125 1993 higher education
1587125 1994 higher education
1587125 1995 higher education
1587125 1996 higher education
1587125 1997 higher education
1587125 1998 higher education
1587125 1999 higher education
1587125 2000 higher education
1587125 2001 higher education
1587125 2002 higher education
1587125 2003 higher education
1587125 2004 higher education
1587125 2005 higher education
1587125 2006 higher education
1587125 2007 higher education
1587125 2008 higher education
1587125 2010 higher education
1587125 2011 higher education
1587125 2012 higher education
1587125 2013 higher education
1587125 2014 higher education
1587125 2015 higher education
1587125 2016 higher education
1587125 2017 higher education
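One side effect of ranking all four labels is that the expanding max also carries 'Other' forward over a later 'No qualifications', which the question does not explicitly ask for. If that is unwanted, a possible variant is to rank only the two labels that must persist and leave everything else untouched. The sketch below uses made-up toy data, and the names `rank`, `labels` and `best` are assumptions for illustration.

```python
import pandas as pd

# Toy data (hypothetical, not taken from education_val.csv)
toy = pd.DataFrame({
    'ID':   [1, 1, 1, 1, 1],
    'Year': [1991, 1992, 1993, 1994, 1995],
    'Education': ['Other', 'No qualifications', 'intermediate qualifications',
                  'Other', 'higher education'],
})

# Only the two labels that the question says must persist get a positive rank
rank = {'intermediate qualifications': 1, 'higher education': 2}
labels = {v: k for k, v in rank.items()}

toy = toy.sort_values(['ID', 'Year'])
# Rank 0 means "nothing to carry forward yet"
best = (toy['Education'].map(rank).fillna(0).astype(int)
        .groupby(toy['ID']).cummax())
# Keep the original label wherever no qualifying level has been seen so far
toy['Education'] = best.map(labels).fillna(toy['Education'])
print(toy)
```

Here the first two rows keep 'Other' and 'No qualifications' as they are, while the 'Other' in 1994 is replaced by 'intermediate qualifications'.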
You could iterate through the IDs, then through the years. The DataFrame is ordered chronologically, so if a person has 'higher education' or 'intermediate qualifications' in a cell, you can save this knowledge and apply it to subsequent cells:
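# Sort so that each ID's rows are visited in chronological order
edu = edu.sort_values(['ID', 'Year'])
new_values = []
for _, rows in edu.groupby('ID', sort=False):
    # booleans to keep track of education statuses we've seen
    higher_ed = False
    inter_qual = False
    for original in rows['Education']:
        value = original
        # check for intermediate qualifications
        if inter_qual:
            value = 'intermediate qualifications'
        elif original == 'intermediate qualifications':
            inter_qual = True
        # check for higher education (compare against the original value,
        # because `value` may already have been overwritten above)
        if higher_ed:
            value = 'higher education'
        elif original == 'higher education':
            higher_ed = True
        new_values.append(value)
# The groups are visited in the same order as the sorted rows,
# so the collected values line up with the DataFrame
edu['Education'] = new_values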
It doesn't matter that we're potentially overwriting each status more than once: if a person has both 'intermediate qualifications' and 'higher education', we only need to be sure that 'higher education' is set last.
I would normally not suggest using a for loop to process a DataFrame, but each cell value might rely on values above it, and the DataFrame isn't so large as to make this infeasible.
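Whichever approach is used, it may be worth checking the result against the conditions in the question. The helper below is a minimal sketch of such a check. `satisfies_rule` is a hypothetical name, and the two small frames are made-up data for illustration.

```python
import pandas as pd

def satisfies_rule(frame):
    """Return True if, within each ID, no row falls below the highest
    persistent level ('intermediate qualifications' or 'higher education')
    seen in the same or an earlier year."""
    rank = {'intermediate qualifications': 1, 'higher education': 2}
    ordered = frame.sort_values(['ID', 'Year'])
    r = ordered['Education'].map(rank).fillna(0)
    # Highest rank seen in earlier or current rows of the same ID
    seen = r.groupby(ordered['ID']).cummax()
    # The rule holds when every row already sits at the running maximum
    return bool((r == seen).all())

# Toy frames (hypothetical data)
good = pd.DataFrame({
    'ID': [1, 1, 1, 1],
    'Year': [1991, 1992, 1993, 1994],
    'Education': ['Other', 'intermediate qualifications',
                  'intermediate qualifications', 'higher education'],
})
bad = pd.DataFrame({
    'ID': [1, 1],
    'Year': [1991, 1992],
    'Education': ['higher education', 'intermediate qualifications'],
})
print(satisfies_rule(good), satisfies_rule(bad))
```

The check only looks at the two levels the question says must persist, so it does not care whether 'Other' and 'No qualifications' were carried forward.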