
Pandas dataframe change values in a column based on conditions

I have a large DataFrame below:

The data used as the example here, 'education_val.csv', can be found at https://github.com/ENLK/Py-Projects-/blob/master/education_val.csv

import pandas as pd 

edu = pd.read_csv('education_val.csv')
del edu['Unnamed: 0']
edu.head(10)

ID  Year    Education
22445   1991    higher education
29925   1991    No qualifications
76165   1991    No qualifications
223725  1991    Other
280165  1991    intermediate qualifications
333205  1991    No qualifications
387605  1991    higher education
541285  1991    No qualifications
541965  1991    No qualifications
599765  1991    No qualifications

The values in the column Education are:

edu.Education.value_counts()

intermediate qualifications 153705
higher education    67020
No qualifications   55842
Other   36915

I want to replace the values in the column Education in the following ways:

  1. If an ID has the value higher education in a year in the column Education, then all future years for that ID will also have higher education in the Education column.

  2. If an ID has the value intermediate qualifications in a year, then all future years for that ID will have intermediate qualifications in the corresponding Education column. However, if the value higher education occurs in any of the subsequent years for this ID, then higher education replaces intermediate qualifications from that year onwards, regardless of whether Other or No qualifications occur.

For example, in the DataFrame below, ID 22445 has the value higher education in the year 1991, so all subsequent values of Education for 22445 should be replaced with higher education in the later years, up to the year 2017.

edu.loc[edu['ID'] == 22445]

ID  Year    Education
22445   1991    higher education
22445   1992    higher education
22445   1993    higher education
22445   1994    higher education
22445   1995    higher education
22445   1996    intermediate qualifications
22445   1997    intermediate qualifications
22445   1998    Other
22445   1999    No qualifications
22445   2000    intermediate qualifications
22445   2001    intermediate qualifications
22445   2002    intermediate qualifications
22445   2003    intermediate qualifications
22445   2004    intermediate qualifications
22445   2005    intermediate qualifications
22445   2006    intermediate qualifications
22445   2007    intermediate qualifications
22445   2008    intermediate qualifications
22445   2010    intermediate qualifications
22445   2011    intermediate qualifications
22445   2012    intermediate qualifications
22445   2013    intermediate qualifications
22445   2014    intermediate qualifications
22445   2015    intermediate qualifications
22445   2016    intermediate qualifications
22445   2017    intermediate qualifications

Similarly, ID 1587125 in the DataFrame below has the value intermediate qualifications in the year 1991, which changes to higher education in 1993. All subsequent values in the column Education (from 1993 onwards) for 1587125 should be higher education.

edu.loc[edu['ID'] == 1587125]

ID  Year    Education
1587125 1991    intermediate qualifications
1587125 1992    intermediate qualifications
1587125 1993    higher education
1587125 1994    higher education
1587125 1995    higher education
1587125 1996    higher education
1587125 1997    higher education
1587125 1998    higher education
1587125 1999    higher education
1587125 2000    higher education
1587125 2001    higher education
1587125 2002    higher education
1587125 2003    higher education
1587125 2004    Other
1587125 2005    No qualifications
1587125 2006    intermediate qualifications
1587125 2007    intermediate qualifications
1587125 2008    intermediate qualifications
1587125 2010    intermediate qualifications
1587125 2011    higher education
1587125 2012    higher education
1587125 2013    higher education
1587125 2014    higher education
1587125 2015    higher education
1587125 2016    higher education
1587125 2017    higher education

There are 12,057 unique IDs in the data, and the column Year spans from 1991 to 2017. How does one change the values of Education for all 12,057 IDs according to the above conditions? I'm not sure how to do this in a uniform way for all unique IDs. The sample data used as the example here is available at the GitHub link above. Many thanks in advance.

You can do it using categorical data like this:

df = pd.read_csv('https://raw.githubusercontent.com/ENLK/Py-Projects-/master/education_val.csv')

eddtype = pd.CategoricalDtype(['No qualifications', 
                               'Other',
                               'intermediate qualifications',
                               'higher education'], 
                               ordered=True)
df['EducationCat'] = df['Education'].str.strip().astype(eddtype)

df['EduMax'] = df.sort_values('Year').groupby('ID')['EducationCat']\
                 .transform(lambda x: eddtype.categories[x.cat.codes.cummax()] )

I have broken it up explicitly so you can see the data manipulations I am using.

  1. Create an ordered categorical dtype for Education, listing the levels from lowest to highest
  2. Next, change the dtype of the Education column to that categorical dtype (EducationCat)
  3. Use the integer codes of the categorical to perform a cummax calculation within each ID
  4. Index into the categories with the result of the cummax calculation to get back the labels (EduMax)
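
The four steps above can be sketched on a tiny example. This is a minimal illustration, assuming made-up toy rows (they are not taken from education_val.csv) and using `pd.Categorical.from_codes` for step 4 instead of indexing into the categories; the names `toy`, `levels` and `edu_dtype` are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical, not taken from education_val.csv)
toy = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "Year": [1991, 1992, 1993, 1991, 1992, 1993],
    "Education": ["intermediate qualifications", "higher education", "Other",
                  "No qualifications", "Other", "No qualifications"],
})

# Step 1: ordered categorical dtype, lowest level first
levels = ["No qualifications", "Other",
          "intermediate qualifications", "higher education"]
edu_dtype = pd.CategoricalDtype(levels, ordered=True)

# Step 2: sort chronologically within each ID and convert the column
toy = toy.sort_values(["ID", "Year"])
cat = toy["Education"].astype(edu_dtype)

# Step 3: running maximum of the integer codes within each ID
max_codes = cat.cat.codes.groupby(toy["ID"]).cummax()

# Step 4: turn the codes back into labels
toy["EduMax"] = pd.Categorical.from_codes(max_codes.to_numpy(), dtype=edu_dtype)
print(toy)
```

Note that a label missing from `levels` would become NaN with code -1, so it is worth checking the column for unexpected values before relying on the codes.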

Output:

df[df['ID'] == 1587125]

            ID  Year                    Education                 EducationCat                       EduMax
18      1587125  1991  intermediate qualifications  intermediate qualifications  intermediate qualifications
12075   1587125  1992  intermediate qualifications  intermediate qualifications  intermediate qualifications
24132   1587125  1993             higher education             higher education             higher education
36189   1587125  1994             higher education             higher education             higher education
48246   1587125  1995             higher education             higher education             higher education
60303   1587125  1996             higher education             higher education             higher education
72360   1587125  1997             higher education             higher education             higher education
84417   1587125  1998             higher education             higher education             higher education
96474   1587125  1999             higher education             higher education             higher education
108531  1587125  2000             higher education             higher education             higher education
120588  1587125  2001             higher education             higher education             higher education
132645  1587125  2002             higher education             higher education             higher education
144702  1587125  2003             higher education             higher education             higher education
156759  1587125  2004                        Other                        Other             higher education
168816  1587125  2005            No qualifications            No qualifications             higher education
180873  1587125  2006  intermediate qualifications  intermediate qualifications             higher education
192930  1587125  2007  intermediate qualifications  intermediate qualifications             higher education
204987  1587125  2008  intermediate qualifications  intermediate qualifications             higher education
217044  1587125  2010  intermediate qualifications  intermediate qualifications             higher education
229101  1587125  2011             higher education             higher education             higher education
241158  1587125  2012             higher education             higher education             higher education
253215  1587125  2013             higher education             higher education             higher education
265272  1587125  2014             higher education             higher education             higher education
277329  1587125  2015             higher education             higher education             higher education
289386  1587125  2016             higher education             higher education             higher education
301443  1587125  2017             higher education             higher education             higher education

There clearly is an order in the level of education. Your problem can be restated as a "rolling max" problem: what is the highest level of education a person has reached as of a certain year?

Try this:

# A dictionary mapping each label to a rank
mappings = {e: i for i, e in enumerate(['No qualifications', 'Other', 'intermediate qualifications', 'higher education'])}

# Convert the label to its rank
edu['Education'] = edu['Education'].map(mappings)

# The gist of the solution: an expanding max level of education per person
tmp = edu.sort_values('Year').groupby('ID')['Education'].expanding().max()

# The first index level in tmp is the ID, the second level is the original index
# We only need the original index, hence the droplevel
# We also convert the rank back to the label (swapping keys and values in the mappings dictionary)
tmp = tmp.droplevel(0).map({v: k for k, v in mappings.items()})

edu['Education'] = tmp

Test:

edu[edu['ID'] == 1587125]

    ID  Year                    Education
1587125  1991  intermediate qualifications
1587125  1992  intermediate qualifications
1587125  1993             higher education
1587125  1994             higher education
1587125  1995             higher education
1587125  1996             higher education
1587125  1997             higher education
1587125  1998             higher education
1587125  1999             higher education
1587125  2000             higher education
1587125  2001             higher education
1587125  2002             higher education
1587125  2003             higher education
1587125  2004             higher education
1587125  2005             higher education
1587125  2006             higher education
1587125  2007             higher education
1587125  2008             higher education
1587125  2010             higher education
1587125  2011             higher education
1587125  2012             higher education
1587125  2013             higher education
1587125  2014             higher education
1587125  2015             higher education
1587125  2016             higher education
1587125  2017             higher education
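
The same idea can also be written with a per-group `cummax`, which keeps the original index and so needs no `droplevel`. The sketch below is a variation on the approach above, not the code that produced the output shown; it uses a hypothetical, deliberately unsorted toy frame and writes to a new column (`EducationMax`, an assumed name) to show that the result aligns back by index.

```python
import pandas as pd

levels = ["No qualifications", "Other",
          "intermediate qualifications", "higher education"]
rank = {label: i for i, label in enumerate(levels)}

# Toy data (hypothetical), rows deliberately out of chronological order
toy = pd.DataFrame({
    "ID": [7, 7, 7, 7],
    "Year": [1993, 1991, 1992, 1994],
    "Education": ["Other", "intermediate qualifications",
                  "higher education", "No qualifications"],
})

# Sort chronologically within each ID before taking the running maximum
ordered = toy.sort_values(["ID", "Year"])
running = ordered["Education"].map(rank).groupby(ordered["ID"]).cummax()

# Map ranks back to labels; assignment aligns on the original index
toy["EducationMax"] = running.map(dict(enumerate(levels)))
print(toy)
```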

You could iterate through the IDs, then through the years. The DataFrame is ordered chronologically, so if a person has 'higher education' or 'intermediate qualifications' in a cell, you can save this knowledge and apply it to subsequent cells:

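# edu is the DataFrame loaded in the question
# make sure each ID's rows are contiguous and in chronological order
edu = edu.sort_values(['ID', 'Year'])

new_values = []
for _, rows in edu.groupby('ID', sort=False):
    # booleans to keep track of education statuses we've seen
    higher_ed = False
    inter_qual = False

    for value in rows['Education']:
        # record what this year's original value tells us
        if value == 'higher education':
            higher_ed = True
        elif value == 'intermediate qualifications':
            inter_qual = True

        # apply the best status seen so far; 'higher education' is
        # checked first so that it takes precedence
        if higher_ed:
            value = 'higher education'
        elif inter_qual:
            value = 'intermediate qualifications'
        new_values.append(value)

# new_values follows the row order of the sorted frame
edu['Education'] = new_values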

If a person has both 'intermediate qualifications' and 'higher education', 'higher education' has to take precedence, so make sure it is the status that ends up being applied.

I would normally not suggest using a for loop to process a DataFrame, but each cell value might rely on values above it, and the DataFrame isn't so large as to make this infeasible.
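
Whichever approach is used, the result can be sanity-checked. The helper below is a hedged sketch (the function name `follows_rules` and the toy frames are hypothetical): it treats 'No qualifications' and 'Other' as the same lowest tier, since the question only constrains the two higher levels, and verifies that the tier never drops over time within an ID.

```python
import pandas as pd

# Tiers implied by the question's two rules (an assumption of this sketch)
TIER = {"No qualifications": 0, "Other": 0,
        "intermediate qualifications": 1, "higher education": 2}

def follows_rules(df: pd.DataFrame) -> bool:
    """Return True if, for every ID, the education tier never drops over time."""
    ordered = df.assign(_tier=df["Education"].map(TIER)).sort_values(["ID", "Year"])
    # diff() within each ID; the first row of each ID is NaN and compares False
    drops = ordered.groupby("ID")["_tier"].diff() < 0
    return not drops.any()

# Toy frames (hypothetical) to exercise the check
good = pd.DataFrame({"ID": [1, 1, 1], "Year": [1991, 1992, 1993],
                     "Education": ["Other", "intermediate qualifications",
                                   "higher education"]})
bad = pd.DataFrame({"ID": [1, 1, 1], "Year": [1991, 1992, 1993],
                    "Education": ["higher education", "Other",
                                  "higher education"]})
print(follows_rules(good), follows_rules(bad))
```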

