
Pandas dataframe change values in a column based on conditions

I have a large DataFrame, shown below:

The data used as an example here, "education_val.csv", can be found at https://github.com/ENLK/Py-Projects-/blob/master/education_val.csv

import pandas as pd 

edu = pd.read_csv('education_val.csv')
del edu['Unnamed: 0']
edu.head(10)

ID  Year    Education
22445   1991    higher education
29925   1991    No qualifications
76165   1991    No qualifications
223725  1991    Other
280165  1991    intermediate qualifications
333205  1991    No qualifications
387605  1991    higher education
541285  1991    No qualifications
541965  1991    No qualifications
599765  1991    No qualifications

The values in the Education column are:

edu.Education.value_counts()

intermediate qualifications 153705
higher education    67020
No qualifications   55842
Other   36915

I would like to replace the values in the Education column as follows:

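  1. If an ID has the value higher education in the Education column in some year, then all later years for that ID should also have higher education in the Education column.

  2. If an ID has the value intermediate qualifications in some year, then all later years for that ID should have intermediate qualifications in the Education column. However, if higher education appears in any later year for that ID, higher education replaces intermediate qualifications from that year onwards, regardless of whether Other or No qualifications occur.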

For example, in the DataFrame below, ID 22445 has the value higher education in 1991, so all subsequent Education values for 22445 should be replaced with higher education in the later years, up to 2017.

edu.loc[edu['ID'] == 22445]

ID  Year    Education
22445   1991    higher education
22445   1992    higher education
22445   1993    higher education
22445   1994    higher education
22445   1995    higher education
22445   1996    intermediate qualifications
22445   1997    intermediate qualifications
22445   1998    Other
22445   1999    No qualifications
22445   2000    intermediate qualifications
22445   2001    intermediate qualifications
22445   2002    intermediate qualifications
22445   2003    intermediate qualifications
22445   2004    intermediate qualifications
22445   2005    intermediate qualifications
22445   2006    intermediate qualifications
22445   2007    intermediate qualifications
22445   2008    intermediate qualifications
22445   2010    intermediate qualifications
22445   2011    intermediate qualifications
22445   2012    intermediate qualifications
22445   2013    intermediate qualifications
22445   2014    intermediate qualifications
22445   2015    intermediate qualifications
22445   2016    intermediate qualifications
22445   2017    intermediate qualifications

Similarly, ID 1587125 in the DataFrame below has intermediate qualifications in 1991, which changes to higher education in 1993. All subsequent Education values for 1587125 in future years (from 1993 onwards) should be higher education.

edu.loc[edu['ID'] == 1587125]

ID  Year    Education
1587125 1991    intermediate qualifications
1587125 1992    intermediate qualifications
1587125 1993    higher education
1587125 1994    higher education
1587125 1995    higher education
1587125 1996    higher education
1587125 1997    higher education
1587125 1998    higher education
1587125 1999    higher education
1587125 2000    higher education
1587125 2001    higher education
1587125 2002    higher education
1587125 2003    higher education
1587125 2004    Other
1587125 2005    No qualifications
1587125 2006    intermediate qualifications
1587125 2007    intermediate qualifications
1587125 2008    intermediate qualifications
1587125 2010    intermediate qualifications
1587125 2011    higher education
1587125 2012    higher education
1587125 2013    higher education
1587125 2014    higher education
1587125 2015    higher education
1587125 2016    higher education
1587125 2017    higher education

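There are 12,057 unique IDs in the data, and the Year column runs from 1991 to 2017. How can I change the Education values for all 12,057 IDs according to the conditions above? I am not sure how to do this in a uniform way for all unique IDs. The sample data used as an example here is available at the GitHub link above. Many thanks in advance.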

You can do this using categorical data, like this:

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/ENLK/Py-Projects-/master/education_val.csv')

eddtype = pd.CategoricalDtype(['No qualifications', 
                               'Other',
                               'intermediate qualifications',
                               'higher education'], 
                               ordered=True)
df['EducationCat'] = df['Education'].str.strip().astype(eddtype)

df['EduMax'] = df.sort_values('Year').groupby('ID')['EducationCat']\
                 .transform(lambda x: eddtype.categories[x.cat.codes.cummax()] )

It is broken down explicitly so you can see the data manipulations I'm using.

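  1. Create an ordered categorical dtype for the education levels
  2. Next, change the dtype of the Education column to that categorical dtype (EducationCat)
  3. Use the category codes to do a cummax calculation
  4. Return the category defined by the cummax calculation via indexing (EduMax)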

Output:

df[df['ID'] == 1587125]

            ID  Year                    Education                 EducationCat                       EduMax
18      1587125  1991  intermediate qualifications  intermediate qualifications  intermediate qualifications
12075   1587125  1992  intermediate qualifications  intermediate qualifications  intermediate qualifications
24132   1587125  1993             higher education             higher education             higher education
36189   1587125  1994             higher education             higher education             higher education
48246   1587125  1995             higher education             higher education             higher education
60303   1587125  1996             higher education             higher education             higher education
72360   1587125  1997             higher education             higher education             higher education
84417   1587125  1998             higher education             higher education             higher education
96474   1587125  1999             higher education             higher education             higher education
108531  1587125  2000             higher education             higher education             higher education
120588  1587125  2001             higher education             higher education             higher education
132645  1587125  2002             higher education             higher education             higher education
144702  1587125  2003             higher education             higher education             higher education
156759  1587125  2004                        Other                        Other             higher education
168816  1587125  2005            No qualifications            No qualifications             higher education
180873  1587125  2006  intermediate qualifications  intermediate qualifications             higher education
192930  1587125  2007  intermediate qualifications  intermediate qualifications             higher education
204987  1587125  2008  intermediate qualifications  intermediate qualifications             higher education
217044  1587125  2010  intermediate qualifications  intermediate qualifications             higher education
229101  1587125  2011             higher education             higher education             higher education
241158  1587125  2012             higher education             higher education             higher education
253215  1587125  2013             higher education             higher education             higher education
265272  1587125  2014             higher education             higher education             higher education
277329  1587125  2015             higher education             higher education             higher education
289386  1587125  2016             higher education             higher education             higher education
301443  1587125  2017             higher education             higher education             higher education
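
To see what the category codes and the cummax step do in isolation, here is a minimal sketch on a made-up Series. The toy values and the names `s`, `codes`, `running_max` and `result` are assumptions for illustration only, not part of the dataset or of the answer's code.

```python
import pandas as pd

# Ordered categorical dtype: position in the list defines the rank (0..3)
eddtype = pd.CategoricalDtype(
    ['No qualifications', 'Other',
     'intermediate qualifications', 'higher education'],
    ordered=True,
)

# Toy education history for one hypothetical person, already in year order
s = pd.Series(
    ['intermediate qualifications', 'higher education', 'Other',
     'No qualifications', 'intermediate qualifications'],
    dtype=eddtype,
)

codes = s.cat.codes            # integer rank of each label
running_max = codes.cummax()   # highest rank seen so far

# Index the category labels by the running maximum to get labels back
result = eddtype.categories[running_max.to_numpy()]
print(result.tolist())
```

One thing to keep in mind with this approach: a value that is not among the categories gets code -1, and indexing with -1 would pick the last category, so it is worth confirming there are no missing or unexpected labels first.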

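Clearly, education levels have an order. Your problem can be restated as a "rolling max" problem: what is the highest level of education a person has attained as of a given year?

Try this: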

# A dictionary mapping each label to a rank
mappings = {e: i for i, e in enumerate(['No qualifications', 'Other', 'intermediate qualifications', 'higher education'])}

# Convert the label to its rank
edu['Education'] = edu['Education'].map(mappings)

# The gist of the solution: an expanding max level of education per person
tmp = edu.sort_values('Year').groupby('ID')['Education'].expanding().max()

# The first index level in tmp is the ID, the second level is the original index
# We only need the original index, hence the droplevel
# We also convert the rank back to the label (swapping keys and values in the mappings dictionary)
tmp = tmp.droplevel(0).map({v: k for k, v in mappings.items()})

edu['Education'] = tmp

Test:

edu[edu['ID'] == 1587125]

    ID  Year                    Education
1587125  1991  intermediate qualifications
1587125  1992  intermediate qualifications
1587125  1993             higher education
1587125  1994             higher education
1587125  1995             higher education
1587125  1996             higher education
1587125  1997             higher education
1587125  1998             higher education
1587125  1999             higher education
1587125  2000             higher education
1587125  2001             higher education
1587125  2002             higher education
1587125  2003             higher education
1587125  2004             higher education
1587125  2005             higher education
1587125  2006             higher education
1587125  2007             higher education
1587125  2008             higher education
1587125  2010             higher education
1587125  2011             higher education
1587125  2012             higher education
1587125  2013             higher education
1587125  2014             higher education
1587125  2015             higher education
1587125  2016             higher education
1587125  2017             higher education
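
As a possible variation on the same idea, the running maximum can also be taken with `groupby(...).cummax()`, which keeps the original row index and so avoids the `droplevel` step. This is a sketch on made-up rows, not the code from the answer above: the frame `toy`, the column names `rank` and `EduMax`, and the trailing space in one label are all assumptions for illustration. The `.str.strip()` call is defensive, borrowed from the categorical answer.

```python
import pandas as pd

levels = ['No qualifications', 'Other',
          'intermediate qualifications', 'higher education']
rank = {label: i for i, label in enumerate(levels)}

# Made-up rows for two hypothetical IDs, deliberately not in year order
toy = pd.DataFrame({
    'ID':   [1, 1, 1, 2, 2, 2],
    'Year': [1993, 1991, 1992, 1991, 1992, 1993],
    'Education': ['Other', 'intermediate qualifications', 'higher education',
                  'No qualifications', 'intermediate qualifications ', 'Other'],
})

# Strip stray whitespace so every label is found in the mapping
toy['rank'] = toy['Education'].str.strip().map(rank)

# Running maximum of the rank within each ID, taken in year order
running = toy.sort_values('Year').groupby('ID')['rank'].cummax()

# Assignment aligns on the original index, so the row order of `toy` is kept
toy['EduMax'] = running.map(dict(enumerate(levels)))
print(toy[['ID', 'Year', 'Education', 'EduMax']])
```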

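You can iterate over the IDs, and then over the years. The DataFrame is in chronological order, so if a person has "higher education" or "intermediate qualifications" in one cell, you can save that knowledge and apply it to the subsequent cells: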

# assumes edu keeps its default unique index and each ID's rows are in year order
for _, rows in edu.groupby('ID'):
    # booleans to keep track of education statuses we've seen
    higher_ed = False
    inter_qual = False

    for idx, value in rows['Education'].items():
        # record the status found in the original cell
        if value == 'higher education':
            higher_ed = True
        elif value == 'intermediate qualifications':
            inter_qual = True

        # carry forward intermediate qualifications
        if inter_qual:
            edu.at[idx, 'Education'] = 'intermediate qualifications'

        # carry forward higher education; set last so it takes precedence
        if higher_ed:
            edu.at[idx, 'Education'] = 'higher education'

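It doesn't matter that we may overwrite each status more than once. If a person has both "intermediate qualifications" and "higher education", we just need to make sure "higher education" is set last.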

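I generally don't recommend using for loops to process a DataFrame, but each cell value may depend on the values above it, and the DataFrame isn't so large that a loop is infeasible.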
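
Whichever approach is used, the result can be sanity-checked by confirming that, within each ID, the education level never drops from one year to the next. The following is a rough sketch of such a check. The helper name `never_decreases` and the toy frames are hypothetical, and the ranking assumes the same ordering of levels used in the other answers.

```python
import pandas as pd

levels = ['No qualifications', 'Other',
          'intermediate qualifications', 'higher education']
rank = {label: i for i, label in enumerate(levels)}

def never_decreases(frame):
    """Return True if, within every ID, the education rank never drops over the years."""
    ordered = (frame.assign(rank=frame['Education'].map(rank))
                    .sort_values(['ID', 'Year']))
    # diff() within each ID; the first row of each ID is NaN, and NaN < 0 is False
    drops = ordered.groupby('ID')['rank'].diff() < 0
    return not drops.any()

# Made-up data for one hypothetical ID, before and after the replacement
before = pd.DataFrame({
    'ID': [1, 1, 1],
    'Year': [1991, 1992, 1993],
    'Education': ['intermediate qualifications', 'higher education', 'Other'],
})
after = before.assign(
    Education=['intermediate qualifications', 'higher education', 'higher education']
)

print(never_decreases(before), never_decreases(after))
```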

