Python: Change values in a pandas DataFrame column based on multiple conditions
I have a large DataFrame, shown below:
The data used as an example here, "education_val.csv", can be found at https://github.com/ENLK/Py-Projects-/blob/master/education_val.csv
import pandas as pd
edu = pd.read_csv('education_val.csv')
del edu['Unnamed: 0']
edu.head(10)
ID Year Education
22445 1991 higher education
29925 1991 No qualifications
76165 1991 No qualifications
223725 1991 Other
280165 1991 intermediate qualifications
333205 1991 No qualifications
387605 1991 higher education
541285 1991 No qualifications
541965 1991 No qualifications
599765 1991 No qualifications
The values in the Education column are:
edu.Education.value_counts()
intermediate qualifications 153705
higher education 67020
No qualifications 55842
Other 36915
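I want to replace the values in the Education column as follows:
If an ID has the value higher education in the Education column in some year, then all future years for that ID should also have higher education in the Education column.
If an ID has the value intermediate qualifications in some year, then all future years for that ID should have intermediate qualifications in the corresponding Education column. However, if the value higher education appears in any subsequent year for this ID, then higher education replaces intermediate qualifications from that year onward, regardless of whether Other or No qualifications occur.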
For example, in the DataFrame below, ID 22445 has the value higher education in 1991, so all subsequent Education values for 22445 should be replaced with higher education in the later years, up to 2017.
edu.loc[edu['ID'] == 22445]
ID Year Education
22445 1991 higher education
22445 1992 higher education
22445 1993 higher education
22445 1994 higher education
22445 1995 higher education
22445 1996 intermediate qualifications
22445 1997 intermediate qualifications
22445 1998 Other
22445 1999 No qualifications
22445 2000 intermediate qualifications
22445 2001 intermediate qualifications
22445 2002 intermediate qualifications
22445 2003 intermediate qualifications
22445 2004 intermediate qualifications
22445 2005 intermediate qualifications
22445 2006 intermediate qualifications
22445 2007 intermediate qualifications
22445 2008 intermediate qualifications
22445 2010 intermediate qualifications
22445 2011 intermediate qualifications
22445 2012 intermediate qualifications
22445 2013 intermediate qualifications
22445 2014 intermediate qualifications
22445 2015 intermediate qualifications
22445 2016 intermediate qualifications
22445 2017 intermediate qualifications
Similarly, ID 1587125 in the DataFrame below has intermediate qualifications in 1991 and changes to higher education in 1993. All subsequent values in the Education column for 1587125 in future years (from 1993 onward) should be higher education.
edu.loc[edu['ID'] == 1587125]
ID Year Education
1587125 1991 intermediate qualifications
1587125 1992 intermediate qualifications
1587125 1993 higher education
1587125 1994 higher education
1587125 1995 higher education
1587125 1996 higher education
1587125 1997 higher education
1587125 1998 higher education
1587125 1999 higher education
1587125 2000 higher education
1587125 2001 higher education
1587125 2002 higher education
1587125 2003 higher education
1587125 2004 Other
1587125 2005 No qualifications
1587125 2006 intermediate qualifications
1587125 2007 intermediate qualifications
1587125 2008 intermediate qualifications
1587125 2010 intermediate qualifications
1587125 2011 higher education
1587125 2012 higher education
1587125 2013 higher education
1587125 2014 higher education
1587125 2015 higher education
1587125 2016 higher education
1587125 2017 higher education
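There are 12,057 unique IDs in the data, and the Year column runs from 1991 to 2017. How can I change the Education values for all 12,057 IDs according to the conditions above? I'm not sure how to do this in a uniform way for every unique ID. The sample data used here is available at the GitHub link above. Thanks in advance.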
You can do this with an ordered categorical dtype, like this:
df = pd.read_csv('https://raw.githubusercontent.com/ENLK/Py-Projects-/master/education_val.csv')
eddtype = pd.CategoricalDtype(['No qualifications',
                               'Other',
                               'intermediate qualifications',
                               'higher education'],
                              ordered=True)
df['EducationCat'] = df['Education'].str.strip().astype(eddtype)
df['EduMax'] = df.sort_values('Year').groupby('ID')['EducationCat']\
                 .transform(lambda x: eddtype.categories[x.cat.codes.cummax()])
The steps are broken out explicitly so you can see the data manipulation I'm using.
Output:
df[df['ID'] == 1587125]
ID Year Education EducationCat EduMax
18 1587125 1991 intermediate qualifications intermediate qualifications intermediate qualifications
12075 1587125 1992 intermediate qualifications intermediate qualifications intermediate qualifications
24132 1587125 1993 higher education higher education higher education
36189 1587125 1994 higher education higher education higher education
48246 1587125 1995 higher education higher education higher education
60303 1587125 1996 higher education higher education higher education
72360 1587125 1997 higher education higher education higher education
84417 1587125 1998 higher education higher education higher education
96474 1587125 1999 higher education higher education higher education
108531 1587125 2000 higher education higher education higher education
120588 1587125 2001 higher education higher education higher education
132645 1587125 2002 higher education higher education higher education
144702 1587125 2003 higher education higher education higher education
156759 1587125 2004 Other Other higher education
168816 1587125 2005 No qualifications No qualifications higher education
180873 1587125 2006 intermediate qualifications intermediate qualifications higher education
192930 1587125 2007 intermediate qualifications intermediate qualifications higher education
204987 1587125 2008 intermediate qualifications intermediate qualifications higher education
217044 1587125 2010 intermediate qualifications intermediate qualifications higher education
229101 1587125 2011 higher education higher education higher education
241158 1587125 2012 higher education higher education higher education
253215 1587125 2013 higher education higher education higher education
265272 1587125 2014 higher education higher education higher education
277329 1587125 2015 higher education higher education higher education
289386 1587125 2016 higher education higher education higher education
301443 1587125 2017 higher education higher education higher education
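The ordered-categorical idea above can be tried on a small frame before running it on the full file. The following is a minimal sketch, not the answer's exact code: the toy data and the names toy, levels, codes and running are made up for illustration, and it uses Categorical.from_codes to turn the running maximum back into labels.

```python
import pandas as pd

# Hypothetical toy data, not taken from education_val.csv
toy = pd.DataFrame({
    'ID': [1, 1, 1, 1, 2, 2, 2],
    'Year': [1991, 1992, 1993, 1994, 1991, 1992, 1993],
    'Education': ['intermediate qualifications', 'higher education', 'Other',
                  'No qualifications', 'No qualifications',
                  'intermediate qualifications', 'Other'],
})

# Same ordering as in the answer: lowest level first, highest last
levels = ['No qualifications', 'Other',
          'intermediate qualifications', 'higher education']
eddtype = pd.CategoricalDtype(levels, ordered=True)

# Integer codes follow the category order: 0 is lowest, 3 is highest
codes = toy['Education'].astype(eddtype).cat.codes

# Running maximum of the codes within each ID, in chronological order
toy = toy.sort_values(['ID', 'Year'])
running = codes.loc[toy.index].groupby(toy['ID']).cummax()

# Translate the codes back into labels of the ordered dtype
toy['EduMax'] = pd.Categorical.from_codes(running.to_numpy(), dtype=eddtype)
print(toy)
```

In this toy frame, ID 1 stays at higher education from 1992 onward, and ID 2 stays at intermediate qualifications from 1992 onward.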
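Clearly, education levels have an order. Your problem can be restated as a "rolling max" problem: what is the highest education level a person has reached up to a given year?
Try this: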
# A dictionary mapping each label to a rank
mappings = {e: i for i, e in enumerate(['No qualifications', 'Other', 'intermediate qualifications', 'higher education'])}
# Convert the label to its rank
edu['Education'] = edu['Education'].map(mappings)
# The gist of the solution: an expanding max level of education per person
tmp = edu.sort_values('Year').groupby('ID')['Education'].expanding().max()
# The first index level in tmp is the ID, the second level is the original index
# We only need the original index, hence the droplevel
# We also convert the rank back to the label (swapping keys and values in the mappings dictionary)
tmp = tmp.droplevel(0).map({v: k for k, v in mappings.items()})
edu['Education'] = tmp
Test:
edu[edu['ID'] == 1587125]
ID Year Education
1587125 1991 intermediate qualifications
1587125 1992 intermediate qualifications
1587125 1993 higher education
1587125 1994 higher education
1587125 1995 higher education
1587125 1996 higher education
1587125 1997 higher education
1587125 1998 higher education
1587125 1999 higher education
1587125 2000 higher education
1587125 2001 higher education
1587125 2002 higher education
1587125 2003 higher education
1587125 2004 higher education
1587125 2005 higher education
1587125 2006 higher education
1587125 2007 higher education
1587125 2008 higher education
1587125 2010 higher education
1587125 2011 higher education
1587125 2012 higher education
1587125 2013 higher education
1587125 2014 higher education
1587125 2015 higher education
1587125 2016 higher education
1587125 2017 higher education
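One detail worth checking against the question: a running maximum over all four levels also carries Other forward over a later No qualifications, while the question only asks for intermediate qualifications and higher education to be carried forward. If that distinction matters, a variant along the following lines might be closer to the stated rules. This is a sketch under that assumption; the toy data and the names sticky, rank, best and labels are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not taken from education_val.csv
toy = pd.DataFrame({
    'ID': [1, 1, 1, 1, 1],
    'Year': [1991, 1992, 1993, 1994, 1995],
    'Education': ['Other', 'No qualifications', 'intermediate qualifications',
                  'Other', 'higher education'],
})

# Only the two levels named in the question get a positive rank
sticky = {'intermediate qualifications': 1, 'higher education': 2}
rank = toy['Education'].map(sticky).fillna(0).astype(int)

# Running maximum of the rank within each ID, in chronological order
toy = toy.sort_values(['ID', 'Year'])
best = rank.loc[toy.index].groupby(toy['ID']).cummax()

# Where a sticky level has been reached, use it; otherwise keep the original
labels = best.map({1: 'intermediate qualifications', 2: 'higher education'})
toy['Education'] = labels.fillna(toy['Education'])
print(toy)
```

Here the 1992 row keeps No qualifications, because no carried-forward level has been reached yet, while the 1994 row becomes intermediate qualifications.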
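You can iterate over the IDs, and then over the years. The DataFrame is in chronological order, so once a person has "higher education" or "intermediate qualifications" in one cell, you can remember that and apply it to the subsequent cells: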
# Assumes edu still has its default unique integer index (ID is a regular column)
for _, rows in edu.sort_values('Year').groupby('ID'):
    # booleans to keep track of education statuses we've seen
    higher_ed = False
    inter_qual = False
    for idx, value in rows['Education'].items():
        # check for intermediate qualifications
        if inter_qual:
            edu.at[idx, 'Education'] = 'intermediate qualifications'
        elif value == 'intermediate qualifications':
            inter_qual = True
        # check for higher education (written last, so it takes precedence)
        if higher_ed:
            edu.at[idx, 'Education'] = 'higher education'
        elif value == 'higher education':
            higher_ed = True
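It doesn't matter that we may overwrite a status more than once: if a person has both "intermediate qualifications" and "higher education", we only need to make sure that "higher education" is set last.
I generally don't recommend using for loops to process a DataFrame, but here each cell value may depend on the values above it, and the DataFrame is not so large that a loop becomes infeasible.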
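The bookkeeping in the loop above can also be written with a single "best level so far" variable instead of two booleans. The following is a minimal plain-Python sketch of that idea for one person's chronologically ordered values; the names RANK and carry_forward are hypothetical and not part of the original answer.

```python
# Levels that are carried forward, with higher education outranking intermediate
RANK = {'intermediate qualifications': 1, 'higher education': 2}

def carry_forward(values):
    """Return a new list in which the best carried-forward level seen so far
    replaces each later value; values before any such level are kept as-is."""
    best = None
    out = []
    for value in values:
        # Update the best level when a higher-ranked one appears
        if value in RANK and (best is None or RANK[value] > RANK[best]):
            best = value
        out.append(best if best is not None else value)
    return out

history = ['Other', 'intermediate qualifications', 'No qualifications',
           'higher education', 'Other']
print(carry_forward(history))
```

A function like this could then be applied to each ID's values in Year order, for example inside the per-ID loop.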