Pandas: lookup values in DataFrame, where source column has multiple members
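I have a DataFrame with data and a second DataFrame that acts as a database for looking up and returning values. I can't use a plain merge, because some rows contain multiple lookup keys.
Data: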
import pandas as pd

df_data = pd.DataFrame([[1000, 'Jerry', 'BR1001, BR1003, BR9009','',''],
                        [1001, 'Buck', 'BR1010, BR1011','',''],
                        [1002, 'Melanie', 'BR3009','','DPT2002'],
                        [1003, 'Perry','BR4009','',''],
                        [1004, 'Perry2','','DIST1000',''],
                        [1005, 'Eloise','','','DPT9009'],
                        [1005, 'Sharon','','','DPT9009']],
                       columns=['ID', 'Name', 'School Number','District Number','Dept. Number'])
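Given a School Number, I need to be able to extract all associated District Number and Dept. Number values. For now I only want to focus on extracting the District Number values. The problem is how to iterate over the members of a field that holds more than one.
Data to query: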
df_DB = pd.DataFrame([['DIST1000', 'BR1001', 'DPT9009','Physics'],
['DIST1000', 'BR1003', 'DPT1010','Biology'],
['DIST1000', 'BR1003', 'DPT1011','Sociology'],
['DIST1000', 'BR1010', 'DPT1012','Philosophy'],
['DIST1000', 'BR1011', 'DPT1013','Pre-K'],
['DIST1000', 'BR1012', 'DPT1014','Geology'],
['DIST1001', 'BR9009', 'DPT2001', 'Math'],
['DIST1001', 'BR3009', 'DPT2002', 'Physics'],
['DIST1001', 'BR9009', 'DPT2003', 'Pre-K'],
['DIST1001', 'BR4009', 'DPT2004', 'Economics']],
columns=['District Number', 'School Number', 'Dept. Number','Name'])
For example, note the first record in the data above, Jerry. His record has 3 School Number values assigned.
Desired output (example):
1000, 'Jerry', 'BR1001, BR1003, BR9009','DIST1001, DIST1000','DPT9009, DPT1010, DPT1011, DPT2001, DPT2003'
Do I need a function for this? If I can find the district numbers, I think I can figure out the departments.
# Convert the comma-separated string into a list of school numbers.
df_data['School Number'] = df_data['School Number'].apply(lambda x: x.split(", "))
# Explode the lists into one row per school, merge with the lookup table,
# then group by ID: keep the first Name per ID (expected to be 1:1 with ID)
# and join the de-duplicated (set) schools and districts with commas.
# A set is unordered, so the order of values inside each cell may vary.
(df_data.explode('School Number')[['ID', 'Name', 'School Number']]
    .merge(df_DB[['School Number', 'District Number']], on='School Number')
    .groupby('ID')
    .agg({'Name': 'first',
          'School Number': lambda x: ', '.join(set(x)),
          'District Number': lambda x: ', '.join(set(x))}))
Output:
         Name           School Number     District Number
ID
1000    Jerry  BR1001, BR9009, BR1003  DIST1000, DIST1001
1001     Buck          BR1011, BR1010            DIST1000
1002  Melanie                  BR3009            DIST1001
1003    Perry                  BR4009            DIST1001
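To make the explode step easier to follow, here is a minimal sketch on two rows copied from df_data. The names sample and exploded are illustrative only and are not part of the original answer; the point is simply that each comma-separated member becomes its own row, which is what lets an ordinary merge work afterwards.

```python
import pandas as pd

# Two rows copied from df_data above; only the columns needed here.
sample = pd.DataFrame(
    [[1000, 'Jerry', 'BR1001, BR1003, BR9009'],
     [1001, 'Buck', 'BR1010, BR1011']],
    columns=['ID', 'Name', 'School Number'])

# Split each comma-separated string into a list, then give every list
# element its own row; ID and Name are repeated on each new row.
exploded = (sample
            .assign(**{'School Number': sample['School Number'].str.split(', ')})
            .explode('School Number')
            .reset_index(drop=True))

print(exploded)
```

Jerry's single row turns into three rows and Buck's into two, each carrying exactly one School Number that can be matched against df_DB.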
Or, for a left join:
# how='left' keeps IDs with no matching school; y == y is False only for NaN,
# so unmatched (NaN) district values are dropped before joining.
(df_data.explode('School Number')[['ID', 'Name', 'School Number']]
    .merge(df_DB[['School Number', 'District Number']], on='School Number', how='left')
    .groupby('ID')
    .agg({'Name': 'first',
          'School Number': lambda x: ', '.join(set(x)),
          'District Number': lambda x: ', '.join(set([y for y in x if y == y]))}))
Output:
         Name           School Number     District Number
ID
1000    Jerry  BR1003, BR1001, BR9009  DIST1000, DIST1001
1001     Buck          BR1011, BR1010            DIST1000
1002  Melanie                  BR3009            DIST1001
1003    Perry                  BR4009            DIST1001
1004   Perry2
1005   Eloise
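The question also asks about Dept. Number. Assuming the same explode-and-merge pattern carries over, the sketch below wraps it in a hypothetical helper, lookup_by_school (the helper, its targets parameter, and the small data and db frames are illustrative, not from the original answer). It expects School Number to still be the original comma-separated string, and it sorts the values so the output order is stable, unlike a bare set.

```python
import pandas as pd

def lookup_by_school(data, db, targets=('District Number', 'Dept. Number')):
    """Hypothetical helper: collect the `targets` values in `db` for each ID's schools."""
    def join_unique(values):
        # Drop NaN (unmatched rows) and empty strings, de-duplicate, sort.
        return ', '.join(sorted({v for v in values if isinstance(v, str) and v}))

    # One row per (ID, school); assumes 'School Number' holds "A, B, C" strings.
    exploded = (data[['ID', 'Name', 'School Number']]
                .assign(**{'School Number': data['School Number'].str.split(', ')})
                .explode('School Number'))
    merged = exploded.merge(db[['School Number', *targets]],
                            on='School Number', how='left')
    agg = {'Name': 'first', 'School Number': join_unique}
    agg.update({col: join_unique for col in targets})
    return merged.groupby('ID').agg(agg)

# Rows copied from df_data and df_DB above (a subset, for brevity).
data = pd.DataFrame([[1000, 'Jerry', 'BR1001, BR1003, BR9009'],
                     [1004, 'Perry2', '']],
                    columns=['ID', 'Name', 'School Number'])
db = pd.DataFrame([['DIST1000', 'BR1001', 'DPT9009'],
                   ['DIST1000', 'BR1003', 'DPT1010'],
                   ['DIST1000', 'BR1003', 'DPT1011'],
                   ['DIST1001', 'BR9009', 'DPT2001'],
                   ['DIST1001', 'BR9009', 'DPT2003']],
                  columns=['District Number', 'School Number', 'Dept. Number'])

result = lookup_by_school(data, db)
print(result)
```

For Jerry this yields the same district and department values as the desired output in the question, only in sorted order; Perry2, who has no school, gets empty strings.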