Finding correct person among multiple person names
I have a dataframe. One of its columns holds a single value per row, and the corresponding column holds a set of values.
df = pd.DataFrame()
Index  Values_1                               Values_2
1      Muhammad bin Bashr bin al-Farafsa      Isma'il bin Abi Khalid al-Ahmsi [11418], Hisham bin 'Urwa [11065], Yahya bin Sa'id bin Hiyan [11404]
1      Muhammad bin Bkar bin Bilal            Sa'id bin Basahyr al-Azdi [20710], Sa'id bin 'Abdul 'Aziz al-Tanuqi [20638]
1      Muhammad bin Bashar Bindar             Mua'dh bin Hisham bin Aby [20287], Yahya bin Sa'id bin Farroukh al-Qatan [20031]
2      Yahya bin Sa'id bin Farroukh al-Qatan  Y'aqub bin Ibrahim bin Kathir [30400], Sh'uba[198]
2      Yahya bin Sa'd ibn Abi Waqqas          Sa'd ibn Abi Waqqas [9]
3      Hamza bin al-Mughira bin Shu'ba        al-Mughira ibn Shu'ba [166]
3      Shu'ba                                 Yahya bin Sa'id al khudri
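For anyone who wants to reproduce this, the sample table could be built roughly as follows. This is only a sketch: it assumes that Index is the DataFrame index (the answer's use of groupby(level=0) suggests so) and that cells wrapped over several lines are joined with single spaces.

```python
import pandas as pd

# Rows copied from the table in the question; the bracketed numbers are kept
# as part of the Values_2 text.
rows = [
    (1, "Muhammad bin Bashr bin al-Farafsa",
     "Isma'il bin Abi Khalid al-Ahmsi [11418], Hisham bin 'Urwa [11065], "
     "Yahya bin Sa'id bin Hiyan [11404]"),
    (1, "Muhammad bin Bkar bin Bilal",
     "Sa'id bin Basahyr al-Azdi [20710], Sa'id bin 'Abdul 'Aziz al-Tanuqi [20638]"),
    (1, "Muhammad bin Bashar Bindar",
     "Mua'dh bin Hisham bin Aby [20287], Yahya bin Sa'id bin Farroukh al-Qatan [20031]"),
    (2, "Yahya bin Sa'id bin Farroukh al-Qatan",
     "Y'aqub bin Ibrahim bin Kathir [30400], Sh'uba[198]"),
    (2, "Yahya bin Sa'd ibn Abi Waqqas", "Sa'd ibn Abi Waqqas [9]"),
    (3, "Hamza bin al-Mughira bin Shu'ba", "al-Mughira ibn Shu'ba [166]"),
    (3, "Shu'ba", "Yahya bin Sa'id al khudri"),
]

# Assumption: "Index" is the (non-unique) DataFrame index, not a regular column.
df = pd.DataFrame(rows, columns=["Index", "Values_1", "Values_2"]).set_index("Index")
print(df.shape)
```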
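I have to check whether a Values_1 at index 2 is present in any Values_2 at index 1, after first grouping the values by index. For example, check whether Yahya bin Sa'id bin Farroukh al-Qatan is present in any of the Values_2 entries that appear at index 1.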
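To make the single check from the example concrete, here is a minimal sketch on a cut-down copy of the data (three rows only). It assumes a plain, literal substring test is what is wanted; the variable names name, group_1 and hits are made up for illustration.

```python
import pandas as pd

# Cut-down sample: two rows from index 1 and one row from index 2.
df = pd.DataFrame(
    {
        "Values_1": [
            "Muhammad bin Bkar bin Bilal",
            "Muhammad bin Bashar Bindar",
            "Yahya bin Sa'id bin Farroukh al-Qatan",
        ],
        "Values_2": [
            "Sa'id bin Basahyr al-Azdi [20710], Sa'id bin 'Abdul 'Aziz al-Tanuqi [20638]",
            "Mua'dh bin Hisham bin Aby [20287], Yahya bin Sa'id bin Farroukh al-Qatan [20031]",
            "Y'aqub bin Ibrahim bin Kathir [30400], Sh'uba[198]",
        ],
    },
    index=pd.Index([1, 1, 2], name="Index"),
)

name = "Yahya bin Sa'id bin Farroukh al-Qatan"  # a Values_1 from index 2
group_1 = df.loc[[1]]  # list indexer: always returns a DataFrame

# regex=False treats the name as literal text, so brackets and apostrophes
# are not interpreted as pattern syntax.
hits = group_1["Values_2"].str.contains(name, regex=False)

# to_numpy() gives a positional mask, which sidesteps the duplicate index labels.
print(group_1.loc[hits.to_numpy(), "Values_1"].tolist())
# ['Muhammad bin Bashar Bindar']
```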
Expected output:
Index  Values_1                               Values_2
1      Muhammad bin Bashar Bindar             Mua'dh bin Hisham bin Aby [20287], Yahya bin Sa'id bin Farroukh al-Qatan [20031]
2      Yahya bin Sa'id bin Farroukh al-Qatan  Y'aqub bin Ibrahim bin Kathir [30400], Sh'uba[198]
3      Shu'ba                                 Yahya bin Sa'id al_Khudri
Use:
# collect each group's Values_1 into a list, then subtract 1 from the index
# so that every group's names line up with the previous group
s = df.groupby(level=0)['Values_1'].agg(list)
s.index = s.index - 1
print(s)
Index
0 [Muhammad bin Bashr bin al-Farafsa, Muhammad b...
1 [Yahya bin Sa'id bin Farroukh al-Qatan, Yahya ...
2 [Hamza bin al-Mughira bin Shu'ba, Shu'ba]
Name: Values_1, dtype: object
# index labels that have no following group map to NaN; replace those with an empty list
df['test'] = df.index.map(s).map(lambda x: [] if isinstance(x, float) else x)
# test whether at least one name from the next group occurs in this row's Values_2
f = lambda x: any(y in x['Values_2'] for y in x['test'])
mask = df.apply(f, axis=1)
# filter by the mask and remove the helper column
df = df[mask].drop('test', axis=1)
print(df)
Values_1 \
Index
1 Muhammad bin Bashar Bindar
Values_2
Index
1 Mua'dh bin Hisham bin Aby [20287], Yahya bin S...
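The steps above can also be wrapped into one self-contained function. The following is only a sketch of the same idea, not the answerer's exact code: the function name keep_rows_matching_next_group and the dict-based lookup are my own choices, and like the answer it assumes integer index labels where group k + 1 directly follows group k.

```python
import pandas as pd


def keep_rows_matching_next_group(df, left="Values_1", right="Values_2"):
    """Keep rows whose `right` text contains any `left` value of the next index group."""
    # Map (group label - 1) -> list of that group's `left` values, so the rows
    # of group k look up the names of group k + 1.
    lookup = {key - 1: list(values) for key, values in df.groupby(level=0)[left]}
    mask = [
        # Plain substring test; the last group has no successor and gets [].
        any(name in text for name in lookup.get(label, []))
        for label, text in zip(df.index, df[right])
    ]
    return df[mask]


# Cut-down sample taken from the question's table.
df = pd.DataFrame(
    {
        "Values_1": [
            "Muhammad bin Bkar bin Bilal",
            "Muhammad bin Bashar Bindar",
            "Yahya bin Sa'id bin Farroukh al-Qatan",
            "Shu'ba",
        ],
        "Values_2": [
            "Sa'id bin Basahyr al-Azdi [20710], Sa'id bin 'Abdul 'Aziz al-Tanuqi [20638]",
            "Mua'dh bin Hisham bin Aby [20287], Yahya bin Sa'id bin Farroukh al-Qatan [20031]",
            "Y'aqub bin Ibrahim bin Kathir [30400], Sh'uba[198]",
            "Yahya bin Sa'id al khudri",
        ],
    },
    index=pd.Index([1, 1, 2, 3], name="Index"),
)

result = keep_rows_matching_next_group(df)
print(result["Values_1"].tolist())
# ['Muhammad bin Bashar Bindar']
```

Note that the row at index 2 is dropped here as well: its Values_2 spells the name Sh'uba while the Values_1 at index 3 is Shu'ba, and a plain substring test does not treat those spellings as equal.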