Check if a column of strings from one dataframe contains a substring from a column in another dataframe, and output its mapped data
I have multiple dataframes. The first dataframe holds certain strings:
import pandas as pd

df_string = pd.DataFrame({'idx': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                          'string': ['xx01122txt01', 'bea2125', 'spddoc0010', 'bon007', 'xx001122xls04', 'spdxls1122', 'bea1234', 'bon1234', 'xy02125doc00', 'irnppt1260']})
The other three dataframes hold substrings together with the value each one maps to:
df_name = pd.DataFrame({'code': ['bon', 'bea', 'spd'],
                        'name': ['james bond', 'mr bean', 'spider man']})
df_type = pd.DataFrame({'code': ['doc', 'txt', 'xls'],
                        'type': ['document', 'text', 'excel']})
df_desc = pd.DataFrame({'id': ['1122', '1234', '2990', '2125'],
                        'desc': ['facebook', 'twitter', 'instagram', 'snapchat']})
What I want to do is look for these substrings in the string column and build a new dataframe from the mapped data. It needs to look like this:
df_output = pd.DataFrame({'idx': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                          'string': ['xx01122txt01', 'bea2125', 'spddoc0010', 'bon007', 'xx001122xls04', 'spdxls1122', 'bea1234', 'bon1234', 'xy02125doc00', 'irnppt1260'],
                          'desc': ['facebook', 'snapchat', '-', '-', 'facebook', 'facebook', 'twitter', 'twitter', 'snapchat', '-'],
                          'type': ['text', '-', 'document', '-', 'excel', 'excel', '-', '-', 'document', '-'],
                          'name': ['-', 'mr bean', 'spider man', 'james bond', '-', 'spider man', 'mr bean', 'james bond', '-', '-']})
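You can use .str.extract to pull the matching substring out of the string column, then map it to the corresponding column of df_desc, df_type and df_name. Rows with no match come back as NaN, which fillna('-') then replaces with the placeholder you want: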
df_string['desc'] = df_string['string'].str.extract('('+'|'.join(df_desc['id'])+')')[0].map(df_desc.set_index('id')['desc'])
df_string['type'] = df_string['string'].str.extract('('+'|'.join(df_type['code'])+')')[0].map(df_type.set_index('code')['type'])
df_string['name'] = df_string['string'].str.extract('('+'|'.join(df_name['code'])+')')[0].map(df_name.set_index('code')['name'])
print(df_string)
   idx         string      desc      type        name
0    1   xx01122txt01  facebook      text         NaN
1    2        bea2125  snapchat       NaN     mr bean
2    3     spddoc0010       NaN  document  spider man
3    4         bon007       NaN       NaN  james bond
4    5  xx001122xls04  facebook     excel         NaN
5    6     spdxls1122  facebook     excel  spider man
6    7        bea1234   twitter       NaN     mr bean
7    8        bon1234   twitter       NaN  james bond
8    9   xy02125doc00  snapchat  document         NaN
9   10     irnppt1260       NaN       NaN         NaN
print(df_string.fillna('-'))
   idx         string      desc      type        name
0    1   xx01122txt01  facebook      text           -
1    2        bea2125  snapchat         -     mr bean
2    3     spddoc0010         -  document  spider man
3    4         bon007         -         -  james bond
4    5  xx001122xls04  facebook     excel           -
5    6     spdxls1122  facebook     excel  spider man
6    7        bea1234   twitter         -     mr bean
7    8        bon1234   twitter         -  james bond
8    9   xy02125doc00  snapchat  document           -
9   10     irnppt1260         -         -           -
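The three near-identical lines above can also be folded into one small helper. The following is only a sketch of that idea, not part of the original answer: the name map_substring and its parameters are hypothetical, and wrapping each key in re.escape is an added precaution for keys that might contain regex metacharacters (the sample keys contain none). Like the original, it keeps only the first match found in each string.

```python
import re

import pandas as pd


def map_substring(strings, lookup, key_col, value_col):
    """Hypothetical helper: find the first key from lookup[key_col] that
    occurs in each string, then map it to lookup[value_col]."""
    # Build one alternation pattern such as "(1122|1234|2990|2125)".
    # re.escape is an added safeguard, not in the original answer.
    pattern = "(" + "|".join(re.escape(key) for key in lookup[key_col]) + ")"
    # expand=False returns a Series for a single capture group.
    found = strings.str.extract(pattern, expand=False)
    # Rows without a match stay NaN after the map.
    return found.map(lookup.set_index(key_col)[value_col])


df_string = pd.DataFrame({
    'idx': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'string': ['xx01122txt01', 'bea2125', 'spddoc0010', 'bon007',
               'xx001122xls04', 'spdxls1122', 'bea1234', 'bon1234',
               'xy02125doc00', 'irnppt1260'],
})
df_name = pd.DataFrame({'code': ['bon', 'bea', 'spd'],
                        'name': ['james bond', 'mr bean', 'spider man']})
df_type = pd.DataFrame({'code': ['doc', 'txt', 'xls'],
                        'type': ['document', 'text', 'excel']})
df_desc = pd.DataFrame({'id': ['1122', '1234', '2990', '2125'],
                        'desc': ['facebook', 'twitter', 'instagram', 'snapchat']})

# (lookup frame, key column, value column) in the desired output order.
lookups = [(df_desc, 'id', 'desc'),
           (df_type, 'code', 'type'),
           (df_name, 'code', 'name')]

result = df_string.copy()
for lookup, key_col, value_col in lookups:
    result[value_col] = map_substring(result['string'], lookup, key_col, value_col)

# Replace unmatched rows with the '-' placeholder from the question.
result = result.fillna('-')
print(result)
```

Working on a copy leaves df_string untouched, and supporting another lookup table only takes one more tuple in lookups.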