
Check if a column of strings from one dataframe contains a substring from a column in another dataframe, and output its mapped data

I have several dataframes. The first dataframe contains some strings:

import pandas as pd

df_string = pd.DataFrame({'idx':[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'string':['xx01122txt01', 'bea2125', 'spddoc0010', 'bon007', 'xx001122xls04', 'spdxls1122', 'bea1234', 'bon1234', 'xy02125doc00', 'irnppt1260']})

The other three dataframes each hold substrings together with the data they map to:

df_name = pd.DataFrame({'code':['bon', 'bea', 'spd'],
                   'name':['james bond', 'mr bean', 'spider man']})
df_type = pd.DataFrame({'code':['doc', 'txt', 'xls'],
                   'type':['document', 'text', 'excel']})
df_desc = pd.DataFrame({'id':['1122', '1234', '2990', '2125'],
                   'desc':['facebook', 'twitter', 'instagram', 'snapchat']})

What I want to do is look for these substrings in the string column and build a new dataframe from the mapped data. It needs to look like this:

df_output = pd.DataFrame({'idx': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'string': ['xx01122txt01', 'bea2125', 'spddoc0010', 'bon007', 'xx001122xls04', 'spdxls1122', 'bea1234', 'bon1234', 'xy02125doc00', 'irnppt1260'],
                   'desc': ['facebook', 'snapchat', '-', '-', 'facebook', 'facebook', 'twitter', 'twitter', 'snapchat', '-'],
                   'type': ['text', '-', 'document', '-', 'excel', 'excel', '-', '-', 'document', '-'],
                   'name': ['-', 'mr bean', 'spider man', 'james bond', '-', 'spider man', 'mr bean', 'james bond', '-', '-']})

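You can use .str.extract to pull the matching substring out of the string column, then map it to the other column of df_desc, df_type and df_name. Note that .str.extract returns only the first match in each string.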
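
To see the two steps in isolation, here is a minimal sketch on toy data. The Series `s`, the `lookup` table and its `code`/`label` columns are made up for illustration and are not part of the question.

```python
import pandas as pd

# Toy data, invented for this illustration (not the question's dataframes).
s = pd.Series(['aa11', 'bb22', 'cc33'])
lookup = pd.DataFrame({'code': ['11', '22'], 'label': ['one', 'two']})

# Step 1: build an alternation pattern with one capture group: '(11|22)'.
pattern = '(' + '|'.join(lookup['code']) + ')'

# Step 2: extract the first matching code per row; rows without a match get NaN.
# With a single capture group, str.extract returns a one-column DataFrame,
# so [0] selects that column as a Series.
key = s.str.extract(pattern)[0]

# Step 3: translate each extracted code through a Series indexed by code.
label = key.map(lookup.set_index('code')['label'])

print(pd.DataFrame({'s': s, 'key': key, 'label': label}))
```

Applied to the three lookup tables from the question, this becomes: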

df_string['desc'] = df_string['string'].str.extract('('+'|'.join(df_desc['id'])+')')[0].map(df_desc.set_index('id')['desc'])
df_string['type'] = df_string['string'].str.extract('('+'|'.join(df_type['code'])+')')[0].map(df_type.set_index('code')['type'])
df_string['name'] = df_string['string'].str.extract('('+'|'.join(df_name['code'])+')')[0].map(df_name.set_index('code')['name'])
print(df_string)


   idx         string      desc      type        name
0    1   xx01122txt01  facebook      text         NaN
1    2        bea2125  snapchat       NaN     mr bean
2    3     spddoc0010       NaN  document  spider man
3    4         bon007       NaN       NaN  james bond
4    5  xx001122xls04  facebook     excel         NaN
5    6     spdxls1122  facebook     excel  spider man
6    7        bea1234   twitter       NaN     mr bean
7    8        bon1234   twitter       NaN  james bond
8    9   xy02125doc00  snapchat  document         NaN
9   10     irnppt1260       NaN       NaN         NaN

Rows with no match are NaN. Fill them with '-' to get the desired output:

print(df_string.fillna('-'))

   idx         string      desc      type        name
0    1   xx01122txt01  facebook      text           -
1    2        bea2125  snapchat         -     mr bean
2    3     spddoc0010         -  document  spider man
3    4         bon007         -         -  james bond
4    5  xx001122xls04  facebook     excel           -
5    6     spdxls1122  facebook     excel  spider man
6    7        bea1234   twitter         -     mr bean
7    8        bon1234   twitter         -  james bond
8    9   xy02125doc00  snapchat  document           -
9   10     irnppt1260         -         -           -
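
Since the three lines differ only in the lookup table and column names, they can be folded into one helper. The following is a sketch under stated assumptions, not part of the original answer: `map_substring` is a hypothetical name, the keys are assumed to be unique strings, and `re.escape` plus the longest-first ordering are added precautions in case a key contains regex metacharacters or is a prefix of another key.

```python
import re
import pandas as pd

def map_substring(strings, lookup, key_col, value_col):
    """Hypothetical helper: map the first key found in each string to its value."""
    # Longest keys first, so that a longer key wins over a shorter one
    # starting at the same position; escape keys so they match literally.
    keys = sorted(lookup[key_col], key=len, reverse=True)
    pattern = '(' + '|'.join(re.escape(k) for k in keys) + ')'
    found = strings.str.extract(pattern)[0]  # NaN where nothing matches
    return found.map(lookup.set_index(key_col)[value_col])

# Data copied from the question.
df_string = pd.DataFrame({'idx': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                          'string': ['xx01122txt01', 'bea2125', 'spddoc0010', 'bon007',
                                     'xx001122xls04', 'spdxls1122', 'bea1234', 'bon1234',
                                     'xy02125doc00', 'irnppt1260']})
df_name = pd.DataFrame({'code': ['bon', 'bea', 'spd'],
                        'name': ['james bond', 'mr bean', 'spider man']})
df_type = pd.DataFrame({'code': ['doc', 'txt', 'xls'],
                        'type': ['document', 'text', 'excel']})
df_desc = pd.DataFrame({'id': ['1122', '1234', '2990', '2125'],
                        'desc': ['facebook', 'twitter', 'instagram', 'snapchat']})

# One pass per lookup table: (table, key column, value column).
for lookup, key_col, value_col in [(df_desc, 'id', 'desc'),
                                   (df_type, 'code', 'type'),
                                   (df_name, 'code', 'name')]:
    df_string[value_col] = map_substring(df_string['string'], lookup, key_col, value_col)

df_string = df_string.fillna('-')
print(df_string)
```

As in the one-liners above, only the first match per string is used; a string containing two different ids would need .str.findall or .str.extractall instead.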

