Pandas: Sort and Split a Column by Multiple Delimiters
I have a dataframe that has a very inconsistent column. For example:
import pandas as pd

df = pd.DataFrame(columns=["CID", "CM"], data=[
    ['xxx-1', 'skill_start=skill1,skill2,||skill_complete=skill1,'],
    ['xxx-2', 'survey=1||skill_start=skill1,skill3||skill_complete=skill3'],
    ['xxx-3', 'skill_start=skill2,skill3||skill_complete=skill2,skill3||abandon_custom=0'],
])
I am trying to split the CM column up. I tried this, and it got me very close:
df = df.join(df['CM'].str.split(r'\|\|', expand=True).add_prefix('CM'))
But because the data is inconsistent, the columns don't line up cleanly: the same key can land in a different position in each row. How do I split this up so that each key gets its own column?
Example desired output:
['CID', 'survey', 'skill_start', 'skill_complete', 'abandon_custom'],['xxx-1','NaN','skill1,skill2','skill1','NaN'],['xxx-2','1','skill1,skill3','skill3','NaN'],['xxx-3','NaN','skill2,skill3','skill2,skill3','0']
Have you tried splitting on multiple delimiters? I'm not sure whether this is what you were looking for:
df1 = df['CM'].str.split(r'\|\||,|=', expand=True).add_prefix('CM_')
df = pd.concat([df['CID'], df1], axis=1)
print(df)
     CID         CM_0    CM_1         CM_2            CM_3            CM_4            CM_5            CM_6  CM_7
0  xxx-1  skill_start  skill1       skill2                  skill_complete          skill1                  None
1  xxx-2       survey       1  skill_start          skill1          skill3  skill_complete          skill3  None
2  xxx-3  skill_start  skill2       skill3  skill_complete          skill2          skill3  abandon_custom     0
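Note that this split still places keys and values by position, so the same key ends up in different columns from row to row. One possible variation, sketched here under the assumption that every `||`-separated piece has the form `key=value`, is to parse each row into a dict and let pandas align the keys into columns. The helper name `parse_cm` and the variables `parsed`, `wanted`, and `result` are made up for this illustration:

```python
import pandas as pd

df = pd.DataFrame(
    columns=["CID", "CM"],
    data=[
        ["xxx-1", "skill_start=skill1,skill2,||skill_complete=skill1,"],
        ["xxx-2", "survey=1||skill_start=skill1,skill3||skill_complete=skill3"],
        ["xxx-3", "skill_start=skill2,skill3||skill_complete=skill2,skill3||abandon_custom=0"],
    ],
)


def parse_cm(cm):
    """Turn 'k1=v1||k2=v2' into {'k1': 'v1', 'k2': 'v2'} (hypothetical helper)."""
    # Split into pieces on '||', then split each piece once on '='.
    pairs = (piece.split("=", 1) for piece in cm.split("||") if "=" in piece)
    # Strip the stray trailing commas that appear in the sample data.
    return {key: value.strip(",") for key, value in pairs}


# A list of dicts becomes a frame with one column per key; missing keys become NaN.
parsed = pd.DataFrame(df["CM"].map(parse_cm).tolist(), index=df.index)

# Put the columns in the order shown in the desired output.
wanted = ["survey", "skill_start", "skill_complete", "abandon_custom"]
result = df[["CID"]].join(parsed.reindex(columns=wanted))
print(result)
```

Because the columns are keyed by name rather than by position, a row that lacks a key simply gets NaN in that column.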
I solved it!
The solution was to use a regex extractor (str.extract with a named group per key) to build new dataframes containing just the values I was looking for, use get_dummies where needed, and then join those back to the main dataframe.
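# Note: some key names here (survey_response, escalated_custom, abandoned_custom)
# differ from the sample data (survey, abandon_custom); adjust them to match your keys.
# Each list value runs up to the next '||' or the end of the string.
skill_start = df['CM'].str.extract(r'skill_start=(?P<skill_start>.*?)(?:\|\||$)')
surveys = df['CM'].str.extract(r'survey_response=(?P<survey_response>[1-5])')
skill_complete = df['CM'].str.extract(r'skill_complete=(?P<skill_complete>.*?)(?:\|\||$)')
escalated_custom = df['CM'].str.extract(r'escalated_custom=(?P<escalated_custom>[01])')
abandoned_custom = df['CM'].str.extract(r'abandoned_custom=(?P<abandoned_custom>[01])')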
# Add one indicator column per skill, then prefix so the two groups don't collide.
skill_start = pd.concat([skill_start, skill_start.skill_start.str.get_dummies(sep=',')], axis=1)
skill_start = skill_start.add_prefix('skill_start:')
skill_complete = pd.concat([skill_complete, skill_complete.skill_complete.str.get_dummies(sep=',')], axis=1)
skill_complete = skill_complete.add_prefix('skill_complete:')
new_df = df.join(surveys).join(skill_start).join(skill_complete).join(escalated_custom).join(abandoned_custom)
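To make the get_dummies step concrete, here is a small standalone sketch of what `str.get_dummies(sep=',')` produces for comma-separated skill lists. The Series `skills` and its values are made up for illustration, chosen to resemble the sample data (including a trailing comma and a missing value):

```python
import pandas as pd

# Hypothetical extracted values: a trailing comma in the first row, a missing value in the last.
skills = pd.Series(["skill1,skill2,", "skill1,skill3", None], name="skill_start")

# One indicator column per distinct skill; empty pieces are ignored
# and a missing value yields a row of zeros.
dummies = skills.str.get_dummies(sep=",").add_prefix("skill_start:")
print(dummies)
```

Each row gets a 1 in the column of every skill it lists, which is easier to filter and aggregate on than the raw comma-separated string.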