
Extracting and substituting a string between characters

I have cleaned my data, which was encoded in 'utf-8'. Using .str.extract(), I converted the text enclosed in [(u'text')] to the 'text format. However, my code does not filter out garbage/Unicode characters such as "\\u09xx" and similar literals. How should I remove them?

Input:

{"HT" : ["([u'SoccerTips', u'FootballTips'],)", "([u'\u092b\u094c\u091c\u0940', u'FixedMatch', u'CT2017Final'],)"]}

My code:

df1 = df.drop('HT', axis=1).join(
             df.HT
             .str
             .split(expand=True)
             .stack()
             .reset_index(drop=True, level=1)
             .rename('HT')           
             )

df1['HT'] = df1['HT'].str.extract("u+(\'[^\']*)", expand=False).fillna('')
df1['HT'] = "#" + df1['HT']

Output:

{"HT" : ["#'SoccerTips" , "#'FootballTips", "#'\u092b\u094c\u091c\u0940", "#'FixedMatch", "#'CT2017Final"]}

Expected output:

{"HT" : ["#SoccerTips" , "#FootballTips", " ", "#FixedMatch", "#CT2017Final"]}
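The stray leading quote in the current output comes from the capture group in the pattern passed to .str.extract(): the group opens before the quote, so the quote is captured along with the tag. The sketch below reproduces this with the standard `re` module on the whitespace-split tokens. The `adjusted` pattern and the `extract` helper are illustrative assumptions, not part of the original post; the pattern moves the quote outside the group and accepts only ASCII letters, digits, and underscores, so non-ASCII tags yield an empty string.

```python
import re

# Pattern from the question: the group starts at the quote, so the quote is captured.
original = re.compile(r"u+('[^']*)")

# Adjusted pattern (an assumption for illustration): the quote is outside the
# group, and the group only accepts ASCII letters, digits and underscores.
adjusted = re.compile(r"u'([A-Za-z0-9_]+)'")

# Tokens as produced by splitting the input strings on whitespace.
tokens = ["([u'SoccerTips',", "u'FootballTips'],)", "([u'\u092b\u094c\u091c\u0940',"]


def extract(pattern, token):
    """Return the first captured group, or '' when the pattern does not match."""
    match = pattern.search(token)
    return match.group(1) if match else ""


original_out = [extract(original, t) for t in tokens]
adjusted_out = [extract(adjusted, t) for t in tokens]

print(original_out)
print(adjusted_out)
```

With pandas, the same adjusted pattern could be passed to .str.extract(..., expand=False) in place of the original one.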

A possible solution:

import pandas as pd

# the input
df1= {"HT" : ["([u'SoccerTips', u'FootballTips'],)", "([u'\u092b\u094c\u091c\u0940', u'FixedMatch', u'CT2017Final'],)"]}

# convert to Dataframe
df1= pd.DataFrame(df1)

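# cleaning: strip the "([" / "],)" wrapper, drop non-ASCII tags,
# then turn each remaining u'tag' into #tag and split into a list
df1['HT'] = df1['HT'].replace(r'\(\[|\],\)', '', regex=True)
df1['HT'] = df1['HT'].replace(r"u'[^\x00-\x7f]*'", '', regex=True)
df1['HT'] = df1['HT'].replace(r"u'([^']+)'", r'#\1', regex=True)
df1['HT'] = df1['HT'].str.split(', ')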

# final result
df1= {'HT':[j for i in df1.HT for j in i]}

# output: df1 -> {'HT': ['#SoccerTips', '#FootballTips', '', '#FixedMatch', '#CT2017Final']}
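Since each input string is itself a valid Python literal (a one-element tuple holding a list of strings), another option is to parse it rather than pattern-match it. The following is only a sketch of that alternative, not the answer's method: it uses `ast.literal_eval` and `str.isascii()` (Python 3.7+) from the standard library, and the `to_hashtags` helper is a hypothetical name introduced for illustration.

```python
import ast

raw = [
    "([u'SoccerTips', u'FootballTips'],)",
    "([u'\u092b\u094c\u091c\u0940', u'FixedMatch', u'CT2017Final'],)",
]


def to_hashtags(cell):
    """Parse one cell and prefix ASCII tags with '#'; non-ASCII tags become ''."""
    (tags,) = ast.literal_eval(cell)  # unpack the one-element tuple into its list
    return ["#" + tag if tag.isascii() else "" for tag in tags]


result = {"HT": [tag for cell in raw for tag in to_hashtags(cell)]}
print(result)
```

If the cells contain escape sequences written out as text (a backslash followed by u09xx) instead of the characters themselves, `ast.literal_eval` decodes them into the real characters, so the same ASCII check still applies.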
