I would like to normalize strings of text. I want to keep punctuation marks and other non-alphabetic characters (so that emoticons are not deleted), but at the same time insert a blank space wherever an alphabetic character is directly adjacent to a non-alphabetic one. For example, the following strings:
"*I love u*"
"Hi, life is great:)hehe"
"I will go uni.cul"
should be converted to:
"* I love u *"
"Hi , life is great :) hehe"
"I will go uni . cul"
Could you please tell me how I can write a regular expression to do this? Thanks in advance.
You can replace the matches of this expression:
(?<=[^\w\s])(?=\w)|(?<=\w)(?=[^\w\s])
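with a space. The two lookaround alternatives match the empty position between a symbol and a word character, in either order, so no characters are consumed and only a space is inserted.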
For example:
import re
# `text` holds the input string (avoid naming it `str`, which shadows the built-in)
re.sub(r'(?<=[^\w\s])(?=\w)|(?<=\w)(?=[^\w\s])', ' ', text)
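As a quick check, here is a minimal sketch that applies the expression above to the three strings from the question. The function name `add_spaces` is only an illustrative choice, and note that `\w` also matches digits and the underscore, so those count as "word" characters here.

```python
import re

# Zero-width boundary between a symbol and a word character, in either order.
BOUNDARY = re.compile(r'(?<=[^\w\s])(?=\w)|(?<=\w)(?=[^\w\s])')

def add_spaces(text):
    """Insert one space at every symbol/word boundary (hypothetical helper)."""
    return BOUNDARY.sub(' ', text)

for sample in ['*I love u*', 'Hi, life is great:)hehe', 'I will go uni.cul']:
    print(add_spaces(sample))
# * I love u *
# Hi , life is great :) hehe
# I will go uni . cul
```

Because the matches are empty positions, existing spaces are left alone and no doubled spaces appear.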
Try this:
import re
x = '''*I love u*
Hi, life is great:)hehe
I will go uni.cul'''
def rep(matchobj):
    # Surround each run of non-alphanumeric, non-space characters with spaces.
    return ' ' + matchobj.group(0) + ' '
print(re.sub(r'[^a-zA-Z0-9\s]+', rep, x).strip())
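One thing to be aware of with this callback approach: it pads every symbol run on both sides, so a symbol that already has a space next to it (as in "Hi, life") ends up followed by two spaces. If that matters, a possible follow-up step is to collapse the whitespace afterwards. The sketch below assumes each string is processed on its own (it would also collapse newlines), and `pad_symbols` is just an illustrative name.

```python
import re

def pad_symbols(text):
    """Pad symbol runs with spaces, then collapse repeated whitespace.

    Hypothetical helper; intended for a single line of text, since
    str.split() with no arguments also removes newlines.
    """
    padded = re.sub(r'[^a-zA-Z0-9\s]+', lambda m: ' ' + m.group(0) + ' ', text)
    # split() drops leading/trailing whitespace and merges runs of spaces.
    return ' '.join(padded.split())

print(pad_symbols('Hi, life is great:)hehe'))
# Hi , life is great :) hehe
```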