
Remove only consecutive special characters but keep consecutive [a-zA-Z0-9] and single characters

How can I remove multiple consecutive occurrences of all the special characters in a string?

I can write code like:

re.sub(r'\.\.+', ' ', string)
re.sub(r'@@+', ' ', string)
re.sub(r'\s\s+', ' ', string)

for individual characters or, at best, use a loop over all the characters in a list:

import re
from string import punctuation

for i in punctuation:
    # Escape the character, then require two or more consecutive occurrences of it.
    to = re.escape(i) + '{2,}'
    string = re.sub(to, ' ', string)

but I'm sure there is a more efficient way.

I tried:

re.sub('[^a-zA-Z0-9][^a-zA-Z0-9]+', ' ', '\n\n.AAA.x.@@+*@#=..xx000..x..\t.x..\nx*+Y.')

but it replaces every run of two or more special characters, even when they differ, and keeps only isolated single special characters.

The string may keep consecutive special characters that differ, like 99@aaaa*!@#$., but not repeats of the same one, like ++--... .

A pattern that matches any non-alphanumeric character in Python is [\W_].

So, all you need is to wrap the pattern in a capturing group and add \1+ after it to match two or more consecutive occurrences of the same non-alphanumeric character:

text = re.sub(r'([\W_])\1+',' ',text)
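To illustrate how the backreference differs from a plain character-class run, here is a minimal sketch that applies both patterns to a short, made-up sample string (the sample and variable names are only for illustration):

```python
import re

sample = "x.@@+*y"  # hypothetical input, not from the question

# Any run of two or more non-alphanumerics, whether or not they are the same.
any_run = re.sub(r'[^a-zA-Z0-9][^a-zA-Z0-9]+', ' ', sample)

# Backreference version: only runs of the *same* non-alphanumeric character.
same_run = re.sub(r'([\W_])\1+', ' ', sample)

print(repr(any_run))   # 'x y'
print(repr(same_run))  # 'x. +*y'
```

The first pattern swallows the whole `.@@+*` run, while the second touches only `@@`, because `\1` must repeat exactly the character captured by the group.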

In Python 3.x, if you want the pattern to be ASCII-only (so that \W also treats every non-ASCII character as a non-word character), use the re.A or re.ASCII flag:

text = re.sub(r'([\W_])\1+',' ',text, flags=re.A)
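As a rough illustration of what the flag changes, the sketch below runs the same pattern with and without re.A on a made-up string containing a repeated accented letter (the sample string is an assumption for demonstration only):

```python
import re

pattern = r'([\W_])\1+'
sample = "a\u00e9\u00e9!!b"  # 'aéé!!b', hypothetical input

# Default (Unicode) mode: 'é' counts as a word character, so only '!!' is replaced.
unicode_result = re.sub(pattern, ' ', sample)

# ASCII mode: 'é' is treated as a non-word character, so 'éé' is replaced too.
ascii_result = re.sub(pattern, ' ', sample, flags=re.A)

print(repr(unicode_result))  # 'aéé b'
print(repr(ascii_result))    # 'a  b'
```

Whether that is desirable depends on the data: with re.A, repeated non-ASCII letters are collapsed along with punctuation.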

Mind the use of the r prefix, which defines a raw string literal (so that you do not have to escape the \ char).
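A small sketch of why the prefix matters here: in a regular string literal, `\1` is an octal escape for the character with code 1, so the backreference never reaches the regex engine. The sample string below is made up for illustration:

```python
import re

raw = r'([\W_])\1+'           # raw literal: backslashes are kept as typed
escaped = '([\\W_])\\1+'      # equivalent regular literal with doubled backslashes
broken = '([\\W_])\1+'        # '\1' here becomes the control character '\x01'

print(raw == escaped)         # True
print('\1' == '\x01')         # True

sample = "a@@b"               # hypothetical input
fixed_result = re.sub(raw, ' ', sample)
broken_result = re.sub(broken, ' ', sample)

print(repr(fixed_result))     # 'a b'
print(repr(broken_result))    # 'a@@b' (pattern looks for a literal '\x01')
```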

Python demo:

import re
text = "\n\n.AAA.x.@@+*@#=..xx000..x..\t.x..\nx*+Y."
print(re.sub(r'([\W_])\1+',' ',text))

Output:

 .AAA.x. +*@#= xx000 x  .x 
x*+Y.
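If the substitution is applied to many strings, one possible way to package it is a small helper with a precompiled pattern. This is only a sketch: the function name `collapse_repeated_specials` and the `replacement` parameter are hypothetical, not part of the answer above. It is checked against the two examples mentioned in the question:

```python
import re

# Compile once; the pattern is the same one used in the answer.
_SAME_SPECIAL_RUN = re.compile(r'([\W_])\1+')

def collapse_repeated_specials(text, replacement=' '):
    """Replace each run of 2+ identical non-alphanumeric characters."""
    return _SAME_SPECIAL_RUN.sub(replacement, text)

# Different consecutive special characters are left alone.
print(repr(collapse_repeated_specials("99@aaaa*!@#$.")))  # '99@aaaa*!@#$.'

# Each run of the same special character becomes one replacement.
print(repr(collapse_repeated_specials("++--...")))        # '   '
```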

