
Python: how to separate punctuation from text

I want to separate groups of punctuation from the text with spaces.

my_text = "!where??and!!or$$then:)"

I want to get ! where ?? and !! or $$ then :) as a result.

I wanted something like in JavaScript, where you can use $1 to refer to the matched string. What I have tried so far:

my_matches = re.findall('[!"\$%&\'()*+,\-.\/:;=#@?\[\\\]^_`{|}~]*', my_text)

Here my_matches is empty, so I had to delete \[\\\] from the expression:

my_matches = re.findall('[!"\$%&\'()*+,\-.\/:;=#@?\^_`{|}~]*', my_text)

I have this result:

['!', '', '', '', '', '', '??', '', '', '', '!!', '', '', '$$', '', '', '', '',
':)', '']

So I remove all the redundant entries like this:

my_matches_distinct = list(set(my_matches))

And I have a better result:

['', '??', ':)', '$$', '!', '!!']

Then I replace every match with itself surrounded by spaces:

for match in my_matches:
    if match != '':
        my_text = re.sub(match, ' ' + match + ' ', my_text)

And of course it's not working! I tried casting the match to a string, but that doesn't work either. When I put the string to replace in directly, it works, though.

But I think I'm not doing it right, because I will have problems with '!' and '!!', right?

Thanks :)

It is recommended to use raw string literals when defining a regex pattern. Also, do not escape arbitrary symbols inside a character class: only \ must always be escaped, and the others can be positioned so that they do not need escaping. Your regex can also match an empty string because of the * quantifier, which is why findall returns all those '' entries; replace it with the + quantifier. Finally, if you want to insert spaces around these symbols, use re.sub directly.

import re
my_text = "!where??and!!or$$then:)"
print(re.sub(r'[]!"$%&\'()*+,./:;=#@?[\\^_`{|}~-]+', r' \g<0> ', my_text).strip())


Details: the pattern []!"$%&\'()*+,./:;=#@?[\\^_`{|}~-]+ matches one or more symbols from the set. Only \ is escaped for the regex itself, since - is placed at the end and ] at the start of the class; the \' is there only because the pattern is written in a single-quoted string literal. The replacement inserts a space, the whole match (\g<0> is the backreference to the whole match), and another space. Finally, .strip() removes the leading and trailing whitespace left after the substitution.
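To illustrate the point about the quantifier, here is a minimal sketch that runs the same character class through re.findall with * and with +. The names punct, star_matches and plus_matches are only illustrative and do not come from the question or the answer.

```python
import re

my_text = "!where??and!!or$$then:)"

# Same character class as in the answer, without a quantifier.
punct = r'[]!"$%&\'()*+,./:;=#@?[\\^_`{|}~-]'

# With *, the pattern also succeeds on zero characters, so findall reports
# an empty match wherever no punctuation run starts.
star_matches = re.findall(punct + '*', my_text)

# With +, at least one punctuation character is required.
plus_matches = re.findall(punct + '+', my_text)

print(plus_matches)  # ['!', '??', '!!', '$$', ':)']
```

The exact number of empty strings in star_matches can differ between Python versions, so the sketch only prints the + result.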

Use the sub() function from the re module. You can do it as follows:

import re
my_text = '!where??and!!or$$then:)'
# ? and $ are added to the class so that ?? and $$ are matched as well
print(re.sub(r'([!?$@#%^&*():;"\',./\\]+)', r' \1 ', my_text).strip())

This should solve your problem. If you are familiar with regular expressions, the pattern itself is not a big deal; it is just a matter of using the right function.
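Since the question mentions JavaScript's $1, here is a small sketch of how the replacement references compare in Python. The character class is deliberately shortened for the demo and is an assumption, not the pattern from either answer.

```python
import re

text = "!where??and!!or$$then:)"

# Shortened, hypothetical class that covers only the symbols in this sample.
pattern = r'([!?$:)]+)'

# \1 and \g<1> both refer to capture group 1 (like $1 in JavaScript).
with_backslash_1 = re.sub(pattern, r' \1 ', text).strip()
with_g_1 = re.sub(pattern, r' \g<1> ', text).strip()

# \g<0> refers to the whole match (like $& in JavaScript), so the
# parentheses in the pattern are not needed for it.
with_g_0 = re.sub(pattern, r' \g<0> ', text).strip()

print(with_backslash_1)  # ! where ?? and !! or $$ then :)
```

All three calls give the same string here, because the single group spans the entire match.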

Hope this helps! Please comment if you have any queries. :)
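As for why the loop in the question fails: a likely reason is that each match is passed to re.sub as a pattern, so strings such as '??' or ':)' are parsed as regex syntax and not as literal text. The sketch below shows one way to keep the loop idea by escaping the matches with re.escape and trying longer ones first, which also covers the '!' versus '!!' concern. The names matches, ordered, pattern and result are illustrative, and this is an alternative to the single re.sub call above, not a replacement for it.

```python
import re

my_text = "!where??and!!or$$then:)"
matches = ['!', '??', '!!', '$$', ':)']

# '??' used as a pattern is not valid regex syntax, so compiling it fails.
try:
    re.sub('??', ' ?? ', my_text)
    raw_pattern_failed = False
except re.error:
    raw_pattern_failed = True

# Escape each distinct match, put longer ones first so '!!' is tried before
# '!', and join them into one alternation applied in a single pass.
ordered = sorted(set(matches), key=len, reverse=True)
pattern = '|'.join(re.escape(m) for m in ordered)

# A function replacement avoids any escaping issues in the replacement text.
result = re.sub(pattern, lambda m: ' ' + m.group(0) + ' ', my_text).strip()

print(result)  # ! where ?? and !! or $$ then :)
```

Doing the substitution in one pass matters: replacing '!' and '!!' in separate passes would pad the same characters twice.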


