
Strip punctuation with regular expression - python

I would like to strip all punctuation (except the dot) from the beginning and end of a string, but not from the middle of it.

For instance, for the original string:

@#%%.Hol$a.A.$%

I would like to get .Hol$aA, with the punctuation removed from the beginning and end but not from the middle of the word.

Another example could be for the string:

@#%%...&Hol$a.A....$%

In this case the returned string should be ..&Hol$aA... because we do not care if the allowed characters are repeated.

The idea is to remove all punctuation (except the dot) only at the beginning and end of the word. A word is defined as \w and/or a dot (.).

A practical example is the string 'Barnes&Nobles'. For text analysis it is important to recognize Barnes&Nobles as a single entity, but without the '

How to accomplish the goal using Regex?

Use this simple and easily adaptable regex:

[\w.].*[\w.]

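It matches from the first allowed character to the last allowed character, so punctuation outside that span is dropped and everything in between is kept. Note that it needs at least two allowed characters to match.

  • [\w.] matches one word character (letter, digit, or underscore) or a dot
  • .* greedily matches any run of characters (except newlines by default), so the match extends as far as possible
  • [\w.] matches one word character or a dot, which ends the match at the last allowed character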

To change the delimiters, simply change the set of allowed characters inside the [] brackets.
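As a rough sketch of that adaptation, the helper below builds the pattern from a string of extra allowed characters. The function name strip_outer and its extra parameter are made up for illustration and are not part of the original answer.

```python
import re

def strip_outer(text, extra="."):
    """Return the span from the first to the last allowed character.

    Allowed characters are word characters plus every character in
    `extra` (a hypothetical parameter, escaped so it is safe in a class).
    """
    allowed = r"[\w" + re.escape(extra) + "]"
    match = re.search(allowed + ".*" + allowed, text)
    # re.search returns None when fewer than two allowed characters exist.
    return match.group(0) if match else None

print(strip_outer("@#%%.Hol$a.A.$%"))      # .Hol$a.A.
print(strip_outer("@#%%.Hol$a.A.$%", ""))  # Hol$a.A
print(strip_outer("'Barnes&Nobles'"))      # Barnes&Nobles
```

Passing an empty string for extra leaves only \w as the delimiter set, so the outer dots are stripped as well.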

Check this regex out on regex101.com

import re
data = '@#%%.Hol$a.A.$%'
pattern = r'[\w.].*[\w.]'
print(re.search(pattern, data).group(0))
# Output: .Hol$a.A.
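Because the search pattern needs two allowed characters, it finds nothing in a string such as '@a@'. One possible alternative, which is not from the original answer, is to delete the unwanted runs at both ends with re.sub instead. The function name strip_edges is hypothetical.

```python
import re

def strip_edges(text):
    # Delete runs of characters other than \w and "." at the start and at the end.
    return re.sub(r"^[^\w.]+|[^\w.]+$", "", text)

print(strip_edges("@#%%.Hol$a.A.$%"))  # .Hol$a.A.
print(strip_edges("'Barnes&Nobles'"))  # Barnes&Nobles
print(strip_edges("@a@"))              # a
```

This variant always returns a string, so there is no None case to handle. A string made up only of punctuation comes back empty.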

Depending on what you mean by stripping the punctuation, you can adapt the following code:

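import re
# The dots are escaped so they match a literal "." only; group(1) spans the first to the last dot.
res = re.search(r"^[^.]*(\.[^.]*\.([^.]*\.)*?)[^.]*$", "@#%%.Hol$a.A.$%")
# re.search returns None when the string contains fewer than two dots.
mystr = res.group(1) if res else None
print(mystr)  # .Hol$a.A.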

This strips everything before the first dot and after the last dot in the expression. Note that re.search returns None when the string does not match, so check the result before calling group().
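To see how the two answers differ, the sketch below runs both patterns on an input that has word characters outside the outermost dots. The sample string is made up for illustration, and the second pattern is written with its dots escaped as \. so that they match literal dots.

```python
import re

sample = "ab@.Hol$a.A.cd%"  # hypothetical input, not from the question

# First answer: span from the first to the last character in [\w.].
first = re.search(r"[\w.].*[\w.]", sample)
# Second answer: span from the first to the last literal dot.
second = re.search(r"^[^.]*(\.[^.]*\.([^.]*\.)*?)[^.]*$", sample)

print(first.group(0) if first else None)    # ab@.Hol$a.A.cd
print(second.group(1) if second else None)  # .Hol$a.A.
```

The first pattern keeps the outer letters because they are allowed characters. The second drops them because it anchors on dots only.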


 