
Python regex: removing all special characters and numbers NOT attached to words

I am trying to remove all special characters and numbers in Python, except numbers that are directly attached to words.

I have succeeded in removing all special characters and numbers, whether or not they are attached to words. How can I do it so that numbers attached to words are not removed?

Here's what I did:

import regex as re  # third-party "regex" module, needed for \p{P} and \p{S}
string = "win32 backdoor guid:64664646 DNS-lookup h0lla"
re.findall(r'[^\p{P}\p{S}\s\d]+', string.lower())

I get as output

win backdoor guid DNS lookup h lla

But I want to get:

win32 backdoor guid DNS lookup h0lla

demo: https://regex101.com/r/x4HrGo/1

To match alphanumeric strings or letter-only words, you can use the following pattern with re:

import re
# ...
re.findall(r'(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*|[^\W\d_]+', text.lower())

See the regex demo.

Details

  • (?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]* - either 1+ letters followed with a digit, or 1+ digits followed with a letter, and then 0+ letters/digits
  • | - or
  • [^\W\d_]+ - any 1+ Unicode letters
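
As a quick check, here is a minimal sketch that runs this pattern against the sample string from the question. The names PATTERN, text and tokens are only illustrative and do not come from the original answer.

```python
import re

# The pattern from this answer, compiled once. Python 3 str patterns are
# Unicode-aware by default, so [^\W\d_] covers non-ASCII letters too.
PATTERN = re.compile(r'(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*|[^\W\d_]+')

text = "win32 backdoor guid:64664646 DNS-lookup h0lla"
tokens = PATTERN.findall(text.lower())
print(tokens)
# ['win32', 'backdoor', 'guid', 'dns', 'lookup', 'h0lla']
```

Note that .lower() also lowercases DNS, so drop that call if the original casing should be preserved.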

NOTE: It is equivalent to the \d*[^\W\d_][^\W_]* pattern posted by PJProudhon, which matches any chunk of 1+ alphanumeric characters with at least 1 letter in it.

You could try \b\d*[^\W\d_][^\W_]*\b

Decomposition:

\b       # word boundary
\d*      # zero or more digits
[^\W\d_] # one alphabetic character
[^\W_]*  # zero or more alphanumeric characters
\b       # word boundary
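
The decomposition above can be written directly as a verbose regex. This is only a sketch: the names WORD, text and words are illustrative, and the sample string is the one from the question with "01ex" appended to show a leading-digit case.

```python
import re

# re.VERBOSE lets the pattern carry its own comments
WORD = re.compile(r"""
    \b          # word boundary
    \d*         # zero or more digits
    [^\W\d_]    # one alphabetic character
    [^\W_]*     # zero or more alphanumeric characters
    \b          # word boundary
""", re.VERBOSE)

text = "win32 backdoor guid:64664646 DNS-lookup h0lla 01ex"
words = WORD.findall(text)
print(words)
# ['win32', 'backdoor', 'guid', 'DNS', 'lookup', 'h0lla', '01ex']
```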

For beginners:

[^\W] is a typical double-negated construct. \W is the negation of \w, which matches any alphanumeric character plus _ (in ASCII terms, [a-zA-Z0-9_]). So [^\W] matches any character which is not a non-word character, that is, any alphanumeric character or _.

This proves useful here for composing:

  • Any alphanumeric character = [^\W_] matches any character which is not non-[alphanumeric or _] and is not _.
  • Any alphabetic character = [^\W\d_] matches any character which is not non-[alphanumeric or _], is not a digit (\d), and is not _.
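
To see what the two composed classes accept, here is a small sketch that tests a few single characters. ALNUM and ALPHA are hypothetical names chosen for this illustration.

```python
import re

ALNUM = re.compile(r'[^\W_]')    # letter or digit, but not underscore
ALPHA = re.compile(r'[^\W\d_]')  # letter only

# '\u00e9' is an accented e, included to show that the classes are
# Unicode-aware for str patterns in Python 3
for ch in ['a', 'Z', '\u00e9', '7', '_', '-']:
    print(repr(ch), bool(ALNUM.fullmatch(ch)), bool(ALPHA.fullmatch(ch)))
```

Letters pass both classes, the digit passes only ALNUM, and the underscore and hyphen pass neither.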

Some further reading here.


Edit:
When _ is also considered a word delimiter, just drop the word boundaries and use \d*[^\W\d_][^\W_]*. The boundaries get in the way because \b treats _ as a word character, so there is no boundary between _ and an adjacent letter or digit.
The default greediness of the star operator ensures that all relevant characters are still matched.
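
A minimal sketch of the difference, using a made-up input in which _ joins the words (the string and variable names are assumptions for illustration only):

```python
import re

text = "win32_backdoor h0lla_64"

# With word boundaries: \b never fires next to "_", so nothing matches here
with_boundaries = re.findall(r'\b\d*[^\W\d_][^\W_]*\b', text)
print(with_boundaries)
# []

# Without word boundaries: "_" simply separates the chunks
without_boundaries = re.findall(r'\d*[^\W\d_][^\W_]*', text)
print(without_boundaries)
# ['win32', 'backdoor', 'h0lla']
```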

Demo.

Try this RegEx instead:

([A-Za-z]+(\d)*[A-Za-z]*)

You can expand it from here, for example by flipping the * and + on the first and last sets to capture strings like "01ex" as well as "win32".
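
Here is a rough sketch of how this pattern behaves with re.findall. It uses a non-capturing rewrite (an assumption on my side, since findall returns tuples when the pattern contains groups), and "01ex" is appended to the question's sample string to exercise the leading-digit case.

```python
import re

text = "win32 backdoor guid:64664646 DNS-lookup h0lla 01ex"

# Same pattern without the capture groups, so findall returns plain strings
letters_first = re.findall(r'[A-Za-z]+\d*[A-Za-z]*', text)
print(letters_first)
# ['win32', 'backdoor', 'guid', 'DNS', 'lookup', 'h0lla', 'ex']

# Quantifiers flipped on the first and last sets
digits_first = re.findall(r'[A-Za-z]*\d*[A-Za-z]+', text)
print(digits_first)
# ['win', 'backdoor', 'guid', 'DNS', 'lookup', 'h0lla', '01ex']
```

In this check, each variant handles one of "win32" and "01ex" but not both, so covering both cases would need further changes to the pattern. The ASCII-only [A-Za-z] ranges also skip accented letters.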
