
Regex - Find successive 'words' containing at least 1 capital letter, one digit or one special character

I am trying to extract sequences of words containing at least 1 item of the following:

  • Uppercase character.
  • Digit
  • ':' or '-'

For example for the following phrase:

  • aBC has been contacting Maria and James where their DDD Code for system DB-54:ABB is 12343-4.

I would like to extract the following items:

  • aBC
  • Maria
  • James
  • DDD Code
  • DB-54:ABB
  • 12343-4

So far, I have the following code:

import re
re.findall(r'((\S*[A-Z|0-9|\:|\-]\w*)([\, |\.])?)', 'aBC has been contacting Maria and ere our DDD Code for system DB-54:ABB is 12343-4.')

Which returns:

[('aBC ', 'aBC', ' '),
 ('Maria ', 'Maria', ' '),
 ('DDD ', 'DDD', ' '),
 ('Code ', 'Code', ' '),
 ('DB-54:ABB ', 'DB-54:ABB', ' '),
 ('12343-4.', '12343-4', '.')]

This returns all of the desired items, except that it splits DDD and Code. My goal is to group together consecutive words that contain any of the items mentioned above. 'DDD' and 'Code' both contain a capital letter and are consecutive, so they should belong to the same string.

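You could add + to repeat the group, so that consecutive matching words are consumed as a single match. I simplified the pattern a bit, since you used backslashes where they are not needed (and inside [...] a | is a literal pipe, not alternation). This will give you the 6 matches you want: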

((\S*[A-Z0-9:-]\w*)($|[ ,.]))+


Put into code:

import re

m = re.findall(r'(((\S*[A-Z0-9:-]\w*)($|[ ,.]))+)',
               'aBC has been contacting Maria and James where their DDD Code for system DB-54:ABB is 12343-4.')

print(m)

Output:

[('aBC ', 'aBC ', 'aBC', ' '),
 ('Maria ', 'Maria ', 'Maria', ' '),
 ('James ', 'James ', 'James', ' '),
 ('DDD Code ', 'Code ', 'Code', ' '),
 ('DB-54:ABB ', 'DB-54:ABB ', 'DB-54:ABB', ' '),
 ('12343-4.', '12343-4.', '12343-4', '.')]
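
If only the grouped strings are needed, one option is to make the groups non-capturing, so that re.findall returns the whole match for each run, and then strip the trailing separator. This is a minimal sketch under that assumption, not part of the original answer; the names sentence, pattern and items are only illustrative.

```python
import re

sentence = ('aBC has been contacting Maria and James where their '
            'DDD Code for system DB-54:ABB is 12343-4.')

# Same pattern as above, but with non-capturing groups, so findall
# returns the full match for each run of consecutive words.
pattern = re.compile(r'(?:\S*[A-Z0-9:-]\w*(?:$|[ ,.]))+')

# Remove the trailing separator (space, comma or period) from each match.
items = [m.rstrip(' ,.') for m in pattern.findall(sentence)]
print(items)
```

With capturing groups, findall returns one tuple per match and a repeated group only keeps its last iteration, which is why 'Code ' rather than 'DDD ' shows up in the second column of the output above.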

This pattern doesn't split consecutive matches:

result = re.findall(r'(?:[\w0-9]*[A-Z0-9\-:]+[\w0-9]*\s*)+', text)

But you may have to strip the trailing whitespace:

list(map(str.strip, result))
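
Put together as a runnable sketch, assuming text holds the sentence from the question (the answer leaves text undefined, so its value here is my assumption, as is the name stripped):

```python
import re

# Assumed input: the example sentence from the question.
text = ('aBC has been contacting Maria and James where their '
        'DDD Code for system DB-54:ABB is 12343-4.')

# Pattern copied from the answer. Note that \w already includes digits,
# so [\w0-9] behaves the same as \w here.
result = re.findall(r'(?:[\w0-9]*[A-Z0-9\-:]+[\w0-9]*\s*)+', text)

# In Python 3, map() returns a lazy iterator; list() materializes it.
stripped = list(map(str.strip, result))
print(stripped)
```

Because the group is non-capturing, findall returns plain strings here rather than tuples, and each run of consecutive qualifying words comes back as one string.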


 