
Extract all strings from a line, excluding matches of multiple regex patterns

I have these regex patterns, which I use to extract specific strings from text. I am using Python 3.

'\d{2}\/\d{2} ' - Extract date dd/mm

'\S+\.\d\d' - Extract amounts with 2 decimals

' \d{6} ' - Extract ref no, 6 digits

Now I want to extract whatever is left after extracting this data (for the sample below, that would be "DUITNOW TRSF XXuu9876 CR ANG BENG KHOON").

What kind of regex pattern should I write?

Sample text:

"31/12 DUITNOW TRSF XXuu9876 CR 004085 ANG BENG KHOON 40,000.00 2,059,044.30"

I appreciate your help. Thanks.

Try it this way, removing each pattern in turn with re.sub:

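import re

s = "31/12 DUITNOW TRSF CR 004085 ANG BENG KHOON 40,000.00 2,059,044.30"
print(s)
s1 = re.sub(r'\d{2}\/\d{2} ', '', s)  # drop the dd/mm date
print(s1)
s2 = re.sub(r'\S+\.\d\d', '', s1)     # drop amounts with 2 decimals
print(s2)
s3 = re.sub(r'\d{6}', '', s2)         # drop the 6-digit ref no
print(s3)
# s3 is now 'DUITNOW TRSF CR  ANG BENG KHOON  '
# (the spaces around the removed matches are left behind)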
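The chained substitutions leave gaps where the matches used to be. Below is a minimal sketch of one way to tidy that up, assuming it is acceptable to collapse every run of whitespace into a single space. The helper name `strip_fields` and the `PATTERNS` list are hypothetical and only wrap the same three patterns. It is run here on the full sample from the question, including the `XXuu9876` token.

```python
import re

# The same three patterns as above, applied in order.
PATTERNS = [r'\d{2}\/\d{2} ', r'\S+\.\d\d', r'\d{6}']

def strip_fields(line):
    """Remove every pattern match, then collapse leftover whitespace."""
    for pattern in PATTERNS:
        line = re.sub(pattern, '', line)
    # str.split() with no arguments splits on any whitespace run
    # and drops leading/trailing whitespace.
    return ' '.join(line.split())

sample = "31/12 DUITNOW TRSF XXuu9876 CR 004085 ANG BENG KHOON 40,000.00 2,059,044.30"
print(strip_fields(sample))
# => DUITNOW TRSF XXuu9876 CR ANG BENG KHOON
```

Note that the bare \d{6} would also remove six digits from inside a longer digit run, so this sketch relies on the other tokens not containing six or more consecutive digits.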

You can combine the patterns you already have and use re.split on the string (I have tweaked the patterns a bit, though):

import re
p = r'\s*(?:\d{2}\/\d{2}(?!\S)|\S+\.\d\d|(?<!\S)\d{6}(?!\S))\s*'
text = "31/12 DUITNOW TRSF CR 004085 ANG BENG KHOON 40,000.00 2,059,044.30"
print( list(filter(None, re.split(p, text))) )
# => ['DUITNOW TRSF CR', 'ANG BENG KHOON']
print( " ".join(re.split(p, text)).strip() )
# => DUITNOW TRSF CR ANG BENG KHOON

See the regex and Python demos.

Note that the patterns are combined into a single pattern of the form \s*(?:...|...|etc.)\s*, i.e. a non-capturing group with optional whitespace patterns on both ends. The (?<!\S) and (?!\S) lookarounds are whitespace boundaries: they require whitespace or the start/end of the string on that side of the match.
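To illustrate why the whitespace boundaries matter, here is a small sketch comparing a bare \d{6} with the bounded version. The input line and its `REF1234567` token are made up for illustration and do not come from the question.

```python
import re

bare = r'\d{6}'
bounded = r'(?<!\S)\d{6}(?!\S)'

# Made-up line: "REF1234567" contains a run of seven digits.
line = "REF1234567 CR 004085 ANG"

print(re.findall(bare, line))
# => ['123456', '004085']
print(re.findall(bounded, line))
# => ['004085']
```

The bare pattern picks six digits out of the longer token, while the bounded one only accepts a six-digit run that stands alone between whitespace or the string edges.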

Matches at the start or end of the string, as well as consecutive matches, produce empty strings in the re.split result, so the resulting list must be filtered to remove them.
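As a quick check of where the empty strings come from, this sketch prints the unfiltered re.split result for the same text. The variable names `raw` and `parts` are only illustrative.

```python
import re

p = r'\s*(?:\d{2}\/\d{2}(?!\S)|\S+\.\d\d|(?<!\S)\d{6}(?!\S))\s*'
text = "31/12 DUITNOW TRSF CR 004085 ANG BENG KHOON 40,000.00 2,059,044.30"

raw = re.split(p, text)
print(raw)
# => ['', 'DUITNOW TRSF CR', 'ANG BENG KHOON', '', '']

# Keep only the non-empty pieces (equivalent to filter(None, raw)).
parts = [part for part in raw if part]
print(parts)
# => ['DUITNOW TRSF CR', 'ANG BENG KHOON']
```

The leading empty string comes from the date matching at the very start, and the two trailing ones come from the two amounts matching back to back at the end.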
