How to find next 9 characters after a string ignoring special characters?

Question

Consider the following string:

str_test = 'This is a sample text NRC234456789 and this is another case AZN.1.Z.3.4.S.6.7.8.9 and this another case BSA 123 456 789 and final case SSR/789456123'

Basically, I need to find the places in the string where the prefixes 'NRC', 'AZN', 'BSA' and 'SSR' occur. Then I need to extract the next 9 digits that follow, ignoring any non-digit characters.

In some cases, the digit 5 is wrongly written as an S and the digit 2 is written as a Z. I still need to identify these cases and replace the wrong S and Z with 5 and 2 respectively.

So it should return:

result = ['NRC234456789', 'AZN123456789', 'BSA123456789', 'SSR789456123']

I have this code that I am working with:

import re

list_comb = ['NRC', 'AZN', 'BSA', 'SSR']

def findWholeWord(w):
    # Returns the search method of a whole-word, case-insensitive pattern for w
    return re.compile(r'\b({0})\b'.format(w), flags=re.IGNORECASE).search

It returns the position where the strings are found, but I am not sure how to proceed from there. Thanks.
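
One thing worth checking in the snippet above: the trailing \b requires a word boundary after the prefix, and there is none between the C and the 2 in 'NRC234456789', so that case is likely to be missed. Below is a minimal sketch of a variant without the trailing \b. The name find_prefix is a hypothetical helper, not part of the original code. The end() of each match gives the index from which the digits can then be read.

```python
import re

str_test = ('This is a sample text NRC234456789 and this is another case '
            'AZN.1.Z.3.4.S.6.7.8.9 and this another case BSA 123 456 789 '
            'and final case SSR/789456123')
list_comb = ['NRC', 'AZN', 'BSA', 'SSR']

def findWholeWord(w):
    # Original version: needs a word boundary on both sides of w
    return re.compile(r'\b({0})\b'.format(w), flags=re.IGNORECASE).search

def find_prefix(w):
    # Hypothetical variant: boundary only before w, so digits may follow directly
    return re.compile(r'\b({0})'.format(re.escape(w)), flags=re.IGNORECASE).search

for w in list_comb:
    strict = findWholeWord(w)(str_test)
    loose = find_prefix(w)(str_test)
    # loose.end() is the index where the text after the prefix starts
    print(w, strict is not None, loose.end(),
          repr(str_test[loose.end():loose.end() + 4]))
```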

Answer 1

Use this regex to recognize the pattern. Maybe it can help:

import re

str_test = 'This is a sample text NRC234456789 and this is another case AZN.1.2.3.4.5.6.7.8.9 and this another case BSA 123 456 789 and final case SSR/789456123'
regex = re.findall(r"([A-Z0-9.\s/]{2,})", str_test)
result = []

One solution, if the only non-digit characters are dot, space, and slash:

for r in regex:
    result.append(r.replace(".","").replace(" ","").replace("/",""))
print (result)

Or use this loop if the non-digit characters can be anything:

for r in regex:
    # \W matches anything that is not a letter, digit, or underscore
    result.append(re.sub(r"\W", "", r))
print(result)

Output:

['NRC234456789', 'AZN123456789', 'BSA123456789', 'SSR789456123']

Update (handling S written for 5 and Z written for 2):

import re

str_test = 'This is a sample text NRC234456789 and this is another case AZN.1.Z.3.4.S.6.7.8.9 and this another case BSA 123 456 789 and final case SSR/789456123'
regex = re.findall(r"([A-Z]{3})([A-Z0-9.\s/]{2,})", str_test)
result = []
for r in regex:
    # r[0] is the three-letter prefix, r[1] the digits with their separators
    digits = re.sub(r"\W", "", r[1]).replace("Z", "2").replace("S", "5")
    result.append(r[0] + digits)

print(result)

Output:

['NRC234456789', 'AZN123456789', 'BSA123456789', 'SSR789456123']
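
One caveat with the updated pattern: [A-Z]{3} accepts any three uppercase letters, so unrelated tokens can be picked up as well. Here is a rough sketch of one way to guard against that, assuming the four prefixes from the question are the only valid ones. The sample text and the name `known` are made up for illustration.

```python
import re

known = {'NRC', 'AZN', 'BSA', 'SSR'}  # assumed to be the only valid prefixes
text = 'Shipped from USA 42 with code BSA 123 4S6 789'

pairs = re.findall(r"([A-Z]{3})([A-Z0-9.\s/]{2,})", text)

def clean(tail):
    # Drop separators, then undo the S/Z typos
    return re.sub(r"\W", "", tail).replace("Z", "2").replace("S", "5")

unfiltered = [prefix + clean(tail) for prefix, tail in pairs]
filtered = [prefix + clean(tail) for prefix, tail in pairs if prefix in known]

print(unfiltered)  # includes the unrelated 'USA' token
print(filtered)    # only entries whose prefix is in `known`
```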

Answer 2

Here is one approach:

import re

str_test = 'This is a sample text NRC234456789 and this is another case AZN.1.2.3.4.5.6.7.8.9 and this another case BSA 123 456 789 and final case SSR/789456123'
to_check = ['NRC', 'AZN', 'BSA', 'SSR']
pattern = re.compile(r"(" + "|".join(to_check) + r")([\d.\s/]+)")

for k, v in pattern.findall(str_test):
    print(k + re.sub(r"[^\d]", "", v))

Output:

NRC234456789
AZN123456789
BSA123456789
SSR789456123

Edit, as per the comment (S written for 5 and Z written for 2):

import re

str_test = 'This is a sample text NRC234456789 and this is another case AZN.1.Z.3.4.S.6.7.8.9 and this another case BSA 123 456 789 and final case SSR/789456123'
to_check = ['NRC', 'AZN', 'BSA', 'SSR']
pattern = re.compile(r"(" + "|".join(to_check) + r")([\d.\s/ZS]+)")

for k, v in pattern.findall(str_test):
    new_val = k + re.sub(r"[^\d]", "", v.replace("Z", "2").replace("S", "5"))
    print(new_val)

Answer 3

Here is a simple approach: first find the intended text using this regex,

\b(?:NRC|AZN|BSA|SSR)(?:.?\d)+

which is generated dynamically from the supplied list, and then remove any non-alphanumeric characters from each match.
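
As a minimal sketch of that first version (before the edit below), the pattern can be built from the list and applied like this. The use of re.escape is my own addition, in case a prefix ever contains a regex metacharacter. The input is the typo-free sample string.

```python
import re

s = ('This is a sample text NRC234456789 and this is another case '
     'AZN.1.2.3.4.5.6.7.8.9 and this another case BSA 123 456 789 '
     'and final case SSR/789456123')
list_comb = ['NRC', 'AZN', 'BSA', 'SSR']

# Join the prefixes into an alternation; re.escape is a precaution (assumption)
alternation = '|'.join(re.escape(p) for p in list_comb)
regex = r'\b(?:{})(?:.?\d)+'.format(alternation)
print(regex)

# Strip everything that is not a letter or digit from each match
cleaned = [re.sub(r'[^A-Za-z0-9]+', '', m) for m in re.findall(regex, s)]
print(cleaned)
```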

Edit: To handle erroneous strings where 2 is mistakenly written as Z and 5 as S, you can replace them in the second part of the string, leaving the initial three characters untouched. The code is also updated so that it picks only the next nine digits and no more. Here is my updated Python code:

import re

s = 'This is a sample text NRC234456789 and this is another case AZN.1.Z.3.4.S.6.7.8.9 and this another case BSA 123 456 789 and BSA 123 456 789 123 456 final case SSR/789456123'

list_comb = ['NRC', 'AZN', 'BSA', 'SSR']
regex = r'\b(?:{})(?:.?[\dA-Z])+'.format('|'.join(list_comb))
print(regex)

for m in re.findall(regex, s):
    m = re.sub(r'[^a-zA-Z0-9]+', '', m)
    mat = re.search(r'^(.{3})(.{9})', m)
    if mat:
        s1 = mat.group(1)
        s2 = mat.group(2).replace('S','5').replace('Z','2')
        print(s1+s2)

This prints the corrected values, with S replaced by 5 and Z by 2:

NRC234456789
AZN123456789
BSA123456789
BSA123456789
SSR789456123
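
Note that (?:.?[\dA-Z])+ can also run on into a following uppercase word, for example when one code is directly followed by another prefix. As an alternative sketch (not the code above), the nine-digit limit can be put into the pattern itself with a {9} quantifier, assuming that only S and Z ever stand in for digits. The names PREFIXES, FIX_TABLE, and extract are hypothetical.

```python
import re

PREFIXES = ['NRC', 'AZN', 'BSA', 'SSR']
FIX_TABLE = str.maketrans({'S': '5', 'Z': '2'})  # assumed typo mapping

alternation = '|'.join(map(re.escape, PREFIXES))
# Prefix, then exactly nine digit-like characters, each optionally
# preceded by any run of non-alphanumeric separators
pattern = re.compile(r'\b(' + alternation + r')((?:[^0-9A-Za-z]*[0-9SZ]){9})')

def extract(text):
    out = []
    for prefix, tail in pattern.findall(text):
        # Keep only digit-like characters, then map S -> 5 and Z -> 2
        digits = re.sub(r'[^0-9SZ]', '', tail).translate(FIX_TABLE)
        out.append(prefix + digits)
    return out

s = ('This is a sample text NRC234456789 and this is another case '
     'AZN.1.Z.3.4.S.6.7.8.9 and this another case BSA 123 456 789 and '
     'BSA 123 456 789 123 456 final case SSR/789456123')
print(extract(s))
print(extract('BSA 123 456 789 SSR/789456123'))
```

Because the match stops after the ninth digit, a prefix that follows immediately is still available for the next match.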


solution1
0 2019-04-26 11:29:51

solution2
0 2019-04-26 11:31:59

solution3
0 ACCEPTED 2019-04-26 11:44:36
