Consider the following string:
str_test = 'This is a sample text NRC234456789 and this is another case AZN.1.Z.3.4.S.6.7.8.9 and this another case BSA 123 456 789 and final case SSR/789456123'
Basically, I need to find the places in the string where the prefixes 'NRC', 'AZN', 'BSA' and 'SSR' occur. Then I need to extract the next 9 digits, ignoring any non-digit characters.
In some cases, the number 5 is wrongly written as an S and the number 2 is written as a Z. I still need to identify these cases and replace the wrong S and Z with 5 and 2 respectively. So it should return
result = ['NRC234456789', 'AZN123456789', 'BSA123456789', 'SSR789456123']
I have this code that I am working with:
import re
list_comb = ['NRC', 'AZN', 'BSA', 'SSR']
def findWholeWord(w):
    return re.compile(r'\b({0})\b'.format(w), flags=re.IGNORECASE).search
It returns the positions where the strings are found, but I am not sure how to proceed from there. Thanks
Use the regex ([A-Z0-9.\s\/]{2,}) to recognize the pattern. Maybe it can help:
import re
str_test = 'This is a sample text NRC234456789 and this is another case AZN.1.2.3.4.5.6.7.8.9 and this another case BSA 123 456 789 and final case SSR/789456123'
regex = re.findall(r"([A-Z0-9.\s/]{2,})", str_test)
result = []
One solution, if the only non-digit characters are dots, spaces, and slashes:
for r in regex:
    result.append(r.replace(".", "").replace(" ", "").replace("/", ""))
print(result)
Or use this loop if the non-digit characters can be anything:
for r in regex:
    result.append(re.sub(r"[^\d\w]", "", r))
print(result)
Output:
['NRC234456789', 'AZN123456789', 'BSA123456789', 'SSR789456123']
UPDATED
import re
str_test = 'This is a sample text NRC234456789 and this is another case AZN.1.Z.3.4.S.6.7.8.9 and this another case BSA 123 456 789 and final case SSR/789456123'
regex = re.findall(r"([A-Z]{3})([A-Z0-9.\s/]{2,})", str_test)
result = []
for r in regex:
    result.append(r[0] + re.sub(r"[^\d\w]", "", r[1]).replace("Z", "2").replace("S", "5"))
print(result)
Output:
['NRC234456789', 'AZN123456789', 'BSA123456789', 'SSR789456123']
This is one approach.
Example:
import re
str_test = 'This is a sample text NRC234456789 and this is another case AZN.1.2.3.4.5.6.7.8.9 and this another case BSA 123 456 789 and final case SSR/789456123'
to_check = ['NRC', 'AZN', 'BSA', 'SSR']
pattern = re.compile(r"(" + "|".join(to_check) + r")([\d.\s/]+)")
for k, v in pattern.findall(str_test):
    print(k + re.sub(r"[^\d]", "", v))
Output:
NRC234456789
AZN123456789
BSA123456789
SSR789456123
Edit as per comment.
import re
str_test = 'This is a sample text NRC234456789 and this is another case AZN.1.Z.3.4.S.6.7.8.9 and this another case BSA 123 456 789 and final case SSR/789456123'
to_check = ['NRC', 'AZN', 'BSA', 'SSR']
pattern = re.compile(r"(" + "|".join(to_check) + r")([\d.\s/ZS]+)")
for k, v in pattern.findall(str_test):
    new_val = k + re.sub(r"[^\d]", "", v.replace("Z", "2").replace("S", "5"))
    print(new_val)
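The same idea can also collect the values into a list, which is the form the question asks for. This is only a minimal sketch on the same sample string: re.escape and str.maketrans are standard-library calls, while the names prefixes and fix_digits are illustrative and not part of the code above.

```python
import re

str_test = ('This is a sample text NRC234456789 and this is another case '
            'AZN.1.Z.3.4.S.6.7.8.9 and this another case BSA 123 456 789 '
            'and final case SSR/789456123')
to_check = ['NRC', 'AZN', 'BSA', 'SSR']

# re.escape guards against prefixes that contain regex metacharacters.
prefixes = "|".join(re.escape(p) for p in to_check)
pattern = re.compile(r"(" + prefixes + r")([\d.\s/ZS]+)")

# Translation table that maps a misread Z to 2 and a misread S to 5.
fix_digits = str.maketrans("ZS", "25")

result = []
for k, v in pattern.findall(str_test):
    # Correct the misread letters first, then drop every non-digit character.
    result.append(k + re.sub(r"\D", "", v.translate(fix_digits)))

print(result)
# ['NRC234456789', 'AZN123456789', 'BSA123456789', 'SSR789456123']
```

Note that this version, like the code above, does not cap the number of digits at nine; it keeps every digit in the matched run.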
A simple approach is to first find the intended text using this regex,
\b(?:NRC|AZN|BSA|SSR)(?:.?\d)+
which is generated dynamically from the supplied list, and then remove any non-alphanumeric characters from each match.
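A minimal sketch of this first version, assuming the digits are written correctly (no S or Z mistakes). The sample string and the name cleaned are only for illustration, and re.escape is added as a precaution in case a prefix ever contains a regex metacharacter.

```python
import re

s = ('This is a sample text NRC234456789 and this is another case '
     'AZN.1.2.3.4.5.6.7.8.9 and this another case BSA 123 456 789 '
     'and final case SSR/789456123')
list_comb = ['NRC', 'AZN', 'BSA', 'SSR']

# Build \b(?:NRC|AZN|BSA|SSR)(?:.?\d)+ from the supplied list.
regex = r'\b(?:{})(?:.?\d)+'.format('|'.join(re.escape(p) for p in list_comb))

# Strip everything that is not a letter or digit from each match.
cleaned = [re.sub(r'[^a-zA-Z0-9]+', '', m) for m in re.findall(regex, s)]

print(cleaned)
# ['NRC234456789', 'AZN123456789', 'BSA123456789', 'SSR789456123']
```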
Edit: To handle erroneous strings where 2 is mistakenly written as Z and 5 is written as S, you can replace them in the second part of the string, ignoring the initial three characters. The code is also updated so that it picks only the next nine digits rather than more. Here is my updated Python code:
import re
s = 'This is a sample text NRC234456789 and this is another case AZN.1.Z.3.4.S.6.7.8.9 and this another case BSA 123 456 789 and BSA 123 456 789 123 456 final case SSR/789456123'
list_comb = ['NRC', 'AZN', 'BSA', 'SSR']
regex = r'\b(?:{})(?:.?[\dA-Z])+'.format('|'.join(list_comb))
print(regex)
for m in re.findall(regex, s):
    m = re.sub(r'[^a-zA-Z0-9]+', '', m)
    mat = re.search(r'^(.{3})(.{9})', m)
    if mat:
        s1 = mat.group(1)
        s2 = mat.group(2).replace('S','5').replace('Z','2')
        print(s1+s2)
Prints the corrected values, where S is replaced with 5 and Z with 2:
NRC234456789
AZN123456789
BSA123456789
BSA123456789
SSR789456123
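If a list is needed rather than printed lines, the nine-digit limit can also be expressed in the pattern itself. The following is only a sketch under stated assumptions: the function name extract_codes and its n_digits parameter are hypothetical, separators are taken to be any non-alphanumeric characters, and S and Z are treated as digit stand-ins only after the prefix.

```python
import re

def extract_codes(text, prefixes, n_digits=9):
    """Return prefix + n_digits digits for each prefix found in text."""
    alternation = '|'.join(re.escape(p) for p in prefixes)
    # Prefix, then exactly n_digits digit-like characters (0-9, S, Z),
    # each optionally preceded by non-alphanumeric separators.
    pattern = re.compile(
        r'\b({})((?:[^A-Za-z0-9]*[0-9SZ]){{{}}})'.format(alternation, n_digits)
    )
    results = []
    for prefix, tail in pattern.findall(text):
        digits = re.sub(r'[^0-9SZ]', '', tail)                # drop separators
        digits = digits.replace('S', '5').replace('Z', '2')   # fix misreads
        results.append(prefix + digits)
    return results

str_test = ('This is a sample text NRC234456789 and this is another case '
            'AZN.1.Z.3.4.S.6.7.8.9 and this another case BSA 123 456 789 '
            'and final case SSR/789456123')
list_comb = ['NRC', 'AZN', 'BSA', 'SSR']

print(extract_codes(str_test, list_comb))
# ['NRC234456789', 'AZN123456789', 'BSA123456789', 'SSR789456123']
```

With this pattern, a prefix followed by fewer than nine digit-like characters is skipped, and any extra digits after the ninth are ignored.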