
How to find next 9 characters after a string ignoring special characters?

Consider the following string:

str_test = 'This is a sample text NRC234456789 and this is another case AZN.1.Z.3.4.S.6.7.8.9 and this another case BSA 123 456 789 and final case SSR/789456123'

Basically, I need to find the places in the string where the prefixes 'NRC', 'AZN', 'BSA' and 'SSR' occur. Then I need to extract the next 9 digits that follow, ignoring any non-digit characters.

In some cases, the number 5 is wrongly written as an S and the number 2 as a Z. I still need to identify these cases and replace the wrong S and Z with 5 and 2 respectively.

So it should return

result = ['NRC234456789', 'AZN123456789', 'BSA123456789', 'SSR789456123']

I have this code that I am working with

import re

list_comb = ['NRC', 'AZN', 'BSA', 'SSR']
def findWholeWord(w):
    # Returns the search method of a case-insensitive whole-word pattern
    return re.compile(r'\b({0})\b'.format(w), flags=re.IGNORECASE).search

It returns the position where the strings are found, but I am not sure how to proceed from there. Thanks.

Use this regex to recognize the pattern. Maybe it can help:

import re

str_test = 'This is a sample text NRC234456789 and this is another case AZN.1.2.3.4.5.6.7.8.9 and this another case BSA 123 456 789 and final case SSR/789456123'
regex = re.findall(r"([A-Z0-9.\s/]{2,})", str_test)
result = []

One solution, if the only non-digit characters are dot, space, and slash:

for r in regex:
    result.append(r.replace(".", "").replace(" ", "").replace("/", ""))
print(result)

Or use this loop if the non-digit characters can be anything:

for r in regex:
    result.append(re.sub(r"\W", "", r))
print(result)

Output:

['NRC234456789', 'AZN123456789', 'BSA123456789', 'SSR789456123']

UPDATED

import re

str_test = 'This is a sample text NRC234456789 and this is another case AZN.1.Z.3.4.S.6.7.8.9 and this another case BSA 123 456 789 and final case SSR/789456123'
regex = re.findall(r"([A-Z]{3})([A-Z0-9.\s/]{2,})", str_test)
result = []
for prefix, tail in regex:
    # Drop separators, then fix Z -> 2 and S -> 5 in the numeric part only
    cleaned = re.sub(r"\W", "", tail).replace("Z", "2").replace("S", "5")
    result.append(prefix + cleaned)

print(result)

Output:

['NRC234456789', 'AZN123456789', 'BSA123456789', 'SSR789456123']

This is one approach

Ex:

import re

str_test = 'This is a sample text NRC234456789 and this is another case AZN.1.2.3.4.5.6.7.8.9 and this another case BSA 123 456 789 and final case SSR/789456123'
to_check = ['NRC', 'AZN', 'BSA', 'SSR']
pattern = re.compile(r"(" + "|".join(to_check) + r")([\d.\s/]+)")

for k, v in pattern.findall(str_test):
    print(k + re.sub(r"[^\d]", "", v))

Output:

NRC234456789
AZN123456789
BSA123456789
SSR789456123

Edit as per comment.

import re

str_test = 'This is a sample text NRC234456789 and this is another case AZN.1.Z.3.4.S.6.7.8.9 and this another case BSA 123 456 789 and final case SSR/789456123'
to_check = ['NRC', 'AZN', 'BSA', 'SSR']
pattern = re.compile(r"(" + "|".join(to_check) + r")([\d.\s/ZS]+)")

for k, v in pattern.findall(str_test):
    new_val = k + re.sub(r"[^\d]", "", v.replace("Z", "2").replace("S", "5"))
    print(new_val)
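
Note that the loop above keeps every digit in the matched run, so a longer run such as 'BSA 123 456 789 123 456' would give more than nine digits. Below is a minimal sketch of one way to cap it at nine, in the same style. The names extract_codes, FIX and n_digits are made up for illustration, and re.escape is only a precaution in case a prefix ever contains regex metacharacters.

```python
import re

to_check = ['NRC', 'AZN', 'BSA', 'SSR']
# Same idea as the pattern above; re.escape is an added precaution.
pattern = re.compile(r"(" + "|".join(map(re.escape, to_check)) + r")([\d.\s/ZS]+)")
# Translation table for the two misreadings named in the question.
FIX = str.maketrans({"Z": "2", "S": "5"})

def extract_codes(text, n_digits=9):
    """Return prefix + first n_digits digits for each match (hypothetical helper)."""
    codes = []
    for prefix, tail in pattern.findall(text):
        # Fix Z/S first, drop every non-digit, then keep at most n_digits digits.
        digits = re.sub(r"\D", "", tail.translate(FIX))[:n_digits]
        if len(digits) == n_digits:  # skip runs that are too short
            codes.append(prefix + digits)
    return codes
```

Calling extract_codes(str_test) on the string above should give the same four codes, and a run with extra digits is cut after the ninth.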

Here is a simple approach. First find the intended text using this regex,

\b(?:NRC|AZN|BSA|SSR)(?:.?\d)+

which is generated dynamically from the supplied list, and then remove any non-alphanumeric characters from the match.
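
As a rough sketch of that first version (before the edit that follows), the pattern can be built from the list and each match stripped of separators. The function name find_codes is hypothetical, and re.escape is an extra precaution that the regex shown above does not strictly need for these four prefixes.

```python
import re

list_comb = ['NRC', 'AZN', 'BSA', 'SSR']
# Build \b(?:NRC|AZN|BSA|SSR)(?:.?\d)+ from the list.
regex = r'\b(?:{})(?:.?\d)+'.format('|'.join(map(re.escape, list_comb)))

def find_codes(text):
    """Return each match with all non-alphanumeric characters removed."""
    return [re.sub(r'[^a-zA-Z0-9]+', '', m) for m in re.findall(regex, text)]
```

Because (?:.?\d)+ allows at most one separator before each digit and accepts digits only, this version stops at the first letter in the number, so 'AZN.1.Z.3' yields just 'AZN1'. That is the limitation the edit addresses.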

Edit: To handle erroneous strings where 2 is mistakenly written as Z and 5 as S, you can replace them in the second part of the string, leaving the initial three characters untouched. The code is also updated so that it picks only the next nine digits rather than more. Here is my updated Python code:

import re

s = 'This is a sample text NRC234456789 and this is another case AZN.1.Z.3.4.S.6.7.8.9 and this another case BSA 123 456 789 and BSA 123 456 789 123 456 final case SSR/789456123'

list_comb = ['NRC', 'AZN', 'BSA', 'SSR']
regex = r'\b(?:{})(?:.?[\dA-Z])+'.format('|'.join(list_comb))
print(regex)

for m in re.findall(regex, s):
    m = re.sub(r'[^a-zA-Z0-9]+', '', m)
    mat = re.search(r'^(.{3})(.{9})', m)
    if mat:
        s1 = mat.group(1)
        s2 = mat.group(2).replace('S', '5').replace('Z', '2')
        print(s1 + s2)

This prints the corrected values, with S replaced by 5 and Z by 2:

NRC234456789
AZN123456789
BSA123456789
BSA123456789
SSR789456123
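
One caveat with the regex above: because the class [\dA-Z] admits any capital letter, a code that is immediately followed by another prefix (for example 'BSA 123 456 789 SSR/789456123') is swallowed into a single match, and the second code is lost. A possible alternative, sketched here under that assumption, is to use the regex only to locate the prefix and then walk the following characters until nine digits are collected. The names next_nine, FIXES and DIGITS are hypothetical.

```python
import re
import string

FIXES = {"S": "5", "Z": "2"}  # misreadings named in the question
DIGITS = set(string.digits)

def next_nine(text, prefixes, n=9):
    """Return prefix + next n digits for each prefix found (hypothetical helper)."""
    finder = re.compile(r'\b(?:{})'.format('|'.join(map(re.escape, prefixes))))
    found = []
    for m in finder.finditer(text):
        digits = []
        for ch in text[m.end():]:
            ch = FIXES.get(ch, ch)       # treat S as 5 and Z as 2
            if ch in DIGITS:
                digits.append(ch)
                if len(digits) == n:
                    break                # stop as soon as n digits are collected
            elif ch.isalnum():
                break                    # any other letter ends the run
            # everything else (dot, space, slash, ...) is skipped
        if len(digits) == n:
            found.append(m.group() + ''.join(digits))
    return found
```

Since the scan stops after the ninth digit, the search for the next prefix resumes right after the previous prefix rather than after a long greedy match, so back-to-back codes are each reported.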


 