
How to search all the alphanumeric sequences in a document using Regex in Python?

I am stuck on a regex problem: I need to find every alphanumeric sequence in a document. A document can contain more than one such sequence. I am doing this in Python.

For example, suppose the document is "some blah blah blah with id X12354, id 1234Z and id 12P555. All are 50 years old."

The expected output is then:

X12354

1234Z

12P555

Summary: a match must contain both letters and digits; their order and the length of the string don't matter. Such a string can occur multiple times, anywhere in the document.

I've tried several ways to work out the regex, but it gets confusing every time. Thanks in advance.

You could match between word boundaries and use two positive lookaheads to assert that the token contains an uppercase character and a digit:

\b(?=[A-Z0-9]*[A-Z])(?=[A-Z0-9]*[0-9])[A-Z0-9]+\b

That would match:

  • \b Word boundary
  • (?= Positive lookahead that asserts what is on the right
    • [A-Z0-9]* Match zero or more uppercase characters or digits
    • [A-Z] Match an uppercase character
  • ) Close positive lookahead
  • (?= Positive lookahead that asserts what is on the right
    • [A-Z0-9]* Match zero or more uppercase characters or digits
    • [0-9] Match a digit
  • ) Close positive lookahead
  • [A-Z0-9]+ Match one or more uppercase characters or digits
  • \b Word boundary

So, in Python, that would be:

import re
s = "some blah blah blah with id X12354, id 1234Z and id 12P555. All are 50 years old."
# Find every run of uppercase letters and digits that contains at least one of each.
print(re.findall(r'\b(?=[A-Z0-9]*[A-Z])(?=[A-Z0-9]*[0-9])[A-Z0-9]+\b', s))

giving:

['X12354', '1234Z', '12P555']
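
As a reading aid, the same pattern can be written with re.VERBOSE so that each piece from the breakdown above sits next to its own comment. This is only a sketch; the name ALNUM_ID is made up for illustration and is not part of the original answer.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so each part can carry a comment.
# ALNUM_ID is a hypothetical name chosen for this example.
ALNUM_ID = re.compile(r"""
    \b                      # word boundary at the start of the token
    (?=[A-Z0-9]*[A-Z])      # lookahead: an uppercase letter occurs somewhere in the run
    (?=[A-Z0-9]*[0-9])      # lookahead: a digit occurs somewhere in the run
    [A-Z0-9]+               # consume the whole run of uppercase letters and digits
    \b                      # word boundary at the end of the token
""", re.VERBOSE)

s = "some blah blah blah with id X12354, id 1234Z and id 12P555. All are 50 years old."
print(ALNUM_ID.findall(s))  # ['X12354', '1234Z', '12P555']
```

Compiling the pattern once is also convenient if it has to be applied to many documents.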

This checks whether each small chunk of the string contains at least one letter and at least one digit.

import re
from string import punctuation
s = "some blah blah blah with id X12354, id 1234Z and id 12P555. All are 50 years old."
# Split on spaces and punctuation; re.escape keeps ']' and '\' literal inside the class.
chunks = re.split("[ " + re.escape(punctuation) + "]", s)
ans = [v for v in chunks
       if any(c.isdigit() for c in v) and any(c.isalpha() for c in v)]
print(ans)  # ['X12354', '1234Z', '12P555']

re.split("[ " + re.escape(punctuation) + "]", s) splits the string on spaces and on every punctuation character.
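
The split-and-filter idea can also be wrapped in a small helper, sketched below. The names SEPARATORS and mixed_tokens and the sample string are made up for illustration. Unlike the uppercase-only regex, str.isalpha also accepts lowercase and non-ASCII letters, so this version is more permissive.

```python
import re
from string import punctuation

# A space or any ASCII punctuation character; re.escape keeps characters
# such as ']' and '\' literal inside the character class.
SEPARATORS = re.compile("[ " + re.escape(punctuation) + "]")

def mixed_tokens(text):
    """Return the chunks of `text` that contain at least one letter and one digit."""
    return [
        chunk
        for chunk in SEPARATORS.split(text)
        if any(c.isdigit() for c in chunk) and any(c.isalpha() for c in chunk)
    ]

print(mixed_tokens("ids: X12354, 1234Z; plain 50 and abc"))  # ['X12354', '1234Z']
```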

Use re.findall to get all matches. Use two lookaheads: one to verify that the match contains a letter, and another to verify that it contains a digit.

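import re

document = "some blah blah blah with id X12354, id 1234Z and id 12P555. All are 50 years old."
# re.IGNORECASE lets [a-z] match uppercase letters as well.
matches = re.findall(r'(?=[a-z0-9]*[a-z])(?=[a-z0-9]*[0-9])[a-z0-9]+', document, re.IGNORECASE)
print(matches)  # ['X12354', '1234Z', '12P555']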
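
To see what re.IGNORECASE contributes here, the following sketch runs the same pattern with and without the flag. The sample string is made up for illustration and mixes lowercase and uppercase ids.

```python
import re

PATTERN = r'(?=[a-z0-9]*[a-z])(?=[a-z0-9]*[0-9])[a-z0-9]+'
sample = "ids: x12354, 1234z, 12P555, plus 50 and abc"  # made-up sample text

# With re.IGNORECASE, [a-z] also matches uppercase letters, so 12P555 is found.
print(re.findall(PATTERN, sample, re.IGNORECASE))  # ['x12354', '1234z', '12P555']

# Without the flag, 'P' is outside [a-z0-9], so 12P555 has no letter the
# first lookahead can reach and the token is skipped entirely.
print(re.findall(PATTERN, sample))  # ['x12354', '1234z']
```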

