
Python regex match string of 8 characters that contain both alphabets and numbers

I am trying to match a string of length 8 that contains both digits and letters (it cannot consist of only digits or only letters) using re.findall. The string can start with either a letter or a digit, followed by any combination.

For example:

Input String: The reference number is 896av6uf and not 87987647 or ahduhsjs or hn0.

Output: ['896av6uf','a96bv6u0']

I came up with this regex r'([a-z]+[\d]+[\w]*|[\d]+[a-z]+[\w]*)', however it also returns strings with fewer than 8 characters. I need to modify the regex so that it returns only strings of exactly 8 characters that contain both letters and digits.

You can use

\b(?=[a-zA-Z]*[0-9])(?=[0-9]*[a-zA-Z])[a-zA-Z0-9]{8}\b
\b(?=[^\W\d_]*\d)(?=\d*[^\W\d_])[^\W_]{8}\b

The first one only supports ASCII letters, while the second one supports all Unicode letters and digits, since [^\W\d_] matches any Unicode letter and \d matches any Unicode digit (the re.UNICODE behaviour is the default for str patterns in Python 3.x).

Details:

  • \b - a word boundary
  • (?=[a-zA-Z]*[0-9]) - after any 0+ ASCII letters, there must be a digit
  • (?=[0-9]*[a-zA-Z]) - after any 0+ digits, there must be an ASCII letter
  • [a-zA-Z0-9]{8} - eight ASCII alphanumeric chars
  • \b - a word boundary
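As a quick check, here is a minimal sketch that runs both patterns through re.findall on the sample sentence from the question. The constant names and the extra `straße12` input are illustrative assumptions, not part of the original answer.

```python
import re

# ASCII-only pattern and its Unicode-aware counterpart from the answer above.
ASCII_PATTERN = re.compile(r'\b(?=[a-zA-Z]*[0-9])(?=[0-9]*[a-zA-Z])[a-zA-Z0-9]{8}\b')
UNICODE_PATTERN = re.compile(r'\b(?=[^\W\d_]*\d)(?=\d*[^\W\d_])[^\W_]{8}\b')

text = "The reference number is 896av6uf and not 87987647 or ahduhsjs or hn0."

ascii_matches = ASCII_PATTERN.findall(text)
unicode_matches = UNICODE_PATTERN.findall(text)
print(ascii_matches)    # ['896av6uf']
print(unicode_matches)  # ['896av6uf']

# Illustrative (assumed) input containing a non-ASCII letter.
print(ASCII_PATTERN.findall("straße12"))    # []
print(UNICODE_PATTERN.findall("straße12"))  # ['straße12']
```

The all-digit word, the all-letter word, and the short word hn0 are rejected by both patterns; only the Unicode-aware one accepts a word containing ß.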

You can use \b\w{8}\b

It does not guarantee that you will have both digits AND letters, but does guarantee that you will have exactly eight characters, surrounded by word boundaries (eg whitespace, start/end of line).
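To illustrate, a minimal sketch on the sample sentence from the question (the variable names are assumptions) shows that this pattern also returns the all-digit and all-letter words:

```python
import re

text = "The reference number is 896av6uf and not 87987647 or ahduhsjs or hn0."

# Every run of exactly eight word characters, regardless of its composition.
eight_char_words = re.findall(r'\b\w{8}\b', text)
print(eight_char_words)  # ['896av6uf', '87987647', 'ahduhsjs']
```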

You can try it in one of the online playgrounds such as this one: https://regex101.com/


The meat of the matching is done by \w{8}, which means eight word characters (letters of either case, digits, and the underscore). \b means "word boundary".

If you want only digits and lowercase letters, replace this with \b[a-z0-9]{8}\b

You can then check that both a digit AND a letter are present, e.g. by using filter:

list(filter(lambda s: re.search(r'[0-9]', s) and re.search(r'[a-z]', s), result))

Here result is what you get from re.findall().

So bottom line, I would use:

list(filter(lambda s: re.search(r'[0-9]', s) and re.search(r'[a-z]', s), re.findall(r'\b[a-z0-9]{8}\b', text)))

where text is the input string (avoid naming it str, which shadows the built-in type).
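The two-step approach can also be wrapped in a small helper. This is only a sketch: the function name `find_mixed_codes` and the variable names are hypothetical, chosen for illustration.

```python
import re

def find_mixed_codes(text):
    """Return 8-character lowercase/digit words containing both a letter and a digit."""
    # Step 1: collect every whole word made of exactly eight lowercase letters or digits.
    candidates = re.findall(r'\b[a-z0-9]{8}\b', text)
    # Step 2: keep only the candidates that have at least one digit and one letter.
    return [
        word for word in candidates
        if re.search(r'[0-9]', word) and re.search(r'[a-z]', word)
    ]

sample = "The reference number is 896av6uf and not 87987647 or ahduhsjs or hn0."
print(find_mixed_codes(sample))  # ['896av6uf']
```

Splitting the work this way keeps each regex trivial to read, at the cost of a second pass over the candidates.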

First, let's write a pattern that finds words of exactly 8 characters made up of lowercase letters and digits:

\b[a-z\d]{8}\b

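The next condition is that the word must contain both letters and digits. A first attempt is to require a letter directly followed by a digit:

[a-z]\d

Now for the challenging part: combining these into one pattern. The easiest way might be to just split them up, but we can use look-aheads to get this to work. A plain (?=.*[a-z]\d) is not enough, because .* can scan past the end of the word, and it also misses words such as 1234abcd in which no letter is followed by a digit. Restricting each look-ahead to the word's own characters avoids both problems:

\b(?=[a-z]*\d)(?=\d*[a-z])[a-z\d]{8}\b

I'm sure there's a tidier way of doing this, but this will work.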
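One thing worth checking with look-aheads is how far they can see. The sketch below compares a look-ahead written with `.*` against a variant whose look-aheads only consume letters and digits; the names `loose` and `strict` and the `1234abcd` input are illustrative assumptions.

```python
import re

text = "The reference number is 896av6uf and not 87987647 or ahduhsjs or hn0."

# Look-ahead with `.*`: it can be satisfied by "n0" in the later word "hn0".
loose = re.compile(r'\b(?=.*[a-z]\d)[a-z\d]{8}\b')
# Look-aheads limited to letters/digits: they cannot scan past the current word.
strict = re.compile(r'\b(?=[a-z]*\d)(?=\d*[a-z])[a-z\d]{8}\b')

print(loose.findall(text))   # ['896av6uf', '87987647', 'ahduhsjs']
print(strict.findall(text))  # ['896av6uf']

# A word where the digits come first has no letter directly followed by a digit.
print(loose.findall("1234abcd"))   # []
print(strict.findall("1234abcd"))  # ['1234abcd']
```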

A more compact solution than others have suggested is this:

((?![A-Za-z]{8}|[0-9]{8})[0-9A-Za-z]{8})

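This guarantees that each match is exactly 8 characters long and cannot consist of only digits or only letters. Note that the pattern has no \b anchors, so it also picks 8-character chunks out of longer alphanumeric runs, as the test case shows.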

Breakdown:

  • (?![A-Za-z]{8}|[0-9]{8}) = A negative lookahead: the next 8 characters must not be all letters or all digits.
  • [0-9A-Za-z]{8} = Matches exactly 8 alphanumeric characters.

Test Case:

Input: 12345678 abcdefgh i8D0jT5Yu6Ms1GNmrmaUjicc1s9D93aQBj3WWWjww54gkiKqOd7Ytkl0MliJy9xadAgcev8b2UKdfGRDOpxRPm30dw9GeEz3WPRO 1234567890987654321 qwertyuiopasdfghjklzxcvbnm

import re

pattern = re.compile(r'((?![A-Za-z]{8}|\d{8})[A-Za-z\d]{8})')

test = input()
match = pattern.findall(test)
print(match)

Output: ['i8D0jT5Y', 'u6Ms1GNm', 'maUjicc1', 's9D93aQB', 'j3WWWjww', '54gkiKqO', 'd7Ytkl0M', 'liJy9xad', 'Agcev8b2', 'DOpxRPm3', '0dw9GeEz']
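If only whole 8-character words are wanted, as in the question, the same negative lookahead can be wrapped in word boundaries. The sketch below is an assumed variant, not part of the original answer; the names `unanchored` and `anchored` and the `abc12345xyz` input are illustrative.

```python
import re

# Pattern from the answer above, without word boundaries.
unanchored = re.compile(r'(?![A-Za-z]{8}|[0-9]{8})[0-9A-Za-z]{8}')
# Assumed variant: the same pattern wrapped in \b so only whole words match.
anchored = re.compile(r'\b(?![A-Za-z]{8}|[0-9]{8})[0-9A-Za-z]{8}\b')

sample = "The reference number is 896av6uf and not 87987647 or ahduhsjs or hn0."
print(unanchored.findall(sample))  # ['896av6uf']
print(anchored.findall(sample))    # ['896av6uf']

# An 11-character run: the unanchored pattern extracts its first 8 characters.
print(unanchored.findall("abc12345xyz"))  # ['abc12345']
print(anchored.findall("abc12345xyz"))    # []
```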
