
Regex for matching various forms of strings

Let's say the input string is

s_in = 'auto encoder'

and the list of strings is

l_s = ['autoencoder', 'auto-encoder', 'auto', 'one']

My goal is to match s_in against its possible forms in l_s, so that I get back all matching strings from the list.

In the example above the output must be ['autoencoder', 'auto-encoder']

Another example:

s_in = 'autoencoder'    
l_s = ['auto-encoder', 'auto encoder', 'auto', 'one']

Output: ['auto-encoder', 'auto encoder']

Or

s_in = 'auto-encoder'    
l_s = ['autoencoder', 'auto encoder', 'auto', 'one']

Output: ['autoencoder', 'auto encoder']

The regex I constructed looks like this:

re.match(r'^[a-zA-Z]+(?:(?:\s[a-zA-Z]+)+|(?:\-[a-zA-Z]+)|(?:[a-zA-Z]+))$', s)

It works if I just iterate over the list items, but not when I try to combine the input string with the list of strings.
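
A quick check suggests why the pattern alone is not enough: it only describes the shape of a single string and never refers to s_in, so it cannot tell which items are variants of the input. The sketch below runs the pattern from the question, unchanged, over the first example list; the name shape_matches is made up for illustration.

```python
import re

# The pattern from the question, copied unchanged.
pattern = re.compile(r'^[a-zA-Z]+(?:(?:\s[a-zA-Z]+)+|(?:\-[a-zA-Z]+)|(?:[a-zA-Z]+))$')

l_s = ['autoencoder', 'auto-encoder', 'auto', 'one']

# Every item fits the "letters, optionally joined by a space or a hyphen" shape,
# so the unrelated words 'auto' and 'one' are accepted as well.
shape_matches = [x for x in l_s if pattern.match(x)]
print(shape_matches)
# => ['autoencoder', 'auto-encoder', 'auto', 'one']
```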

You can compare the strings after removing all non-alphanumeric characters, for example with the [\W_]+ pattern:

import re
s_in = 'auto encoder'
l_s = ['autoencoder', 'auto-encoder', 'auto', 'one']

rx = re.compile(r'[\W_]+')  # Define the regex for non-alnum chars
s_check = rx.sub('', s_in)  # Input string without non-alnum chars
print( [x for x in l_s if s_check == rx.sub('', x)] ) # Print if equal after removing all non-alnum chars
# => ['autoencoder', 'auto-encoder']

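
The same idea can be wrapped in a small helper and run against all three examples from the question. This is only a sketch: the names normalize and find_variants are hypothetical, and it assumes that case matters and that any run of non-alphanumeric characters or underscores counts as a separator.

```python
import re

# Runs of non-alphanumeric characters (and underscores) are treated as separators.
_SEPARATORS = re.compile(r'[\W_]+')

def normalize(s):
    """Return s with all separator characters removed (hypothetical helper)."""
    return _SEPARATORS.sub('', s)

def find_variants(s_in, candidates):
    """Return the candidates that equal s_in once separators are stripped."""
    key = normalize(s_in)
    return [c for c in candidates if normalize(c) == key]

print(find_variants('auto encoder', ['autoencoder', 'auto-encoder', 'auto', 'one']))
# => ['autoencoder', 'auto-encoder']
print(find_variants('autoencoder', ['auto-encoder', 'auto encoder', 'auto', 'one']))
# => ['auto-encoder', 'auto encoder']
print(find_variants('auto-encoder', ['autoencoder', 'auto encoder', 'auto', 'one']))
# => ['autoencoder', 'auto encoder']
```

Note that if the list contains the input string itself, it is returned too, since it is trivially equal to itself after normalization. If case should be ignored, comparing normalize(c).casefold() against key.casefold() would be one way to do it.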
