
Extract exact words or set of characters using Regex in Python

Suppose I have a list like this.

List = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']. 

I want to search the list and return only the element that contains 'PO'. I expect RUC_PO-345 as my output, but RUC_POLO-209 is returned along with RUC_PO-345.

Before updated question:

As per my comment, I think you are using the wrong approach. It seems to me you can simply use the in operator:

words = ['cat', 'caterpillar', 'monkey', 'monk', 'doggy', 'doggo', 'dog']
if 'cat' in words:
    print("yes")
else:
    print("no")

Returns: yes

words = ['cats', 'caterpillar', 'monkey', 'monk', 'doggy', 'doggo', 'dog']
if 'cat' in words:
    print("yes")
else:
    print("no")

Returns: no
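The two results differ because in, applied to a list, checks whether some element is exactly equal to 'cat'; it does not look inside the elements. A minimal sketch of the difference (the variable names exact and partial are mine, for illustration only):

```python
words = ['cats', 'caterpillar', 'monkey']

# List membership: is any element exactly equal to 'cat'?
exact = 'cat' in words

# Substring test: does any element contain 'cat' somewhere inside it?
partial = any('cat' in w for w in words)

print(exact, partial)
```

So in on the list is the right tool for whole-element matches, while a per-element test is needed for substrings.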


After updated question:

Now, if your sample data does not actually reflect your needs and you want to find a substring within a list element, you could try:

import re
words = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']
srch = 'PO'
r = re.compile(fr'(?<=_){srch}(?=-)')
print(list(filter(r.findall, words)))

Or using match :

import re
words = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']
srch = 'PO'
r = re.compile(fr'^.*(?<=_){srch}(?=-).*$')
print(list(filter(r.match, words)))

This returns a list of the items that follow the pattern (in this case just ['RUC_PO-345']). The pattern makes sure the search value is not at the start of the string being searched, but comes directly after an underscore and is directly followed by a hyphen (-).
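To show what the lookbehind (?<=_) and lookahead (?=-) actually do, here is a small sketch. Both are zero-width, so they check the neighbouring characters without including them in the match. The re.escape call is my addition, on the assumption that a search value could contain regex metacharacters:

```python
import re

srch = 'PO'
# re.escape is an added safeguard, not part of the pattern above
r = re.compile(fr'(?<=_){re.escape(srch)}(?=-)')

m = r.search('RUC_PO-345')
matched_text = m.group(0)   # only the search value, without '_' or '-'
matched_span = m.span()     # start and end index of the match

# 'PO' is followed by 'L' here, not '-', so the lookahead fails
no_match = r.search('RUC_POLO-209')

print(matched_text, matched_span, no_match)
```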


Now if you have a list of products you want to find, consider the below:

import re
words = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']
srch = ['PO', 'QW']
r = re.compile(fr'(?<=_)({"|".join(srch)})(?=-)')
print(list(filter(r.findall, words)))

Or again using match :

import re
words = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']
srch = ['PO', 'QW']
r = re.compile(fr'^.*(?<=_)({"|".join(srch)})(?=-).*$')
print(list(filter(r.match, words)))

Both would return: ['MX_QW-765', 'RUC_PO-345']

Note that if f-strings are not supported in your Python version (they require Python 3.6 or later), you can concatenate your variable into the pattern instead.
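A sketch of the concatenation variant, for both a single search value and a list of them. Again, re.escape is my own addition rather than part of the patterns above:

```python
import re

words = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']

# Single search value, joined into the pattern with +
srch = 'PO'
r_single = re.compile(r'(?<=_)' + re.escape(srch) + r'(?=-)')
single_hits = [w for w in words if r_single.search(w)]

# Several search values, joined into an alternation
srch_list = ['PO', 'QW']
alternation = '|'.join(re.escape(s) for s in srch_list)
r_multi = re.compile(r'(?<=_)(?:' + alternation + r')(?=-)')
multi_hits = [w for w in words if r_multi.search(w)]

print(single_hits)
print(multi_hits)
```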

Try building a regex alternation using the search terms in the list:

import re

words = ['cat', 'caterpillar', 'monkey', 'monk', 'doggy', 'doggo', 'dog']
your_text = 'I like cat, dog, rabbit, antelope, and monkey, but not giraffes'
regex = r'\b(?:' + '|'.join(words) + r')\b'
print(regex)
matches = re.findall(regex, your_text)
print(matches)

This prints:

\b(?:cat|caterpillar|monkey|monk|doggy|doggo|dog)\b
['cat', 'dog', 'monkey']

You can clearly see the regex alternation which we built to find all matching keywords.

The pattern:

'_PO[^\w]'

should work with a re.search() or re.findall() call. It will not work with re.match(), because re.match() only matches at the beginning of the string and the pattern does not account for the characters that come before the underscore.

The pattern reads: match 1 underscore ('_') followed by 1 capital P ('P') followed by 1 capital O ('O') followed by one character that is not a word character . The special character '\w' matches [a-zA-Z0-9_] .

'_PO\W'

^ This can be used as a shorter equivalent of the first pattern (credit @JvdV in comments).

'_PO[^A-Za-z]'

This pattern uses a negated character set, "any character that is not a letter", and can be used in the event the dash interferes with either of the first two patterns.
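As a quick check, the three patterns can be run against the list from the question. This is only a sketch; the string 'RUC_PO' in the last step is a made-up value to illustrate one limitation, namely that all three patterns need some character after 'PO':

```python
import re

words = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']
patterns = [r'_PO[^\w]', r'_PO\W', r'_PO[^A-Za-z]']

# For each pattern, keep the list elements in which re.search finds a match.
results = {p: [w for w in words if re.search(p, w)] for p in patterns}
for p, hits in results.items():
    print(p, hits)

# Hypothetical edge case: nothing follows 'PO', so there is no character
# for the trailing class to consume and the search fails.
at_end = re.search(r'_PO\W', 'RUC_PO')
print(at_end)
```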

To use this to identify the pattern in a list, you can use a loop:

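import re

my_list = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']

for thing in my_list:
    # re.search returns None when the pattern is not found anywhere in the string
    if re.search(r'_PO[^\w]', thing) is not None:
        # do something with the matching element
        print(thing)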

This uses the re.search call as the condition of the if statement. When re.search finds no match in a string, it returns None; hence the check if re.search(...) is not None.

Hope it helps!

You need to add a $ sign, which signifies the end of a string. You can also add a ^, which signifies the start of a string, so that only cat matches:

 ^cat$
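A short sketch of the anchored pattern applied to a list ('tomcat' and 'cats' are extra values I added for illustration). One detail worth knowing: $ also matches just before a newline at the very end of the string, so re.fullmatch is the stricter option if that matters:

```python
import re

words = ['cat', 'caterpillar', 'tomcat', 'cats']

# ^ and $ force the whole string to be exactly 'cat'
r = re.compile(r'^cat$')
anchored = [w for w in words if r.search(w)]

# re.fullmatch requires the pattern to cover the entire string
full = [w for w in words if re.fullmatch(r'cat', w)]

# '$' tolerates one trailing newline, re.fullmatch does not
dollar_with_newline = r.search('cat\n') is not None
fullmatch_with_newline = re.fullmatch(r'cat', 'cat\n') is not None

print(anchored, full, dollar_with_newline, fullmatch_with_newline)
```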

We can try matching one of the three exact words 'cat','dog','monk' in our regex string.

Our regex string is going to be "\b(?:cat|dog|monk)\b"

\b matches a word boundary. We use \b so that we search for whole words only (this is the exact problem you were facing). With it, the pattern matches cat but not tomcat or caterpillar.

Next, (?:...) is called a non-capturing group: it groups the alternatives without storing the matched text as a separate group.

Now we need to match either one of cat or dog or monk . So this is expressed as cat|dog|monk . In python 3 this would be:

import re

words = ['cat', 'caterpillar', 'monkey', 'monk', 'doggy', 'doggo', 'dog']
regex = r"\b(?:cat|dog|monk)\b"

r=re.compile(regex)
matched = list(filter(r.match, words))

print(matched)

To apply the regex to every element of the list, we use the built-in filter function, as suggested in another Stack Overflow answer.

You can find the runnable Python code here

NOTE: Finally, regex101 is a great online tool to try out different regex strings and get their explanation in real-time. The explanation for our regex string is here

You should be using a regular expression (import re), and this is the regular expression you should be using: r'(?<![A-Za-z0-9])PO(?![A-Za-z0-9])'.

I previously recommended the \b special sequence, but it turns out that \b treats '_' as part of a word. In your data the underscore is a separator, so \b wouldn't work.

This leaves you with the somewhat more complex negative lookbehind and negative lookahead assertions, which is what (?<! ... ) and (?! ... ) are, respectively. To understand how those work, read the documentation for Python regular expressions.
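To see the difference on the list from the question, here is a small sketch that compares the lookaround pattern with a \b-based one (the variable names are mine):

```python
import re

words = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']

# 'PO' must not be preceded or followed by a letter or digit;
# an underscore or hyphen on either side is allowed.
r = re.compile(r'(?<![A-Za-z0-9])PO(?![A-Za-z0-9])')
lookaround_hits = [w for w in words if r.search(w)]

# For comparison: \b sees no boundary between '_' and 'P',
# because '_' counts as a word character.
boundary_hits = [w for w in words if re.search(r'\bPO\b', w)]

print(lookaround_hits)
print(boundary_hits)
```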
