
Extract exact words or set of characters using Regex in Python

Suppose I have a list like this:

List = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']

I want to search the list and return the items that contain 'PO'. I expect only RUC_PO-345 as output, but RUC_POLO-209 is returned along with it.
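The question does not show the code that produced both results. A minimal sketch of how this can happen, assuming a plain substring test was used (the names `items` and `hits` are illustrative, not from the question):

```python
# Hypothetical reproduction: a plain substring test cannot tell
# 'PO' apart from the 'PO' at the start of 'POLO'.
items = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']

hits = [item for item in items if 'PO' in item]
print(hits)  # ['RUC_PO-345', 'RUC_POLO-209']
```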

Before the question was updated:

As per my comment, I think you are using the wrong approach. It seems you can simply use the in operator:

words = ['cat', 'caterpillar', 'monkey', 'monk', 'doggy', 'doggo', 'dog']
if 'cat' in words:
    print("yes")
else:
    print("no")

Returns: yes

words = ['cats', 'caterpillar', 'monkey', 'monk', 'doggy', 'doggo', 'dog']
if 'cat' in words:
    print("yes")
else:
    print("no")

Returns: no


After the question was updated:

If your sample data does not reflect your actual needs and you want to find a substring within a list element, you could try:

import re
words = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']
srch = 'PO'
r = re.compile(fr'(?<=_){srch}(?=-)')
print(list(filter(r.findall, words)))

Or using match:

import re
words = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']
srch = 'PO'
r = re.compile(fr'^.*(?<=_){srch}(?=-).*$')
print(list(filter(r.match, words)))

This returns a list of the items that fit the pattern (in this case just ['RUC_PO-345']). I used the above pattern to make sure the search value is not at the start of the string, but comes directly after an underscore and is directly followed by a -.
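The lookbehind `(?<=_)` and lookahead `(?=-)` are zero-width: they check the neighbouring characters without consuming them. A small sketch showing what the match object contains (the names `r` and `m` are just for illustration):

```python
import re

r = re.compile(r'(?<=_)PO(?=-)')

m = r.search('RUC_PO-345')
print(m.group())  # PO  (the '_' and '-' are checked but not captured)
print(m.span())   # (4, 6)

no_match = r.search('RUC_POLO-209')
print(no_match)   # None, because 'PO' is followed by 'L', not '-'
```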


Now if you have a list of products you want to find, consider the below:

import re
words = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']
srch = ['PO', 'QW']
r = re.compile(fr'(?<=_)({"|".join(srch)})(?=-)')
print(list(filter(r.findall, words)))

Or again using match:

import re
words = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']
srch = ['PO', 'QW']
r = re.compile(fr'^.*(?<=_)({"|".join(srch)})(?=-).*$')
print(list(filter(r.match, words)))

Both would return: ['MX_QW-765', 'RUC_PO-345']

Note that if f-strings are not supported in your Python version (they require Python 3.6 or later), you can also concatenate your variable into the pattern.
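A minimal sketch of the same pattern built with plain string concatenation. The call to `re.escape` is an addition that is not part of the answer above; it is assumed to be useful when a search term might contain regex metacharacters.

```python
import re

words = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']
srch = 'PO'

# Same lookbehind/lookahead pattern, built without an f-string.
# re.escape (an assumption, not in the original answer) neutralises
# any regex metacharacters in the search term.
pattern = re.compile(r'(?<=_)' + re.escape(srch) + r'(?=-)')

result = [w for w in words if pattern.search(w)]
print(result)  # ['RUC_PO-345']
```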

Try building a regex alternation using the search terms in the list:

import re

words = ['cat', 'caterpillar', 'monkey', 'monk', 'doggy', 'doggo', 'dog']
your_text = 'I like cat, dog, rabbit, antelope, and monkey, but not giraffes'
# Join the search terms into one alternation wrapped in word boundaries
regex = r'\b(?:' + '|'.join(words) + r')\b'
print(regex)
matches = re.findall(regex, your_text)
print(matches)

This prints:

\b(?:cat|caterpillar|monkey|monk|doggy|doggo|dog)\b
['cat', 'dog', 'monkey']

You can clearly see the regex alternation which we built to find all matching keywords.

The pattern:

'_PO[^\w]'

should work with a re.search() or re.findall() call; it will not work with re.match(), because re.match() only matches at the beginning of the string and these strings do not start with _PO.
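To illustrate the difference between the two calls, a short sketch using one string from the question (the variable names are illustrative):

```python
import re

s = 'RUC_PO-345'
pattern = r'_PO[^\w]'

found_by_search = re.search(pattern, s)  # scans the whole string
found_by_match = re.match(pattern, s)    # only tries position 0

print(found_by_search.group())  # _PO-
print(found_by_match)           # None
```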

The pattern reads: match one underscore ('_'), followed by one capital P, followed by one capital O, followed by one character that is not a word character. The special sequence '\w' matches [a-zA-Z0-9_] (for str patterns in Python 3 it also matches other Unicode word characters).

'_PO\W'

This can also be used as a shorter version of the first pattern (credit to @JvdV in the comments).

'_PO[^A-Za-z]'

This pattern uses a negated character set that excludes letters. It is an alternative in case the dash interferes with either of the first two patterns.

To use this to identify the pattern in a list, you can use a loop:

import re

my_list = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']

for thing in my_list:
    # re.search returns None when the pattern is not found
    if re.search(r'_PO[^\w]', thing) is not None:
        # do something
        print(thing)

This uses the re.search call as the condition of the if statement. When the pattern does not match a string, re.search returns None; hence the syntax if re.search(...) is not None.
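As a quick check, the three patterns suggested in this answer can be run against the sample list from the question. This is only a sketch; the dictionary name `results` is illustrative.

```python
import re

my_list = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']
patterns = [r'_PO[^\w]', r'_PO\W', r'_PO[^A-Za-z]']

# Map each pattern to the list items it finds.
results = {p: [s for s in my_list if re.search(p, s)] for p in patterns}

for p, matched in results.items():
    print(p, matched)  # each pattern yields ['RUC_PO-345']
```

All three patterns require one more character after PO, so a string that ends in _PO would not match any of them.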

Hope it helps!

You need to add a $ sign, which signifies the end of the string. You can also add a ^, which signifies the start of the string, so that only cat matches:

 ^cat$
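A minimal sketch of the anchored pattern in use, assuming the word list that appears elsewhere on this page. `re.fullmatch` is shown as an alternative that expresses a similar intent without explicit anchors; it is not part of the answer above.

```python
import re

words = ['cat', 'caterpillar', 'monkey', 'monk', 'doggy', 'doggo', 'dog']

# Anchors force the pattern to span the whole string.
anchored = [w for w in words if re.search(r'^cat$', w)]
print(anchored)  # ['cat']

# re.fullmatch only succeeds if the entire string matches the pattern.
full = [w for w in words if re.fullmatch(r'cat', w)]
print(full)  # ['cat']
```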

We can try matching one of the three exact words 'cat', 'dog', 'monk' in our regex string.

Our regex string is going to be "\b(?:cat|dog|monk)\b"

\b is used to define a word boundary. We use \b so that we can search for whole words (this is the exact problem you were facing). With it, the pattern does not match tomcat or caterpillar, only cat.
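A short sketch of the effect of \b, using a made-up sample string (`text` is illustrative):

```python
import re

text = 'cat tomcat caterpillar'

with_boundaries = re.findall(r'\bcat\b', text)
without_boundaries = re.findall(r'cat', text)

print(with_boundaries)     # ['cat']
print(without_boundaries)  # ['cat', 'cat', 'cat']
```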

Next, (?:) is called a non-capturing group (explained here).

Now we need to match any one of cat, dog, or monk. This is expressed as cat|dog|monk. In Python 3 this would be:

import re

words = ['cat', 'caterpillar', 'monkey', 'monk', 'doggy', 'doggo', 'dog']
regex = r"\b(?:cat|dog|monk)\b"

r=re.compile(regex)
matched = list(filter(r.match, words))

print(matched)

To apply the regex across an iterable such as a list, we use the filter function, as mentioned in a Stack Overflow answer here.

You can find the runnable Python code here.

NOTE: Finally, regex101 is a great online tool to try out different regex strings and get their explanation in real time. The explanation for our regex string is here.

You should be using a regular expression (import re), and this is the regular expression you should be using: r'(?<![A-Za-z0-9])PO(?![A-Za-z0-9])'.

I previously recommended the \b special sequence, but it turns out that '_' is considered part of a word, which is not what you want here, so it wouldn't work.
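A small sketch showing why \b fails on this data (the list is the one from the question; `with_boundary` is an illustrative name):

```python
import re

words = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']

# '_' is a word character, so there is no boundary between '_' and 'P'.
with_boundary = [w for w in words if re.search(r'\bPO\b', w)]
print(with_boundary)  # []
```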

This leaves you with the somewhat more complex negative lookbehind and negative lookahead assertions, which is what (?<! ... ) and (?! ... ) are, respectively. To understand how those work, read the documentation for Python regular expressions.
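A minimal sketch of the lookaround approach, assuming the intended pattern is r'(?<![A-Za-z0-9])PO(?![A-Za-z0-9])' (a negative lookbehind followed by a negative lookahead) and using the list from the question:

```python
import re

words = ['MX_QW-765', 'RUC_PO-345', 'RUC_POLO-209']

# PO must be neither preceded nor followed by an ASCII letter or digit.
pattern = re.compile(r'(?<![A-Za-z0-9])PO(?![A-Za-z0-9])')

result = [w for w in words if pattern.search(w)]
print(result)  # ['RUC_PO-345']
```

Because both assertions are negative, this pattern also matches when PO sits at the very start or end of a string, where there is no neighbouring character to check.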

