
python regex - optional match

I have a bunch of strings that come in this flavor:

#q1_a1
#q7

Basically, # is a sign that has to be ignored. After the #, there comes a single letter plus some number. Optionally, another letter + number combination can follow after an _ (underscore).

Here's what I came up with:

>>> import re
>>> pat = re.compile(r"#(.*)_?(.+)?")
>>> pat.match('#q1').groups()
('q1', None)

The problem is strings in the #q1_a1 format. When I apply my pattern to such strings:

>>> pat.findall('#q1_f1')
[('q1_f1', '')]

Any suggestions?

As the others have said, the more specific your regex, the less likely it is to match something it shouldn't:

In [13]: re.match(r'#([A-Za-z][0-9])(?:_([A-Za-z][0-9]))?', '#q1_a1').groups()
Out[13]: ('q1', 'a1')

In [14]: re.match(r'#([A-Za-z][0-9])(?:_([A-Za-z][0-9]))?', '#q1').groups()
Out[14]: ('q1', None)

Notes:

  1. If you need to match only the entire string, surround the regex with ^ and $ .
  2. You say "some number" but your example only contains a single digit. If your regex needs to accept more than one digit, change the [0-9] to [0-9]+ .
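
The two notes above can be combined in a short sketch. It assumes Python 3.4 or later, where `re.fullmatch` has the same effect as wrapping the pattern in ^ and $ ; the names `LABEL` and `parse` are made up for illustration and are not from the question.

```python
import re

# One letter followed by one or more digits, optionally followed by
# "_" and a second letter+digits token. [0-9]+ accepts multi-digit numbers.
LABEL = re.compile(r'#([A-Za-z][0-9]+)(?:_([A-Za-z][0-9]+))?')

def parse(s):
    """Return (first, second) if the whole string matches, else None."""
    m = LABEL.fullmatch(s)  # fullmatch requires the entire string to match
    return m.groups() if m else None

print(parse('#q1'))        # ('q1', None)
print(parse('#q12_a345'))  # ('q12', 'a345')
print(parse('#q1_a1xyz'))  # None, because of the trailing "xyz"
```

With plain `re.match` and no anchors, the last input would still match on its `#q1_a1` prefix, which is why the anchoring matters when the whole string must be valid.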

Your ".*" also matches the underscore, because the match is greedy. It is better to create a more specific regex that excludes the underscore from the first group.

A proper regex could look like this:

#([a-z][0-9])_?([a-z][0-9])?

but you need to check whether it works for all the data you expect.

P.S. Being more specific in regular expressions is better, as you get fewer false positives.
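
One thing that may be worth checking with this pattern: `_?` and the second group are optional independently of each other, so inputs such as #q1a1 or #q1_ are accepted as well. The sketch below compares it against a variant that ties the underscore to the second group with a non-capturing group. The names `loose`, `strict` and `groups_or_none` are illustrative, and `re.fullmatch` assumes Python 3.4 or later.

```python
import re

# Underscore and second group are each optional on their own.
loose = re.compile(r'#([a-z][0-9])_?([a-z][0-9])?')
# Underscore and second group are optional only as a unit.
strict = re.compile(r'#([a-z][0-9])(?:_([a-z][0-9]))?')

def groups_or_none(pattern, s):
    """Return the captured groups if the whole string matches, else None."""
    m = pattern.fullmatch(s)
    return m.groups() if m else None

for s in ['#q1', '#q1_a1', '#q1a1', '#q1_']:
    print(s, groups_or_none(loose, s), groups_or_none(strict, s))
# #q1     ('q1', None)  ('q1', None)
# #q1_a1  ('q1', 'a1')  ('q1', 'a1')
# #q1a1   ('q1', 'a1')  None
# #q1_    ('q1', None)  None
```

Whether the looser behavior is acceptable depends on the data; if a missing or dangling underscore should be rejected, the grouped form is the safer choice.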

When you use .* , it matches greedily, consuming as many characters as possible. Try:

>>> pat = re.compile(r"#([^_]*)_?(.+)?")
>>> pat.findall('#q1_f1')
[('q1', 'f1')]

It's also better to write a more specific expression:

#([a-z][0-9])(?:_([a-z][0-9]))?

A simple alternative without using regex:

s = '#q7'
print(s[1:].split('_'))  # drop the leading '#', then split on '_'
# ['q7']

s = '#q1_a1'
print(s[1:].split('_'))
# ['q1', 'a1']

This assumes all of your strings start with # . If that's not the case, you could easily add some validation:

s = '#q1_a1'
if s.startswith('#'):
    print(s[1:].split('_'))
# ['q1', 'a1']

s = 'q1_a1'
if s.startswith('#'):
    print(s[1:].split('_'))  # Nothing is printed
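
The same idea can be wrapped in a small function so that the check and the split live in one place. This is only a sketch; `split_label` is a hypothetical name, and returning None for a missing # is one possible convention among several.

```python
def split_label(s):
    """Hypothetical helper: return the '_'-separated parts after '#',
    or None when the string does not start with '#'."""
    if not s.startswith('#'):
        return None
    return s[1:].split('_')

print(split_label('#q7'))     # ['q7']
print(split_label('#q1_a1'))  # ['q1', 'a1']
print(split_label('q1_a1'))   # None
```

Note that this approach does not verify the letter + number shape of each part: `split_label('#_')` returns `['', '']`. If that matters, a regex is still needed.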


 