Regex | Empty String match | Python 3.4.0

My Code:

import re

#Phone Number regex
phoneRegex = re.compile(r'''(
(\d{3}|\(\d{3}\))?
(\s|-|\.)?                       # separator
(\d{3})                          # first 3 digits
(\s|-|\.)                        # separator
(\d{4})                          # last 4 digits
(\s*(ext|x|ext.)\s*(\d{2,5}))?   # extension
)''', re.VERBOSE)

phoneRegex.findall('Phone: 800.420.7240 or +1 415.863.9900 (9 a.m. to 5 p.m., M-F, PST)')

Output:

[('800.420.7240', '800', '.', '420', '.', '7240', '', '', ''), ('415.863.9900', '415', '.', '863', '.', '9900', '', '', '')]

Questions:

  1. Why are empty strings included in the match?
  2. The empty strings are matched from what positions of the string?
  3. What are the conditions for empty strings to be matched?

P.S. The empty strings are not included in the match when I use the same regex on https://regex101.com/
Also, I just started learning regex a few days ago, so I'm sorry if my questions aren't good enough.

The ? quantifier means the preceding item matches zero or one time. In this case, you've made some capturing groups optional with ?, and Python returns an empty string for each capturing group that did not take part in the match.

If you remove the first two ? you'll eliminate some zero-length matches. To deal with the last one, you need to change the extension pattern: its groups come back empty whenever the number has no extension, because the whole extension group is optional. Note that the * inside it means zero or more, not zero or one.
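To see where the empty strings come from, here is a small sketch using a made-up two-group pattern (the name toy and the pattern are illustrative, not the phone regex). A match object reports a group that did not take part in the match as None, while re.findall substitutes an empty string for it.

```python
import re

# Hypothetical toy pattern: group 1 is optional, group 2 is required.
toy = re.compile(r'(a)?(b)')

# A match object reports a group that did not participate as None.
match_groups = toy.search('b').groups()

# findall reports the same non-participating group as an empty string.
findall_result = toy.findall('b')

print(match_groups)    # (None, 'b')
print(findall_result)  # [('', 'b')]
```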

If you don't care about the internal capturing groups and just want the full match, you could filter out the zero-length matches with something like

>>> [match.group(0) for match in phoneRegex.finditer('Phone: 800.420.7240 or +1 415.863.9900 (9 a.m. to 5 p.m., M-F, PST)')]
['800.420.7240', '415.863.9900']

You could make the extension capture group match conditional on there being a match for a preceding phone number. Also, I think you need to escape the . in the third alternative, ext. As written it matches any character, but I think what you meant was ext\. (a literal period).
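The difference between the escaped and unescaped dot can be checked in isolation. This is a minimal sketch; the names loose and strict and the test string 'extX' are made up for illustration.

```python
import re

# Unescaped dot: matches 'ext' followed by ANY single character.
loose = re.compile(r'ext.')
# Escaped dot: matches 'ext' followed by a literal period only.
strict = re.compile(r'ext\.')

loose_hit = loose.fullmatch('extX') is not None
strict_hit = strict.fullmatch('extX') is not None
strict_dot = strict.fullmatch('ext.') is not None

print(loose_hit, strict_hit, strict_dot)  # True False True
```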

For reference:

Why are empty strings included in the match? Because you have used various groups in your regex. The engine captures the part of the match that you have put into a group.

The empty strings are matched from what positions of the string? From this regex: (\s*(ext|x|ext.)\s*(\d{2,5}))? It has three groups (you can count the opening parentheses). The engine does not find a match with extensions, and the 3 groups that try to capture information return empty strings.

What are the conditions for empty strings to be matched? If you group your regex in a way that the engine catches an empty substring in the string matched, it will return empty strings :-)

I think you are following the exercise from "Automate the Boring Stuff with Python". In the regex in VERBOSE mode on page 178, try to look for the opening parentheses. Where is the closing parenthesis? The number of groups is equal to the number of opening parentheses. The whole regex is group number zero.
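The group count can also be read off the compiled pattern instead of counting by hand. A small sketch, using the regex from the question unchanged and a shortened input string chosen for illustration:

```python
import re

# The regex from the question, unchanged.
phoneRegex = re.compile(r'''(
(\d{3}|\(\d{3}\))?
(\s|-|\.)?                       # separator
(\d{3})                          # first 3 digits
(\s|-|\.)                        # separator
(\d{4})                          # last 4 digits
(\s*(ext|x|ext.)\s*(\d{2,5}))?   # extension
)''', re.VERBOSE)

# Number of capturing groups; an escaped \( is a literal and does not count.
group_count = phoneRegex.groups

m = phoneRegex.search('Phone: 800.420.7240')
whole = m.group(0)       # group 0 is always the entire match
extension = m.group(7)   # the optional extension group did not participate

print(group_count)  # 9
print(whole)        # 800.420.7240
print(extension)    # None
```

The nine groups line up with the nine items in each tuple that findall returned in the question.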

Groups are useful if you want to extract certain parts of a matched string. If you only want to extract full phone numbers, leave the groups out.

You may try just this:

phoneRegex = re.compile(r'\d{3}[.\-/]\d{3}[.\-/]\d{4}')

Is this what you want to achieve?
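With no groups in the pattern at all, findall returns plain strings. The sketch below assumes the separators are a period, a hyphen, or a slash, and writes the character class with the hyphen escaped so it is read literally rather than as a range; the name simple is made up for illustration.

```python
import re

# Assumed separators: period, hyphen, or slash. The hyphen is escaped
# so it is read as a literal character, not as a range inside [...].
simple = re.compile(r'\d{3}[.\-/]\d{3}[.\-/]\d{4}')

text = 'Phone: 800.420.7240 or +1 415.863.9900 (9 a.m. to 5 p.m., M-F, PST)'
numbers = simple.findall(text)

print(numbers)  # ['800.420.7240', '415.863.9900']
```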

If you want to stick with your regex in VERBOSE mode, you could also use non-capturing groups. This only captures a full match:

phoneRegex = re.compile(r'''(
(?:\d{3}|\(\d{3}\))?                    # area code, optionally in parentheses
(?:\s|-|\.)?                            # separator
(?:\d{3})                               # first 3 digits
(?:\s|-|\.)                             # separator
(?:\d{4})                               # last 4 digits
(?:\s*(?:ext|x|ext\.)\s*(?:\d{2,5}))?   # extension
)''', re.VERBOSE)
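As a quick check of the non-capturing approach, the sketch below keeps only the outer capturing group, so findall returns one string per number. The input text is made up to include a parenthesized area code and an extension, and the name phone and the alternative order ext\.|ext|x are choices made for this illustration.

```python
import re

# Only the outer group captures; everything inside is non-capturing.
phone = re.compile(r'''(
(?:\d{3}|\(\d{3}\))?                 # area code, optionally in parentheses
(?:\s|-|\.)?                         # separator
\d{3}                                # first 3 digits
(?:\s|-|\.)                          # separator
\d{4}                                # last 4 digits
(?:\s*(?:ext\.|ext|x)\s*\d{2,5})?    # extension
)''', re.VERBOSE)

# Hypothetical sample text with and without an extension.
text = 'Call 800.420.7240 or (415) 863-9900 ext. 123'
found = phone.findall(text)

print(found)  # ['800.420.7240', '(415) 863-9900 ext. 123']
```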
