Beginners Python: Regex & Phone Numbers

I'm working my way through a beginner's Python book, and there are two fairly simple things I don't understand. I was hoping someone here might be able to help.

The example in the book uses regular expressions to take email addresses and phone numbers from the clipboard and output them to the console. The code looks like this:

#! python3
# phoneAndEmail.py - Finds phone numbers and email addresses on the clipboard.

import pyperclip, re

# Create phone regex.
phoneRegex = re.compile(r'''(
(\d{3}|\(\d{3}\))?              #[1] area code
(\s|-|\.)?                      #[2] separator
(\d{3})                         #[3] first 3 digits
(\s|-|\.)                       #[4] separator
(\d{4})                         #[5] last 4 digits
(\s*(ext|x|ext.)\s*(\d{2,5}))?  #[6] extension
)''', re.VERBOSE)

# Create email regex.
emailRegex = re.compile(r'''(
[a-zA-Z0-9._%+-]+   
@                   
[a-zA-Z0-9.-]+
(\.[a-zA-Z]{2,4})   
)''', re.VERBOSE)

# Find matches in clipboard text.
text = str(pyperclip.paste())           
matches = []                             

for groups in phoneRegex.findall(text):  
    phoneNum = '-'.join([groups[1], groups[3], groups[5]])
    if groups[8] != '':
        phoneNum += ' x' + groups[8]
    matches.append(phoneNum)

for groups in emailRegex.findall(text):
    matches.append(groups[0])           

# Copy results to the clipboard.
if len(matches) > 0:                    
    pyperclip.copy('\n'.join(matches))
    print('Copied to Clipboard:')
    print('\n'.join(matches))
else:
    print('No phone numbers or email addresses found')

Okay, so firstly, I don't really understand the phoneRegex object. The book mentions that adding parentheses will create groups in the regular expression.

If that's the case, are my assumed index values in the comments wrong, and should there really be two groups in the index marked one? Or if they're correct, what does groups[7,8] refer to in the matching loop below for phone numbers?

Secondly, why does the emailRegex use a mixture of lists and tuples, while the phoneRegex uses mainly tuples?

Edit 1

Thanks for the answers so far, they've been helpful. I'm still kind of confused about the first part, though. Should there be eight indexes, as in rock321987's answer, or nine, as in sweaver2112's?

Edit 2

Answered, thank you.

Every opening parenthesis ( marks the beginning of a capture group, and you can nest them:

(                               #[1] around whole pattern
(\d{3}|\(\d{3}\))?              #[2] area code
(\s|-|\.)?                      #[3] separator
(\d{3})                         #[4] first 3 digits
(\s|-|\.)                       #[5] separator
(\d{4})                         #[6] last 4 digits
(\s*(ext|x|ext.)\s*(\d{2,5}))?  #[7,8,9] extension
)
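
To make the numbering concrete, here is a minimal sketch that compiles the same pattern and inspects what findall returns. The sample string and the snake_case variable names are made up for illustration. The point to notice is that group numbers start at 1, while the tuple that findall returns is indexed from 0, so groups[8] in the book's loop is group number 9.

```python
import re

# Same pattern as in the question, with the comments renumbered.
phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?              # group 2: area code
    (\s|-|\.)?                      # group 3: separator
    (\d{3})                         # group 4: first 3 digits
    (\s|-|\.)                       # group 5: separator
    (\d{4})                         # group 6: last 4 digits
    (\s*(ext|x|ext.)\s*(\d{2,5}))?  # groups 7, 8, 9: extension
)''', re.VERBOSE)                   # group 1 is the outermost pair

sample = '415-555-1234 ext 42'      # made-up sample input

print(phone_regex.groups)           # number of capture groups: 9

# findall returns one tuple per match, with one item per group.
groups = phone_regex.findall(sample)[0]
for index, value in enumerate(groups):
    # Tuple index 0 holds group 1, index 1 holds group 2, and so on.
    print(f'groups[{index}] = group {index + 1} = {value!r}')

# The book's loop picks tuple indexes 1, 3, 5 and 8:
# area code, first 3 digits, last 4 digits, extension digits.
phone_num = '-'.join([groups[1], groups[3], groups[5]])
if groups[8] != '':
    phone_num += ' x' + groups[8]
print(phone_num)                    # 415-555-1234 x42
```

So there are nine groups, and the tuple has nine items indexed 0 through 8.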

You should use named groups here, written (?P<groupname>pattern) in Python, along with clustering-only (non-capturing) parentheses (?:pattern) that don't capture anything. And remember, you should capture quantified constructs, not quantify captured constructs:

(?P<areacode>(?:\d{3}|\(\d{3}\))?)
(?P<separator>(?:\s|-|\.)?)
(?P<exchange>\d{3})
(?P<separator2>\s|-|\.)
(?P<lastfour>\d{4})
(?P<extension>(?:\s*(?:ext|x|ext.)\s*(?:\d{2,5}))?)

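
As a rough sketch of how the named-group version might be used from Python: the re module spells a named group (?P<name>pattern), so that is the syntax used below. The sample string and variable names are assumptions for illustration only.

```python
import re

# Named groups use Python's (?P<name>...) spelling; (?:...) groups without capturing.
named_phone_regex = re.compile(r'''
    (?P<areacode>(?:\d{3}|\(\d{3}\))?)
    (?P<separator>(?:\s|-|\.)?)
    (?P<exchange>\d{3})
    (?P<separator2>\s|-|\.)
    (?P<lastfour>\d{4})
    (?P<extension>(?:\s*(?:ext|x|ext.)\s*(?:\d{2,5}))?)
''', re.VERBOSE)

match = named_phone_regex.search('(415) 555-1234 x99')  # made-up sample input

# Groups are looked up by name, so there is no index counting to get wrong.
print(match.group('areacode'))   # (415)
print(match.group('exchange'))   # 555
print(match.group('lastfour'))   # 1234
print(match.groupdict())         # all named groups as a dict
```

Because the optional parts are quantified inside the capture, a missing area code or extension comes back as an empty string rather than None.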
(                               #[1] around whole pattern
(\d{3}|\(\d{3}\))?              #[2] area code
(\s|-|\.)?                      #[3] separator
(\d{3})                         #[4] first 3 digits
(\s|-|\.)                       #[5] separator
(\d{4})                         #[6] last 4 digits
(\s*(ext|x|ext.)\s*(\d{2,5}))?  #[7] extension
    <---------->   <------->
      ^^               ^^
      ||               ||
      [8]              [9]
)

Second Question 第二个问题

You have misunderstood this: you are mixing up Python syntax with regex syntax. In regex:

[] is a character class (not a list)

() is a capturing group (not a tuple)

So whatever is inside them has nothing to do with lists and tuples in Python. Regex can be considered a language of its own, and (), [], etc. are part of its syntax.
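
A small illustrative sketch of the difference (the sample strings are made up): a character class matches a single character out of a set, while a capturing group matches the whole sequence inside it and remembers it.

```python
import re

# [abc] is a character class: it matches ONE character that is a, b or c.
class_matches = re.findall(r'[abc]', 'cab')
print(class_matches)   # ['c', 'a', 'b']

# (abc) is a capturing group: it matches the exact sequence "abc" and captures it.
group_matches = re.findall(r'(abc)', 'cab abc')
print(group_matches)   # ['abc']
```

The brackets and parentheses here live inside the pattern string; the Python lists printed above are just how findall reports its results.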

For the first part of your question, see sweaver2112's answer.

For the second part, both regexes use square brackets and parentheses. In regex, \d is shorthand for [0-9] (for ASCII digits); it's just easier to write. In the same vein, \w is shorthand for [a-zA-Z0-9_], but it would not match the . and - characters that the pattern needs, so it is simpler to spell the class out as [a-zA-Z0-9.-].
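
A quick check of how the shorthand classes compare with the explicit ones, as a sketch on a made-up ASCII sample string (in Python 3, \d and \w also match non-ASCII digits and letters unless re.ASCII is passed):

```python
import re

sample = 'a_1-b.2'  # made-up ASCII sample

# \d and [0-9] find the same characters in ASCII text.
digits_shorthand = re.findall(r'\d', sample)
digits_explicit = re.findall(r'[0-9]', sample)
print(digits_shorthand, digits_explicit)   # ['1', '2'] ['1', '2']

# \w covers letters, digits and underscore, but not '.' or '-'.
word_chars = re.findall(r'\w', sample)
print(word_chars)                          # ['a', '_', '1', 'b', '2']

# The explicit class from the email regex includes '.' and '-' but not '_'.
email_class = re.findall(r'[a-zA-Z0-9.-]', sample)
print(email_class)                         # ['a', '1', '-', 'b', '.', '2']
```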
