
How can I correct this regex phone number extractor in Python?

The results I'm getting when I run this after copying a set of UK phone numbers to the clipboard are coming out in a very bizarre way. (I have imported both modules, before you ask.)

phoneRegex = re.compile(r'''(
    (\d{5}|\(\d{5}\))?       #area code
    (\s|-|\.)?               #separator
    (\d{6})                  #main 6 digits
    )''', re.VERBOSE)

emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+        #username
    @                        #obligatory @ symbol
    [a-zA-Z0-9.-]+           #domain name
    (\.[a-zA-Z]{2,5})        #dot-something
    )''', re.VERBOSE)

text = str(pyperclip.paste())

matches = []
for groups in phoneRegex.findall(text):
    phoneNum = '-'.join([groups[1], groups[3]])
    if groups[3] != '':
        phoneNum += ' ' + groups[3]
    matches.append(phoneNum)
for groups in emailRegex.findall(text):
    matches.append(groups[0])
if len(matches) > 0:
    pyperclip.copy('\n'.join(matches))
    print('Copied to clipboard:')
    print('\n'.join(matches))
else:
    print('No phone numbers or email addresses found')

The mistake is somewhere here:

for groups in phoneRegex.findall(text):
    phoneNum = '-'.join([groups[1], groups[3]])
    if groups[3] != '':
        phoneNum += ' ' + groups[3]
    matches.append(phoneNum)

The numbers copied to clipboard:
07338 754433
01265768899
(01283)657899

Expected results:
Copied to clipboard:
07338 754433
01265 768899
01283 657899

Returned results:
Copied to clipboard:
07338-754433 754433
-012657 012657
(01283)-657899 657899

I see three issues:

  1. The Python code joins the two parts of the phone number with a - and then appends a space and groups[3] a second time:

      phoneNum = '-'.join([groups[1], groups[3]])
      if groups[3] != '':
          phoneNum += ' ' + groups[3]

    Since groups[3] is never empty, what you need to do is:

      if groups[1] != '':
          phoneNum = ' '.join([groups[1], groups[3]])
      else:
          phoneNum = groups[3]
  2. Your phoneRegex regular expression is not anchored to the beginning and end of each line. You need to (a) compile it with the re.MULTILINE option and (b) anchor the regular expression between ^ and $:

     phoneRegex = re.compile(r'''^(
         (\d{5}|\(\d{5}\))?       #area code
         (\s|-|\.)?               #separator
         (\d{6})                  #main 6 digits
         )$''', re.VERBOSE | re.MULTILINE)

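    This prevents a long run of digits with no separator from being matched as just the six-digit main number (group 3) with leftover digits after it. In your sample, the unanchored pattern lets the optional separator \s match the newline in front of 01265768899, so the match starts there with an empty area code and 012657 as the main number.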

  3. Your match for the area code includes the literal parentheses in the captured group. To fix this, either change the regular expression so the parentheses are not part of the group, or change your code to strip the parentheses when they are present.

    • Keep the parentheses out of the captured group in the regular expression:

        (?:(\d{5})|\((\d{5})\))?   #area code

      The (?:...) is a non-capturing form of parentheses, so it won't be returned by findall. Within that, you have two alternatives: 5 digits in a group - (\d{5}) - or literal parentheses that enclose 5 digits in a group - \((\d{5})\).

      However, this change also affects your phone number recombination logic, because your area code is now in either groups[1] or groups[2], and your main number is now in groups[4].

        if groups[1] != '':
            phoneNum = ' '.join([groups[1], groups[4]])
        elif groups[2] != '':
            phoneNum = ' '.join([groups[2], groups[4]])
        else:
            phoneNum = groups[4]
      • This could be made easier by changing the outer set of parentheses and the parentheses around the separator into non-capturing parentheses. You could then do a single join on a filtered result of the groups:

         phoneRegex = re.compile(r'''(?:
             (?:(\d{5})|\((\d{5})\))?   #area code
             (?:\s|-|\.)?               #separator
             (\d{6})                    #main 6 digits
             )''', re.VERBOSE)
         # ...
         phoneNum = ' '.join([group for group in groups if group != ''])

        The modified phoneRegex ensures that the returned groups contain only an optional area code in groups[0] or groups[1], followed by the main number in groups[2], with no extraneous groups returned. The code then filters out any empty groups and joins the rest with a space.

    • Strip parentheses in code:

        if groups[1] != '':
            phoneNum = ' '.join([groups[1].lstrip('(').rstrip(')'), groups[3]])
        else:
            phoneNum = groups[3]
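
Putting the three fixes together, a minimal sketch could look like the following. It combines the anchored, multiline pattern with the non-capturing variant and the filtered join. The function name extract_phone_numbers and the sample string are illustrative assumptions, not part of the original script, and the clipboard handling with pyperclip is left out so the snippet runs on its own. It also assumes one number per line with plain \n line endings.

```python
import re

# Anchored per line (re.MULTILINE); only the area code digits and the
# main number are captured, everything else is non-capturing.
PHONE_REGEX = re.compile(r'''^(?:
    (?:(\d{5})|\((\d{5})\))?   # area code, bare or wrapped in brackets
    (?:\s|-|\.)?               # optional separator
    (\d{6})                    # main 6 digits
    )$''', re.VERBOSE | re.MULTILINE)


def extract_phone_numbers(text):
    """Hypothetical helper: return numbers as 'area main', or 'main' alone."""
    numbers = []
    for groups in PHONE_REGEX.findall(text):
        # At most one of the two area-code groups is non-empty,
        # so dropping the empty strings leaves one or two parts to join.
        numbers.append(' '.join(group for group in groups if group != ''))
    return numbers


# Assumed sample input, mirroring the numbers from the question.
sample = '07338 754433\n01265768899\n(01283)657899'
print('\n'.join(extract_phone_numbers(sample)))
```

In the original script, the returned list would replace the phone number loop, with text = str(pyperclip.paste()) passed in as the argument. If the clipboard text uses \r\n line endings, the $ anchor would not match before the \r, so the text would likely need normalising first.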
