
How can I correct this regex phone number extractor in Python?

When I run this after copying a set of UK phone numbers to the clipboard, the results come out in a very bizarre way. (I have imported both modules, re and pyperclip, before you ask.)

phoneRegex = re.compile(r'''(
    (\d{5}|\(\d{5}\))?       #area code
    (\s|-|\.)?               #separator
    (\d{6})                  #main 6 digits
    )''', re.VERBOSE)

emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+        #username
    @                        #obligatory @ symbol
    [a-zA-Z0-9.-]+           #domain name
    (\.[a-zA-Z]{2,5})        #dot-something
    )''', re.VERBOSE)

text = str(pyperclip.paste())

matches = []
for groups in phoneRegex.findall(text):
    phoneNum = '-'.join([groups[1], groups[3]])
    if groups[3] != '':
        phoneNum += ' ' + groups[3]
    matches.append(phoneNum)
for groups in emailRegex.findall(text):
    matches.append(groups[0])
if len(matches) > 0:
    pyperclip.copy('\n'.join(matches))
    print('Copied to clipboard:')
    print('\n'.join(matches))
else:
    print('No phone numbers or email addresses found')

The mistake is somewhere here:

for groups in phoneRegex.findall(text):
    phoneNum = '-'.join([groups[1], groups[3]])
    if groups[3] != '':
        phoneNum += ' ' + groups[3]
    matches.append(phoneNum)

The numbers copied to clipboard:
07338 754433
01265768899
(01283)657899

Expected results:
Copied to clipboard:
07338 754433
01265 768899
01283 657899

Actual results:
Copied to clipboard:
07338-754433 754433
-012657 012657
(01283)-657899 657899

I see three issues:

  1. The Python code joins the two parts of the phone number with a - and then appends a space and the main number a second time:

      phoneNum = '-'.join([groups[1], groups[3]])
      if groups[3] != '':
          phoneNum += ' ' + groups[3]

    Since groups[3] is never blank (the six main digits are mandatory in the regex), what you need is:

      if groups[1] != '':
          phoneNum = ' '.join([groups[1], groups[3]])
      else:
          phoneNum = groups[3]
  2. Your phoneRegex regular expression is not anchored to the beginning and end of each line. You need to (a) compile it with the re.MULTILINE option and (b) anchor the regular expression between ^ and $:

      phoneRegex = re.compile(r'''^(
          (\d{5}|\(\d{5}\))?       #area code
          (\s|-|\.)?               #separator
          (\d{6})                  #main 6 digits
          )$''', re.VERBOSE | re.MULTILINE)

    This prevents a long run of digits with no separator from being matched as just group 3 followed by leftover digits.

  3. Your area-code group captures the literal parentheses along with the digits. To fix this, either change the regular expression so that the parentheses fall outside the capturing group, or change your code to strip the parentheses when they are present.

    • Avoid parentheses in the regular expression:

        (?:(\d{5})|\((\d{5})\))? #area code

      The (?:...) is a non-capturing form of parentheses, so it won't be returned by findall. Within that, you have two alternatives: 5 digits in a group, (\d{5}), or literal parentheses that enclose 5 digits in a group, \((\d{5})\).

      However, this change also affects your phone number recombination logic, because your area code is now in either groups[1] or groups[2], and your main number is now in groups[4].

        if groups[1] != '':
            phoneNum = ' '.join([groups[1], groups[4]])
        elif groups[2] != '':
            phoneNum = ' '.join([groups[2], groups[4]])
        else:
            phoneNum = groups[4]
      • This could be made easier by changing the outer set of parentheses and the parentheses around the separator into non-capturing parentheses. You could then do a single join on a filtered result of the groups:

         phoneRegex = re.compile(r'''(?:
             (?:(\d{5})|\((\d{5})\))?   #area code
             (?:\s|-|\.)?               #separator
             (\d{6})                    #main 6 digits
             )''', re.VERBOSE)
         # ...
         phoneNum = ' '.join([group for group in groups if group != ''])

        The modified phoneRegex ensures that the returned groups contain only an optional area code in groups[0] or groups[1], followed by the main number in groups[2], with no extraneous matches returned. The code then filters out any groups that are empty and returns the rest of the groups joined by a space.

    • Strip parentheses in code:

        if groups[1] != '':
            phoneNum = ' '.join([groups[1].strip('()'), groups[3]])
        else:
            phoneNum = groups[3]
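
Putting the three fixes together, a minimal sketch might look like the following. It combines the anchored pattern from point 2 with the non-capturing variant from point 3 and the single filtered join. The names extract_phone_numbers and sample are illustrative only and not part of the original script, and a literal string stands in for pyperclip.paste() so the snippet runs without a clipboard.

```python
import re

# The outer group and the separator are non-capturing, so findall() returns
# only (bare_area_code, parenthesised_area_code, main_number) for each match.
# ^ and $ with re.MULTILINE require each number to fill a whole line.
phone_regex = re.compile(r'''^(?:
    (?:(\d{5})|\((\d{5})\))?   # area code, with or without parentheses
    (?:\s|-|\.)?               # optional separator
    (\d{6})                    # main 6 digits
    )$''', re.VERBOSE | re.MULTILINE)


def extract_phone_numbers(text):
    """Return each match as 'area main', or just 'main' when no area code is present.

    Hypothetical helper for illustration; it is not from the original post.
    """
    numbers = []
    for groups in phone_regex.findall(text):
        # Drop whichever groups did not participate, then join the rest.
        numbers.append(' '.join(group for group in groups if group != ''))
    return numbers


# Stand-in for str(pyperclip.paste()): the three numbers from the question.
sample = "07338 754433\n01265768899\n(01283)657899"
print('\n'.join(extract_phone_numbers(sample)))
# 07338 754433
# 01265 768899
# 01283 657899
```

Because the pattern is anchored, this sketch assumes each phone number sits on a line of its own, as in the sample above. Numbers embedded in running text would need the anchors replaced by something like word boundaries.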
