Why does my regex work on regexr.com but throw an error when run from the command line?

I need to solve two problems with a regex to locate file paths.

1) Main concern: I'm getting an error message I don't understand. 2) Before I changed something small the script would run but the regex search returned nothing.

The regex does work when tested in regexr.com and pythex.org where the matches are correctly located. It doesn't work when I run it from the command line.

Here is the regex I am targeting:

'([a-zA-Z]:\\)([a-zA-Z0-9 ]*\\)*([a-zA-Z0-9 ]*\/)*([a-zA-Z0-9 ])*(\.[a-zA-Z]*)*'

Here is the code it is used within:

import os
import re

#run script from directory the script is in - place it in the dir being processed
start_path = os.path.dirname(os.path.realpath(__file__))
metadata_path = start_path + "\Metadata"

#change directory to the metadata folder where email.txt is
try:
    os.chdir(metadata_path)
except: print ('Could not change directory. Please try again.')

with open("email.txt", 'r', encoding = 'utf-8') as file:
    all_lines = file.readlines()
    no_header = all_lines[5:] #remove the header lines from email.txt
new_lines =[]
all_files=[]
unique_files =[]
for i in range(len(no_header)):#remove square charcter
    new_lines.append(re.sub('\S\-\d+', '',no_header[i]))

for i in range(len(new_lines)):#capture all the names of files containing personal emails
    test = re.search('([a-zA-Z]:\\)([a-zA-Z0-9 ]*\\)*([a-zA-Z0-9 ]*\/)*([a-zA-Z0-9 ])*(\.[a-zA-Z]*)*',new_lines[i])
    print (test)

I am getting the error message 're.error: missing ), unterminated subpattern at position 0'

It has an even amount of parentheses which seem to match each other as far as I can see. I am guessing that this has something to do with how I have grouped things in the pattern.

As far as it returning nothing, am I missing a python specific rule that the online testers aren't catching?

Thanks!

My guess is that the pattern is missing the r (raw string) prefix, or possibly a parenthesis somewhere in the expression:

Test

import re

regex = r"([a-zA-Z]:\\)([a-zA-Z0-9 ]*\\)*([a-zA-Z0-9 ]*\/)*([a-zA-Z0-9 ])*(\.[a-zA-Z]*)*"

test_str = "a:\\a\\a/a.a"

print(re.search(regex, test_str))

The expression is explained in the top-right panel of regex101.com, if you wish to explore, simplify, or modify it; there you can also watch how it would match against some sample inputs, if you like.
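To see why the r prefix matters, here is a minimal sketch that compiles only the first group of the pattern, once as a regular string and once as a raw string. The helper name compile_error is made up for this illustration; everything else is the standard re module.

```python
import re

# Without the r prefix, Python collapses "\\" to a single backslash
# before the regex engine ever sees the pattern.
plain = '([a-zA-Z]:\\)'   # the regex engine receives ([a-zA-Z]:\)
raw = r'([a-zA-Z]:\\)'    # the regex engine receives ([a-zA-Z]:\\)


def compile_error(pattern):
    """Return the re.error message for pattern, or None if it compiles.

    Hypothetical helper, used only for this demonstration.
    """
    try:
        re.compile(pattern)
    except re.error as err:
        return str(err)
    return None


print(len(plain), len(raw))   # the raw version is one character longer
print(compile_error(plain))   # the lone backslash escapes the closing ")"
print(compile_error(raw))     # compiles fine, so this prints None
print(re.search(raw, 'C:\\Users').group(0))
```

In the regular string the single remaining backslash turns the closing parenthesis into a literal character, so the group opened at position 0 is never closed. That matches the error in the question.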

Code

import os
import re

# Run the script from the directory it lives in (place it in the directory being processed).
start_path = os.path.dirname(os.path.realpath(__file__))
metadata_path = os.path.join(start_path, "Metadata")

# Change directory to the Metadata folder, where email.txt is.
try:
    os.chdir(metadata_path)
except OSError:
    print('Could not change directory. Please try again.')

with open("email.txt", 'r', encoding='utf-8') as file:
    all_lines = file.readlines()
    no_header = all_lines[5:]  # remove the header lines from email.txt
new_lines = []
all_files = []
unique_files = []
for line in no_header:  # remove the square character
    new_lines.append(re.sub(r'\S\-\d+', '', line))

for line in new_lines:  # capture all the names of files containing personal emails
    test = re.search(r'([a-zA-Z]:\\)([a-zA-Z0-9 ]*\\)*([a-zA-Z0-9 ]*\/)*([a-zA-Z0-9 ])*(\.[a-zA-Z]*)*', line)
    print(test)
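If you want to try the pattern without email.txt on disk, the search step can be pulled into a small function. This is only a sketch: the names PATH_RE and find_paths and the sample lines are invented for illustration, and the pattern is the one from the question, unchanged.

```python
import re

# Same pattern as in the question, written as a raw string and compiled once.
PATH_RE = re.compile(
    r'([a-zA-Z]:\\)([a-zA-Z0-9 ]*\\)*([a-zA-Z0-9 ]*\/)*([a-zA-Z0-9 ])*(\.[a-zA-Z]*)*'
)


def find_paths(lines):
    """Return the first path-like match in each line, skipping lines without one."""
    found = []
    for line in lines:
        match = PATH_RE.search(line)
        if match:
            found.append(match.group(0))
    return found


# Made-up sample lines standing in for the contents of email.txt.
sample = ["Attachment: C:\\Users\\me\\report.txt", "no path here"]
print(find_paths(sample))
```

Note that the character classes only allow letters, digits, and spaces, so a path containing an underscore or hyphen would be cut off at that character.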

This is because of the \\ sequences (columns 12 and 29): in a regular (non-raw) Python string, each one is interpreted as a single \, which then escapes the following ) in your regex. The easiest way to fix this is to "double-escape" your backslashes:

'([a-zA-Z]:\\\\)([a-zA-Z0-9 ]*\\\\)*([a-zA-Z0-9 ]*\/)*([a-zA-Z0-9 ])*(\.[a-zA-Z]*)*'

It's ugly but does the job.
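As a quick check, the double-escaped form and the raw-string form produce the very same pattern text. The sketch below uses only the first two groups of the expression (to keep it short) and a made-up sample path.

```python
import re

# Four backslashes in a regular string become two in the pattern,
# which the regex engine reads as one literal backslash.
escaped = '([a-zA-Z]:\\\\)([a-zA-Z0-9 ]*\\\\)*'
raw = r'([a-zA-Z]:\\)([a-zA-Z0-9 ]*\\)*'

same = escaped == raw
match = re.search(escaped, 'D:\\Photos\\2019\\')  # hypothetical sample path

print(same)
print(match.group(0))
```

Either spelling works; the raw string is usually easier to read.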

