
python: how to interrupt a regex match

I iterate over the lines in a large number of downloaded text files and do a regex match on each line. Usually, the match takes less than a second. However, at times a match takes several minutes; sometimes the match does not finish at all and the code just hangs (I waited an hour a couple of times, then gave up). Therefore, I need to introduce some kind of timeout and tell the regex match code in some way to stop after 10 seconds or so. I can live with the fact that I will lose the data the regex was supposed to return.

I tried the following (which of course combines two different thread-based approaches in one code sample):

from threading import Thread, Timer

def timeout_handler():
    print 'timeout_handler called'

if __name__ == '__main__':
    # parse_data_files and my_args are defined elsewhere
    timer_thread = Timer(8.0, timeout_handler)
    parse_thread = Thread(target=parse_data_files, args=(my_args,))  # args must be a tuple, note the comma
    timer_thread.start()
    parse_thread.start()
    parse_thread.join(12.0)
    print 'do we ever get here ?'

but I get neither the timeout_handler called line nor the do we ever get here ? line in the output; the code is just stuck in parse_data_files.

Even worse, I can't even stop the program with CTRL-C; instead, I need to look up the Python process ID and kill that process. Some research showed that the Python developers are aware of regex C code running away: http://bugs.python.org/issue846388

I did achieve some success using signals:

from signal import signal, alarm, SIGALRM

signal(SIGALRM, timeout_handler)
alarm(8)  # deliver SIGALRM in 8 seconds
data_sets = parse_data_files(config(), data_provider)
alarm(0)  # cancel the alarm

this gets me the timeout_handler called line in the output - and I can still stop my script using CTRL-C. If I now modify the timeout_handler like this:

class TimeoutException(Exception): 
    pass 

def timeout_handler(signum, frame):
    raise TimeoutException()

and enclose the actual call to re.match(...) in a try ... except TimeoutException clause, the regex match actually does get interrupted. Unfortunately, this only works in the simple, single-threaded sandbox script I'm using to try things out. There are a few things wrong with this solution:

  • the signal triggers only once; if there is more than one problematic line, I'm stuck on the second one
  • the timer starts counting right away, not when the actual parsing starts (see the sketch after this list for a workaround to these first two points)
  • because of the GIL, I have to do all the signal setup in the main thread, and signals are only received in the main thread; this clashes with the fact that multiple files are meant to be parsed simultaneously in separate threads - there is also only one global timeout exception raised, and I don't see how to know in which thread I need to react to it
  • I've read several times now that threads and signals do not mix very well
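
Re-arming the alarm around each individual match would at least address the first two points. A minimal single-threaded sketch of that idea (the regex and the timeout value are placeholders):

import re
from signal import signal, alarm, SIGALRM

class TimeoutException(Exception):
    pass

def timeout_handler(signum, frame):
    raise TimeoutException()

signal(SIGALRM, timeout_handler)  # must happen in the main thread

pattern = re.compile(r'...')      # placeholder for the real regex

def match_with_timeout(line, seconds=10):
    alarm(seconds)                 # (re)arm the alarm for this line only
    try:
        return pattern.match(line)
    except TimeoutException:
        return None                # accept losing this line's data
    finally:
        alarm(0)                   # disarm so a pending alarm cannot fire later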

I have also considered doing the regex match in a separate process, but before I get into that, I thought I'd better check here if anyone has come across this problem before and could give me some hints on how to solve it.

Update

the regex looks like this (well, one of them anyway; the problem occurs with other regexes, too - this is the simplest one):

'^(\\d{5}), .+?, (\\d{8}), (\\d{4}), .+?, .+?,' + 37 * ' (.*?),' + ' (.*?)$'

sample data:

95756, "KURN ", 20110311, 2130, -34.00, 151.21, 260, 06.0, -9999.0, -9999.0, -9999.0, -9999.0, -9999.0, -9999, -9999, 07.0, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -

As said, the regex usually performs OK - I can parse several hundred files with several hundred lines each in less than a minute. That's when the files are complete, though - the code seems to hang on files that have incomplete lines, such as:

95142, "YMGD ", 20110311, 1700, -12.06, 134.23, 310, 05.0, 25.8, 23.7, 1004.7, 20.6, 0.0, -9999, -9999, 07.0, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999

I also get cases where the regex returns right away and reports a non-match.

Update 2

I have only quickly read through the article on catastrophic backtracking, but as far as I can tell so far, that's not the cause - I do not nest any repetition operators.

I'm on Mac OS X, so I can't use RegexBuddy to analyze my regex. I tried RegExhibit (which apparently uses a Perl regex engine internally) - and that runs away, too.

You are running into catastrophic backtracking - not because of nested quantifiers, but because your quantified subexpressions can also match the separators, and since there are a lot of them, you get exponential time in certain cases.

Aside from the fact that it looks more like a job for a CSV parser, try the following:

r'^(\d{5}), [^,]+, (\d{8}), (\d{4}), [^,]+, [^,]+,' + 37 * r' ([^,]+),' + r' ([^,]+)$'

By explicitly disallowing the comma to match between separators, you'll speed up the regex enormously.
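
A minimal sketch to illustrate the difference (the complete line here is constructed synthetically, with 38 data fields after the six header fields, which is the field count the pattern implies):

import re

fast = re.compile(r'^(\d{5}), [^,]+, (\d{8}), (\d{4}), [^,]+, [^,]+,'
                  + 37 * r' ([^,]+),' + r' ([^,]+)$')

# synthetic complete line: six header fields plus 38 data fields
complete = '95756, "KURN ", 20110311, 2130, -34.00, 151.21, ' \
           + ', '.join(['-9999'] * 38)
truncated = complete.rsplit(', ', 5)[0]    # chop off the last five fields

print(fast.match(complete) is not None)    # True  - matches immediately
print(fast.match(truncated) is not None)   # False - fails fast, no blow-up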

If commas may be present inside quoted strings, for example, then just exchange [^,]+ (in places where you'd expect this) with

(?:"[^"]*"|[^,]+)

To illustrate:

Using your regex against the first example, RegexBuddy reports a successful match after 793 steps of the regex engine. For the second (incomplete line) example, it reports a match failure after 1,000,000 steps of the regex engine (this is where RegexBuddy gives up; Python will keep on churning).

Using my regex, the successful match happens in 173 steps, the failure in 174.

Instead of trying to solve the regex hangup with timeouts, it might be worthwhile to consider a completely different kind of approach. If your data really is just comma-separated values, you should get much better performance with the csv module, or just by using line.split(",").
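
A minimal sketch of that idea (the file name and the 44-field count are assumptions based on the pattern in the question):

import csv

with open('data.txt') as f:                      # placeholder file name
    for fields in csv.reader(f, skipinitialspace=True):
        if len(fields) != 44:                    # incomplete line - just skip it
            continue
        station, name, date, time = fields[:4]   # e.g. '95756', 'KURN ', ...
        # ... process the remaining fields ...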

You can't do it with threads. Go ahead with your idea of doing the match in a separate process.
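
A minimal sketch of the separate-process idea using the multiprocessing module (the regex, file name and timeout are placeholders); terminating the pool is the crude but effective way to kill a runaway match:

import re
from multiprocessing import Pool, TimeoutError

PATTERN = re.compile(r'...')               # placeholder for the real regex

def match_line(line):
    # runs in the worker process, where a runaway match cannot block the caller
    m = PATTERN.match(line)
    return m.groups() if m else None

if __name__ == '__main__':
    pool = Pool(processes=1)
    for line in open('data.txt'):          # placeholder file name
        job = pool.apply_async(match_line, (line,))
        try:
            groups = job.get(timeout=10)   # raises TimeoutError after 10 s
        except TimeoutError:
            pool.terminate()               # kill the stuck worker outright
            pool.join()
            pool = Pool(processes=1)       # start a fresh worker for the next line
            continue
        # ... use groups ...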

Threading in Python is a weird beast. The Global Interpreter Lock is essentially one big lock around the interpreter, which means only one thread at a time gets to execute within the interpreter.

Thread scheduling is delegated to the OS. Python essentially signals the OS that another thread may take the lock after a certain number of 'instructions'. So if Python is busy inside a runaway regular expression, it never gets the chance to signal the OS that another thread may try to take the lock. Hence the use of signals; they are the only way to interrupt.

I'm with Nosklo: go ahead and use separate processes. Or try to rewrite the regular expression so that it doesn't run away; see the problems associated with backtracking. This may or may not be the cause of the poor regex performance, and changing your regex may not be possible. But if this is the cause and it can be changed, you'll save yourself a whole lot of headache by avoiding multiple processes.
