
python: how to interrupt a regex match

I iterate over the lines in a large number of downloaded text files and run a regex match on each line. Usually, a match takes less than a second. At times, however, a match takes several minutes, and sometimes the match does not finish at all and the code just hangs (I waited an hour a couple of times, then gave up). I therefore need to introduce some kind of timeout and tell the regex match code in some way to stop after 10 seconds or so. I can live with losing the data the regex was supposed to return.

I tried the following (which, of course, already shows two different thread-based solutions in one code sample):

from threading import Thread, Timer

def timeout_handler():
    print 'timeout_handler called'

if __name__ == '__main__':
    timer_thread = Timer(8.0, timeout_handler)
    # note: args must be a tuple - (my_args) is not one, (my_args,) is
    parse_thread = Thread(target=parse_data_files, args=(my_args,))
    timer_thread.start()
    parse_thread.start()
    parse_thread.join(12.0)
    print 'do we ever get here ?'

but I get neither the timeout_handler called line nor the do we ever get here ? line in the output; the code is simply stuck in parse_data_files.

Even worse, I can't even stop the program with CTRL-C; instead, I need to look up the Python process number and kill that process. Some research showed that the Python developers are aware of regex C code running away: http://bugs.python.org/issue846388

I did achieve some success using signals:

from signal import signal, alarm, SIGALRM

signal(SIGALRM, timeout_handler)
alarm(8)
data_sets = parse_data_files(config(), data_provider)
alarm(0)

This gets me the timeout_handler called line in the output, and I can still stop my script with CTRL-C. If I now modify the timeout_handler like this:

class TimeoutException(Exception): 
    pass 

def timeout_handler(signum, frame):
    raise TimeoutException()

and enclose the actual call to re.match(...) in a try ... except TimeoutException clause, the regex match actually does get interrupted. Unfortunately, this only works in the simple, single-threaded sandbox script I'm using to try things out. There are a few things wrong with this solution:

  • the signal triggers only once; if there is more than one problematic line, I'm stuck on the second one
  • the timer starts counting right away, not when the actual parsing starts
  • because of the GIL, I have to do all the signal setup in the main thread, and signals are only received in the main thread; this clashes with the fact that multiple files are meant to be parsed simultaneously in separate threads - also, only one global timeout exception is raised, and I don't see how to know in which thread I need to react to it
  • I have read several times now that threads and signals do not mix very well
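
For reference, the signal fragments above can be pulled together into one single-threaded sketch (match_with_alarm is a name invented here; signal.alarm is Unix-only, and whether the alarm can actually break into a single long-running re.match call depends on the interpreter - it worked in the question's sandbox):

```python
import re
import signal

class TimeoutException(Exception):
    pass

def timeout_handler(signum, frame):
    raise TimeoutException()

def match_with_alarm(pattern, line, seconds=8):
    # Main thread only: SIGALRM fires after `seconds` and the handler
    # raises TimeoutException out of the match call
    old_handler = signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(seconds)
    try:
        return re.match(pattern, line)
    except TimeoutException:
        return None  # accept losing this line's data, as stated above
    finally:
        signal.alarm(0)                             # cancel a pending alarm
        signal.signal(signal.SIGALRM, old_handler)  # restore old handler
```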

I have also considered doing the regex match in a separate process, but before I get into that, I thought I'd better check here whether anyone has come across this problem before and could give me some hints on how to solve it.
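
As a sketch of that separate-process idea (match_line and match_with_timeout are names invented here), the multiprocessing module lets you kill a worker stuck in a runaway match - something threads cannot do:

```python
import multiprocessing
import re

def match_line(pattern, line):
    m = re.match(pattern, line)
    return m.groups() if m else None

def match_with_timeout(pattern, line, timeout=10.0):
    # One worker per call keeps the sketch simple; a real parser would
    # reuse the pool. The 'fork' start method is assumed, so Unix-only.
    pool = multiprocessing.get_context('fork').Pool(processes=1)
    result = pool.apply_async(match_line, (pattern, line))
    try:
        return result.get(timeout=timeout)
    except multiprocessing.TimeoutError:
        return None  # the data for this line is lost, as accepted above
    finally:
        pool.terminate()  # kills a stuck worker; harmless after success
        pool.join()
```

On timeout, the worker process is terminated outright, so a catastrophic match cannot pin the interpreter the way it does in a thread.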

Update

The regex looks like this (well, one of them anyway; the problem occurs with other regexes, too - this is the simplest one):

'^(\\d{5}), .+?, (\\d{8}), (\\d{4}), .+?, .+?,' + 37 * ' (.*?),' + ' (.*?)$'

sample data:

95756, "KURN ", 20110311, 2130, -34.00, 151.21, 260, 06.0, -9999.0, -9999.0, -9999.0, -9999.0, -9999.0, -9999, -9999, 07.0, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -

As said, the regex usually performs fine - I can parse several hundred files with several hundred lines each in less than a minute. That's when the files are complete, though - the code seems to hang on files with incomplete lines, such as:

95142, "YMGD ", 20110311, 1700, -12.06, 134.23, 310, 05.0, 25.8, 23.7, 1004.7, 20.6, 0.0, -9999, -9999, 07.0, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999

I also get cases where the regex seems to return right away and reports a non-match.

Update 2

I have only quickly read through the article on catastrophic backtracking, but as far as I can tell so far, that's not the cause - I do not nest any repetition operators.

I'm on Mac OS X, so I can't use RegexBuddy to analyze my regex. I tried RegExhibit (which apparently uses a Perl regex engine internally) - and that runs away, too.

You are running into catastrophic backtracking - not because of nested quantifiers, but because your quantified characters can also match the separators, and since there are a lot of them, you get exponential time in certain cases.

Aside from the fact that this looks more like a job for a CSV parser, try the following:

r'^(\d{5}), [^,]+, (\d{8}), (\d{4}), [^,]+, [^,]+,' + 37 * r' ([^,]+),' + r' ([^,]+)$'

By explicitly disallowing the comma to match between separators, you speed up the regex enormously.
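
To sketch the difference, the improved pattern can be exercised against synthetic lines (the field values below are made up, using the same 6 + 37 + 1 field layout as the question's format):

```python
import re

# Anchored pattern in which no field can spill across a comma
pattern = (r'^(\d{5}), [^,]+, (\d{8}), (\d{4}), [^,]+, [^,]+,'
           + 37 * r' ([^,]+),' + r' ([^,]+)$')

# A complete record: 6 leading fields plus 38 data fields = 44 total
complete = ', '.join(
    ['95756', '"KURN "', '20110311', '2130', '-34.00', '151.21']
    + ['-9999'] * 38)
# A truncated record, like the lines in the incomplete files
truncated = ', '.join(complete.split(', ')[:30])

assert re.match(pattern, complete) is not None  # matches
assert re.match(pattern, truncated) is None     # fails fast, no blow-up
```

Because each [^,]+ stops dead at the next comma, a failed match is rejected in roughly linear time instead of exploring an exponential number of ways to distribute the fields.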

If commas may be present inside quoted strings, for example, then just exchange [^,]+ (in the places where you'd expect this) with

(?:"[^"]*"|[^,]+)
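
A quick check of that alternation, on a made-up three-field line:

```python
import re

# A field is either a quoted string (commas allowed inside the quotes)
# or a run of characters containing no comma
field = r'(?:"[^"]*"|[^,]+)'
pattern = r'^{0}, {0}, {0}$'.format(field)

assert re.match(pattern, '95756, "KURN, detached", 20110311')
assert re.match(pattern, '95756, plain, 20110311')
assert re.match(pattern, '95756, "KURN, a", "b, c"')
```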

To illustrate:

Using your regex against the first example, RegexBuddy reports a successful match after 793 steps of the regex engine. For the second (incomplete-line) example, it reports a match failure after 1,000,000 steps of the regex engine (that is where RegexBuddy gives up; Python will keep on churning).

Using my regex, the successful match happens in 173 steps, the failure in 174.

Instead of trying to solve the regex hang-up issue with timeouts, it may be worthwhile to consider a completely different kind of approach. If your data really is just comma-separated values, you should get much better performance with the csv module, or by just using line.split(",").
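
A minimal sketch of the csv route (the sample line is shortened from the question's data):

```python
import csv
import io

line = '95142, "YMGD ", 20110311, 1700, -12.06, 134.23, 310, 05.0'
# skipinitialspace drops the blank after each comma, so quoted fields
# such as "YMGD " are still recognised and their quotes stripped
row = next(csv.reader(io.StringIO(line), skipinitialspace=True))
```

The split fields can then be validated individually, with small anchored regexes or plain int()/float() conversions, none of which can backtrack across the whole line; a truncated line simply yields fewer fields.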

You can't do it with threads. Go ahead with your idea of doing the match in a separate process.

Threading in Python is a weird beast. The Global Interpreter Lock is essentially one big lock around the interpreter, which means that only one thread at a time gets to execute within the interpreter.

Thread scheduling is delegated to the OS. Python essentially signals the OS that another thread may take the lock after a certain number of 'instructions'. So if Python is busy with a runaway regular expression, it never gets the chance to signal the OS that another thread may try to take the lock. Hence the reason for using signals: they are the only way to interrupt.

I'm with Nosklo: go ahead and use separate processes. Or try to rewrite the regular expression so that it doesn't run away - see the problems associated with backtracking. This may or may not be the cause of the poor regex performance, and changing your regex may not be possible. But if this is the cause and it can be changed, you'll save yourself a whole lot of headache by avoiding multiple processes.
